Thursday 22 July 2021

Processing large XML files: only the attributes of the root's children are relevant

I'm new to XML and Python, and I hope I have phrased my problem correctly:

I have XML files, each one gigabyte in size. The files look like this:

<test name="LongTestname" result="PASS">
    <step ID="0" step="NameOfStep1" result="PASS">
        Stuff I don't care about
    </step>
    <step ID="1" step="NameOfStep2" result="PASS">
        Stuff I don't care about
    </step>
</test>

For fast analysis, I want to get the name and the result of each step, i.e. of the direct children of the root element. "Stuff I don't care about" stands for lots of nested elements.

I have already tried the following:

import xml.etree.ElementTree as ET

# ET.parse builds the complete tree in memory before returning
tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)

Here I get a memory error because the files are too big.

Then I tried:

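stepAndResult = []
for event, elem in ET.iterparse(pathToSteps, events=("start", "end")):
    # attributes are already available at the "start" event of an element
    if elem.tag == "step" and event == "start":
        stepAndResult.append([elem.attrib['step'], elem.attrib['result'], "System1"])
    # drop the element's contents to keep memory usage low
    elem.clear()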

This works but is really slow. I guess it iterates over every element in the file, including all the nested ones, and this takes a very long time.

Then I found a solution looking like this:

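try:
    tree = ET.iterparse(pathToSteps, events=("start", "end"))
    _, root = next(tree)  # the first event is the start of the root element
    print('ROOT:', root.tag)
except (OSError, ET.ParseError):
    print("ERROR: Unable to open and parse file !!!")
    raise  # without a root element the loop below cannot run

for child in root:
    print(child.attrib)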

But this prints only the attributes of the first step.

Is there a way to speed up the working solution? Since I'm pretty new to this, I would appreciate a complete example, or a reference with an example from which I can figure it out myself.


