Hemant Vishwakarma: Processing large xml files. Only root tree children attributes are relevant

Thursday 22 July 2021

I'm new to XML and Python, and I hope I have phrased my problem correctly:

I have XML files that are about one gigabyte each. The files look like this:

<test name="LongTestname" result="PASS">
    <step ID="0" step="NameOfStep1" result="PASS">
        Stuff I don't care about
    </step>
    <step ID="1" step="NameOfStep2" result="PASS">
        Stuff I don't care about
    </step>
</test>

For a fast analysis, I want to get only the name and the result of each step, that is, of the direct children of the root element. The content inside each step, which I don't care about, consists of many nested elements.

I have already tried the following:

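import xml.etree.ElementTree as ET

# Parse the whole file into memory, then walk the root's direct children.
tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)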

Here I get a MemoryError because the files are too big.

Then I tried:

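stepAndResult = []
for event, elem in ET.iterparse(pathToSteps, events=("start", "end")):
    # Record the attributes as soon as a <step> element opens.
    if elem.tag == "step" and event == "start":
        stepAndResult.append([elem.attrib['step'], elem.attrib['result'], "System1"])
    elem.clear()  # free the element's contents to limit memory use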

This works, but it is really slow. I guess it iterates through all elements, including the nested ones, and that takes a very long time.

Then I found a solution that looks like this:

try:
    tree = ET.iterparse(pathToSteps, events=("start", "end"))
    _, root = next(tree)  # the first event is the start of the root element
    print('ROOT:', root.tag)
except (OSError, ET.ParseError):
    print("ERROR: Unable to open and parse file !!!")

for child in root:
    print(child.attrib)

But this prints only the attributes of the first step.
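
A plausible explanation, stated here as an assumption rather than something verified against the real files: `iterparse` builds the tree incrementally, so right after the first `next()` call the root only holds the children that have been parsed so far. The sketch below keeps consuming the events, tracks the nesting depth, and reports each direct child of the root once it is complete. The function name `iter_step_attributes` and the in-memory `sample` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_step_attributes(source, tag="step"):
    """Yield the attribute dict of every direct child of the root named `tag`."""
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    depth = 0  # nesting depth below the root
    for event, elem in events:
        if event == "start":
            depth += 1
            continue
        # event == "end"
        if depth == 1:
            if elem.tag == tag:
                yield dict(elem.attrib)
            root.remove(elem)  # drop the finished child so memory stays bounded
        depth -= 1


# Hypothetical miniature input; the nested <step> must not be reported.
sample = b"""<test name="LongTestname" result="PASS">
    <step ID="0" step="NameOfStep1" result="PASS">
        <detail><step step="inner" result="FAIL"/></detail>
    </step>
    <step ID="1" step="NameOfStep2" result="PASS"><detail/></step>
</test>"""

steps = [(a["step"], a["result"]) for a in iter_step_attributes(io.BytesIO(sample))]
print(steps)
```

This still creates an element for every nested node, so it addresses memory use and the missing steps rather than speed.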

Is there a way to speed up the working solution? Since I'm pretty new to this, I would appreciate a complete example, or a reference with an example from which I can figure it out myself.
