I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a python utility that can do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
Update: I've included an example XML fragment here: https://hastebin.com/muqudasacu.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:
Items/Item/Main/Platform Items/Item/Info/Name
iTunes Chuck Versus First Class
iTunes Chuck Versus Bo
How could this be done? I've added a bounty to encourage answers here.
from How to detect an XML schema
No comments:
Post a Comment