Hemant Vishwakarma: How to detect an XML schema

Monday, 17 December 2018

How to detect an XML schema

I have a very large feed file that is sent as an XML document (5 GB). What would be the fastest way to determine the structure of the main item node without knowing it in advance? Is there a Python utility that can do this on the fly, without loading the complete XML into memory? For example, if I saved just the first 5 MB of the file (which by itself would be invalid XML, since the closing tags would be missing), would there be a way to infer the schema from that?
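One possible direction, sketched here under assumptions rather than offered as a definitive answer: the standard library's `xml.etree.ElementTree.XMLPullParser` accepts data in pieces and only complains about unclosed tags when `close()` is called, so a truncated prefix can still be scanned for element paths. The helper name `discover_paths` and the chunk-based interface are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def discover_paths(chunks):
    """Return the distinct element paths seen in a (possibly truncated) XML stream.

    `chunks` is any iterable of bytes objects, e.g. blocks read from a file.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    stack = []    # tags of the elements that are currently open
    paths = []    # distinct paths, in first-seen order
    seen = set()
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                stack.append(elem.tag)
                path = "/".join(stack)
                if path not in seen:
                    seen.add(path)
                    paths.append(path)
            else:
                stack.pop()
                # Drop the finished element's text and children to limit memory use.
                elem.clear()
    # parser.close() is deliberately not called: a truncated input would
    # raise a ParseError there because its closing tags are missing.
    return paths
```

A usage sketch, assuming the prefix was saved to a hypothetical file `feed_head.xml`: open it in binary mode and pass `iter(lambda: f.read(1 << 16), b"")` as `chunks`. The result only reflects the elements that happen to appear in the sampled prefix, so optional fields that first occur later in the feed would be missed.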

Update: I've included an example XML fragment here: https://hastebin.com/muqudasacu.xml. I'm looking to extract something like a DataFrame (or a list, or whatever other data structure you want to use) similar to the following:

Items/Item/Main/Platform       Items/Item/Info/Name
iTunes                         Chuck Versus First Class
iTunes                         Chuck Versus Bo

How could this be done? I've added a bounty to encourage answers here.
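As a rough sketch of the kind of extraction described above, and assuming the feed follows the `Items/Item/...` layout implied by the column names, `xml.etree.ElementTree.iterparse` can stream a well-formed file and emit one dictionary per item. The function name `iter_rows` and the `row_path` parameter are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, row_path="Items/Item"):
    """Yield one dict per row element, mapping each leaf path to its text."""
    stack = []   # tags of the elements that are currently open
    row = None   # dict for the row being built, or None outside a row
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if "/".join(stack) == row_path:
                row = {}
        else:
            path = "/".join(stack)
            # A leaf is an element without child elements.
            if row is not None and len(elem) == 0 and path != row_path:
                row[path] = (elem.text or "").strip()
            stack.pop()
            if path == row_path:
                yield row
                row = None
                # Free the finished row's subtree; an empty shell stays
                # attached to the root element.
                elem.clear()


sample = io.BytesIO(
    b"<Items>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus First Class</Name></Info></Item>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus Bo</Name></Info></Item>"
    b"</Items>"
)
rows = list(iter_rows(sample))
```

The resulting list of dictionaries could then be handed to `pandas.DataFrame(rows)` if a DataFrame is wanted. Note that `iterparse` raises a `ParseError` once it reaches the end of a truncated file, so this sketch is meant for the complete feed; repeated leaf names inside one item would overwrite each other in this simple version.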

