I am using Motor for async MongoDB operations. I have a GridFS store where I keep large XML files (typically 30+ MB) in chunks of 8 MB. I want to parse the XML incrementally using xmltodict. Here is my current code:
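import tempfile
import xmltodict
from motor.motor_asyncio import AsyncIOMotorGridOut

async def read_file(file_id):
    gfs_out: AsyncIOMotorGridOut = await gfs_bucket.open_download_stream(file_id)
    tmpfile = tempfile.SpooledTemporaryFile(mode="w+b")
    while data := await gfs_out.readchunk():  # readchunk() returns b"" at end of file
        tmpfile.write(data)
    tmpfile.seek(0)  # rewind so the parser starts at the beginning of the data
    xmltodict.parse(tmpfile)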
I pull the chunks out one by one, store them in an in-memory temporary file, and then parse the entire file with xmltodict. Ideally I would like to parse it incrementally, as I don't need the entire XML object from the get-go.
The documentation for xmltodict suggests that we can add custom handlers to parse a stream, like this example:
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
But the problem with this is that it expects a file-like object with a synchronous read() method, not a coroutine. Is there any way to achieve this? Any help would be greatly appreciated.
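One possible direction, as a minimal sketch rather than a confirmed solution, is to run the blocking parser in a worker thread and give it an adapter whose synchronous read() fetches each chunk from the event loop with asyncio.run_coroutine_threadsafe. The names BlockingChunkReader, parse_stream, collect_names, and make_fake_source are hypothetical and introduced only for illustration. The demo uses the standard library's expat ParseFile as a stand-in consumer, plus a fake chunk source in place of GridFS.

```python
import asyncio
import xml.parsers.expat


class BlockingChunkReader:
    """Hypothetical adapter: a blocking read() backed by an async chunk source."""

    def __init__(self, read_chunk, loop):
        self._read_chunk = read_chunk  # callable returning an awaitable of bytes
        self._loop = loop              # event loop that owns the chunk source
        self._chunk = b""
        self._pos = 0
        self._eof = False

    async def _next_chunk(self):
        # Runs on the event loop thread, so read_chunk may return any awaitable.
        return await self._read_chunk()

    def read(self, size=-1):
        # Called from the worker thread; blocks until the loop delivers a chunk.
        if self._pos >= len(self._chunk) and not self._eof:
            future = asyncio.run_coroutine_threadsafe(self._next_chunk(), self._loop)
            self._chunk, self._pos = future.result(), 0
            self._eof = not self._chunk
        # Simplification: a negative size returns only the rest of the current chunk.
        end = len(self._chunk) if size is None or size < 0 else self._pos + size
        data = self._chunk[self._pos:end]
        self._pos += len(data)
        return data  # never longer than size; b"" signals end of stream


async def parse_stream(read_chunk, consume):
    """Run the blocking consume(fileobj) in a thread while the loop feeds it."""
    loop = asyncio.get_running_loop()
    reader = BlockingChunkReader(read_chunk, loop)
    return await loop.run_in_executor(None, consume, reader)


def collect_names(fileobj):
    """Stand-in consumer: records element names via expat's ParseFile."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.ParseFile(fileobj)
    return names


def make_fake_source(chunks):
    """Fake async chunk source standing in for a GridFS download stream."""
    iterator = iter(chunks)

    async def read_chunk():
        await asyncio.sleep(0)  # yield to the loop, as real I/O would
        return next(iterator, b"")

    return read_chunk


def demo():
    chunks = [b"<artists><artist><na", b"me>King Crimson</name></artist>", b"</artists>"]
    return asyncio.run(parse_stream(make_fake_source(chunks), collect_names))
```

With Motor, read_chunk would presumably be gfs_out.readchunk, and consume could be something like lambda f: xmltodict.parse(f, item_depth=2, item_callback=handle_artist). That relies on xmltodict only calling read() on the object it is given, which has not been verified here. Note that item_callback would then run in the worker thread, not on the event loop.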
from Stream large XML file directly from GridFS to xmltodict parsing