Saturday, 9 February 2019

Streaming json parser

I am looking to implement a streaming JSON parser for a very, very large JSON file (~1 TB) that I'm unable to load into memory. One option is to use something like https://github.com/stedolan/jq to convert the file into newline-delimited JSON, but there are various other things I need to do to each JSON object that make this approach not ideal.

Given a very large JSON document, how can I parse it object by object, similar to this approach for XML: https://www.ibm.com/developerworks/library/x-hiperfparse/index.html?

For example, in pseudocode:

import json

with open('file.json', 'r') as f:
    json_str = ''
    for line in f:  # what if there are no newlines in the json obj?
        json_str += line
        if is_valid(json_str):  # placeholder: does the buffer hold a complete JSON value?
            obj = json.loads(json_str)
            do_something(obj)
            json_str = ''
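
One possibility might be a true incremental parser such as the third-party ijson library, which would avoid the buffering guesswork entirely. A minimal sketch of what I have in mind, assuming the top-level value in the file is an array whose elements are the objects I care about:

import ijson  # incremental, event-based JSON parser: pip install ijson

with open('file.json', 'rb') as f:
    # The 'item' prefix yields each element of the top-level array one at a
    # time, so the whole document never has to fit in memory.
    for obj in ijson.items(f, 'item'):
        do_something(obj)  # same placeholder as in the pseudocode above

If the top level is a single object rather than an array, the prefix passed to ijson.items would presumably need to change (e.g. 'results.item' for an array stored under a "results" key).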

Additionally, I did not find jq -c to be particularly fast (ignoring memory considerations). For example, json.loads on the whole file was about as fast as (and a bit faster than) jq -c. I tried ujson as well, but kept getting a corruption error which I believe was related to the file size.

# file size is 2.2GB
>>> import json,time
>>> t0=time.time();_=json.loads(open('20190201_itunes.txt').read());print (time.time()-t0)
65.6147990227

$ time cat 20190206_itunes.txt | jq -c '.[]' > new.json
real    1m35.538s
user    1m25.109s
sys 0m15.205s
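
For comparison, a similar wall-clock timing could be run against an incremental pass over the same 2.2 GB file; a rough sketch, assuming ijson is available and that the top level is an array:

import time
import ijson

t0 = time.time()
count = 0
with open('20190201_itunes.txt', 'rb') as f:
    for obj in ijson.items(f, 'item'):  # each top-level array element in turn
        count += 1
print(count, time.time() - t0)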

Finally, here is an example 100 KB JSON input which can be used for testing: https://hastebin.com/ecahufonet.json



