I am looking to implement a streaming JSON parser for a very, very large JSON file (~1TB) that I'm unable to load into memory. One option is to use something like https://github.com/stedolan/jq to convert the file into newline-delimited JSON, but there are various other things I need to do to each JSON object, which makes this approach not ideal.
Given a very large JSON file, how can I parse it object by object, similar to this iterparse approach for XML: https://www.ibm.com/developerworks/library/x-hiperfparse/index.html.
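The closest thing I've found to that iterparse style for JSON is the third-party ijson library, which parses incrementally from a file handle. A minimal sketch, assuming the top level of file.json is one big JSON array of objects (do_something is just a placeholder):

import ijson  # third-party incremental JSON parser (pip install ijson)

with open('file.json', 'rb') as f:
    # 'item' selects each element of a top-level array, so objects are
    # yielded one at a time without loading the whole file into memory.
    for obj in ijson.items(f, 'item'):
        do_something(obj)  # placeholder for per-object processing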
For example, in pseudocode:
import json

with open('file.json', 'r') as f:
    json_str = ''
    for line in f:  # what if there are no newlines in the JSON obj?
        json_str += line
        try:
            obj = json.loads(json_str)  # stands in for an is_valid() check
        except ValueError:
            continue  # not a complete JSON value yet; keep accumulating
        do_something(obj)  # placeholder for per-object processing
        json_str = ''
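To handle the no-newlines case, one option I'm considering is buffering fixed-size chunks and peeling complete values off the front of the buffer with json.JSONDecoder.raw_decode from the standard library. A rough sketch, assuming the file is a stream of concatenated top-level JSON values rather than a single enclosing array (it won't help if everything is wrapped in one giant array):

import json

def iter_json_values(path, chunk_size=1 << 20):
    # Sketch only: assumes concatenated top-level JSON values;
    # any incomplete trailing data at EOF is silently ignored.
    decoder = json.JSONDecoder()
    buf = ''
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                buf = buf.lstrip()  # skip whitespace between values
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    break  # value still incomplete; read another chunk
                yield obj
                buf = buf[end:]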
Additionally, I did not find jq -c to be particularly fast (ignoring memory considerations). For example, running json.loads on the whole file was just as fast as (and a bit faster than) jq -c. I also tried ujson, but kept getting a corruption error which I believe was related to the file size.
# file size is 2.2GB
>>> import json,time
>>> t0=time.time();_=json.loads(open('20190201_itunes.txt').read());print (time.time()-t0)
65.6147990227
$ time cat 20190206_itunes.txt|jq -c '.[]' > new.json
real 1m35.538s
user 1m25.109s
sys 0m15.205s
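For completeness, once jq -c has produced the newline-delimited new.json above, each line can be parsed independently with bounded memory, roughly:

import json

with open('new.json', 'r') as f:
    for line in f:
        obj = json.loads(line)  # one complete JSON object per line
        do_something(obj)       # placeholder for per-object processing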
Finally, here is an example 100KB JSON input which can be used for testing: https://hastebin.com/ecahufonet.json