Friday, 29 January 2021

Speed up reading in a compressed bz2 file ('rb' mode)

I have a BZ2 file of more than 10 GB. I'd like to read it without first decompressing it to a temporary file (the decompressed data would be more than 50 GB).

With this method:

import bz2, time
t0 = time.time()
time.sleep(0.001)  # avoid division by zero in the rate calculation
with bz2.open(r"F:\test.bz2", 'rb') as f:  # raw string: otherwise '\t' is a tab escape
    for i, l in enumerate(f):
        if i % 100000 == 0:
            print('%i lines/sec' % (i / (time.time() - t0)))

I can only read about 250k lines per second. On a similar file, decompressed beforehand, I get about 3M lines per second, i.e. roughly a 10x factor:

with open(r"F:\test.txt", 'rb') as f:
    ...  # same enumeration loop as above

I don't think this is due only to the intrinsic CPU cost of decompression: the total time to decompress into a temp file and then read that uncompressed file is much smaller than the time taken by the method above. It may be a lack of buffering, or something else. Are there other, faster Python implementations of bz2.open?
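One thing worth trying, as a minimal sketch: line iteration on the object returned by bz2.open issues many small reads against the decompressor, so wrapping it in io.BufferedReader (assuming that per-line overhead is indeed the bottleneck) lets readline() work from a large in-memory buffer instead:

import bz2, io, time
t0 = time.time()
time.sleep(0.001)  # avoid division by zero, as above
with bz2.open(r"F:\test.bz2", 'rb') as raw:
    # BufferedReader pulls large blocks from the decompressor, so
    # iteration runs against the buffer instead of BZ2File itself.
    with io.BufferedReader(raw, buffer_size=8 * 1024 * 1024) as f:
        for i, l in enumerate(f):
            if i % 100000 == 0:
                print('%i lines/sec' % (i / (time.time() - t0)))

The 8 MiB buffer size is an arbitrary starting point; it is worth benchmarking a few values.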

How can the reading of a BZ2 file in binary mode be sped up while looping over its "lines" (separated by \n)?
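Another angle, sketched under the assumption that per-line Python overhead dominates: read the decompressed stream in large chunks and split on b'\n' manually (note that, unlike file iteration, split() drops the newline byte from each line):

import bz2
CHUNK = 16 * 1024 * 1024  # 16 MiB per read; an arbitrary starting point
with bz2.open(r"F:\test.bz2", 'rb') as f:
    leftover = b''
    count = 0
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        lines = (leftover + chunk).split(b'\n')
        leftover = lines.pop()  # last element may be an incomplete line
        for line in lines:
            count += 1  # process each complete line here
    if leftover:  # file may not end with a newline
        count += 1
print('%i lines' % count)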

Note: currently, the time to decompress test.bz2 into test.tmp plus the time to iterate over the lines of test.tmp is far smaller than the time to iterate over the lines of bz2.open('test.bz2'), and this probably should not be the case.
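That observation also suggests a workaround: delegate decompression to an external process and read its stdout, so decompression and line iteration overlap. A sketch, assuming a bzcat executable (or pbzip2 -dc as a parallel variant) is available on PATH, which may not be true on a stock Windows machine:

import subprocess
with subprocess.Popen(['bzcat', r"F:\test.bz2"],
                      stdout=subprocess.PIPE,
                      bufsize=1024 * 1024) as proc:  # large pipe buffer
    for i, l in enumerate(proc.stdout):
        pass  # each l is a bytes line, '\n'-terminated, as with bz2.open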

Linked topic: https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869



