I have a BZ2 file of more than 10GB. I'd like to read it without decompressing it into a temporary file (it would be more than 50GB).
With this method:
    import bz2, time

    t0 = time.time()
    time.sleep(0.001)  # avoid division by zero on the first report
    # raw string: in "F:\test.bz2" the "\t" would be parsed as a tab character
    with bz2.open(r"F:\test.bz2", 'rb') as f:
        for i, l in enumerate(f):
            if i % 100000 == 0:
                print('%i lines/sec' % (i / (time.time() - t0)))
I can only read about 250k lines per second. On a similar file, first decompressed, I get about 3M lines per second, i.e. roughly a 10x factor:
with open("F:\test.txt", 'rb') as f:
I think this is not only due to the intrinsic CPU cost of decompression (the total time to decompress into a temp file and then read that uncompressed file is much smaller than the time taken by the method above), but maybe to a lack of buffering, or to other reasons. Are there other, faster Python implementations of bz2.open?
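One way to test the buffering hypothesis directly: wrap the BZ2File in an io.BufferedReader with a large buffer, so that line splitting happens in BufferedReader's C implementation instead of BZ2File.readline. A minimal sketch (reusing the file path from the question; whether it actually helps depends on the Python version):

    import bz2, io, time

    t0 = time.time()
    with bz2.open(r"F:\test.bz2", 'rb') as raw:
        # Wrap the decompressing stream in a large buffered reader; line
        # iteration then goes through BufferedReader rather than BZ2File.
        f = io.BufferedReader(raw, buffer_size=16 * 1024 * 1024)
        for i, l in enumerate(f):
            if i % 100000 == 0 and i > 0:
                print('%i lines/sec' % (i / (time.time() - t0)))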
How can I speed up the reading of a BZ2 file in binary mode and loop over its "lines" (separated by \n)?
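Another approach that bypasses BZ2File's per-line overhead entirely is to feed large compressed chunks to bz2.BZ2Decompressor and split on b'\n' manually. A minimal sketch, assuming a single-stream .bz2 file (multi-stream files, e.g. produced by pbzip2, would need the decompressor restarted at each end of stream); iter_bz2_lines is a hypothetical helper name:

    import bz2

    def iter_bz2_lines(path, chunk_size=1024 * 1024):
        decomp = bz2.BZ2Decompressor()
        leftover = b''
        with open(path, 'rb') as f:
            while True:
                raw = f.read(chunk_size)
                if not raw:
                    break
                data = leftover + decomp.decompress(raw)
                lines = data.split(b'\n')
                leftover = lines.pop()  # keep the trailing partial line
                yield from lines
        if leftover:
            yield leftover  # file did not end with '\n'

Note one behavioral difference: unlike plain file iteration, the yielded lines here do not include the trailing \n.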
Note: currently, the time to decompress test.bz2 into test.tmp plus the time to iterate over the lines of test.tmp is far smaller than the time to iterate over the lines of bz2.open('test.bz2'), and this probably should not be the case.
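For reference, a minimal sketch of that two-step baseline (same file names as above, chunked copy via shutil.copyfileobj), timing each step separately:

    import bz2, shutil, time

    t0 = time.time()
    # Step 1: decompress once into a temporary uncompressed file.
    with bz2.open(r"F:\test.bz2", 'rb') as src, open(r"F:\test.tmp", 'wb') as dst:
        shutil.copyfileobj(src, dst, 1024 * 1024)
    t1 = time.time()
    # Step 2: iterate over the lines of the uncompressed file.
    with open(r"F:\test.tmp", 'rb') as f:
        for l in f:
            pass
    t2 = time.time()
    print('decompress: %.1f s, iterate: %.1f s' % (t1 - t0, t2 - t1))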
Linked topic: https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869