Friday, 13 May 2022

Random access in a 7z file

I have a 100 GB text file in a 7z archive. I can find a pattern 'hello' in it by reading it by 1 MB block (7z outputs the data to stdout):

Popen("7z e -so archive.7z big100gb_file.txt", stdout=PIPE)
while True:
    block = proc.stdout.read(1024*1024)    # 1 MB block
    i += 1
    ...
    if b'hello' in block:      # omitting other details for search pattern split in consecutive blocks...
        print('pattern found in block %i' % i)
    ...

Now that we have found after 5 minutes of search that the pattern 'hello' is, say, in the 23456th block, how to access this block or line very fast in the future inside the 7z file?

(without saving this data in another file, which I don't want - indexing is the goal, not duplication of data)

With 7z, how to seek in the middle of the file?

Note: I already read Indexing / random access to 7zip .7z archives and random seek in 7z single file archive but these questions don't discuss concrete implementation.



from Random access in a 7z file

No comments:

Post a Comment