Friday 6 November 2020

How can Dataframe.join be accelerated?

I have two text files (*.txt), each containing unique strings in the following format:

udtvbacfbbxfdffzpwsqzxyznecbqxgebuudzgzn:refmfxaawuuilznjrxuogrjqhlmhslkmprdxbascpoxda
ltswbjfsnejkaxyzwyjyfggjynndwkivegqdarjg:qyktyzugbgclpovyvmgtkihxqisuawesmcvsjzukcbrzi

The first file contains 50 million such lines (4.3 GB) and the second contains 1 million lines (112 MB). Each line consists of 40 characters, a : delimiter, and 45 more characters.

Task: find the values that are unique to the second file. That is, I need a csv or txt file containing the lines that appear in the second file but not in the first.
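
Conceptually this is a set difference. A minimal in-memory sketch of the semantics (not practical for the 4.3 GB file, and assuming the file names used in the code below):

# Semantics only: the lines of file2.txt that are absent from file1.txt.
with open('file2.txt') as f2, open('file1.txt') as f1:
    result = set(f2) - set(f1)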

I am trying to do this using vaex:

import vaex

# Convert the large "base" file to HDF5 in 5-million-row chunks.
base_files = ['file1.txt']
for i, txt_file in enumerate(base_files, 1):
    for j, dv in enumerate(vaex.from_csv(txt_file, chunk_size=5_000_000, names=['data']), 1):
        dv.export_hdf5(f'hdf5_base/base_{i:02}_{j:02}.hdf5')

# Convert the small "check" file the same way.
check_files = ['file2.txt']
for i, txt_file in enumerate(check_files, 1):
    for j, dv in enumerate(vaex.from_csv(txt_file, chunk_size=5_000_000, names=['data']), 1):
        dv.export_hdf5(f'hdf5_check/check_{i:02}_{j:02}.hdf5')

# Open all chunks as two memory-mapped dataframes and join on the 'data'
# column. Note that how='inner' keeps the rows present in BOTH files,
# not the rows unique to the check file.
dv_base = vaex.open('hdf5_base/*.hdf5')
dv_check = vaex.open('hdf5_check/*.hdf5')
dv_result = dv_check.join(dv_base, on='data', how='inner', inplace=True)
dv_result.export(path='result.csv')
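
If the goal is really the anti-join (rows of the check file with no match in the base file), one possible variant is a left join followed by a filter on the missing matches. This is an untested sketch against vaex's join API; the rsuffix value and the ismissing() filter are my assumptions:

# Sketch: keep only rows of dv_check whose 'data' has no match in dv_base.
dv_joined = dv_check.join(dv_base, on='data', how='left', rsuffix='_base')
dv_unique = dv_joined[dv_joined['data_base'].ismissing()]
dv_unique[['data']].export_csv('result.csv')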

As a result, I get the result.csv file with unique row values. But the verification process takes a very long time. In addition, it uses all available RAM and all processor resources. How can this process be accelerated? What am I doing wrong? What can be done better? Is it worth using other libraries (pandas, dask) for this check and will they be faster?
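
For comparison, since the second file is only about 112 MB, a plain-Python alternative that avoids dataframes entirely is to hold just the small file in a set and stream the large file past it once. A sketch (the result.txt output name is arbitrary):

# Load the 1M-line check file into a set; this is the only part held in RAM.
with open('file2.txt') as f:
    candidates = set(line.rstrip('\n') for line in f)

# Stream the 50M-line base file once, discarding every line that matches.
with open('file1.txt') as f:
    for line in f:
        candidates.discard(line.rstrip('\n'))

# Whatever remains occurs only in the second file.
with open('result.txt', 'w') as out:
    out.writelines(value + '\n' for value in sorted(candidates))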



