Tuesday, 29 December 2020

Calculate cosine similarity between all cases in a dataframe fast

I'm working on an NLP project where I have to compare the similarity between many sentences, e.g. from this dataframe:

[Image: dataframe with a "questions" column and a "use_vector" embedding column]

The first thing I tried was to join the dataframe with itself to get the format below and compare row by row:

[Image: self-joined dataframe with each pair of questions and their vectors side by side]

The problem with this is that I run out of memory quickly for medium/big datasets; e.g., a self-join of 10k rows produces 100M rows, which I cannot fit in RAM.
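For reference, the quadratic blow-up of the self-join can be reproduced like this (a minimal sketch; the column name is taken from the screenshots, the data is made up):

import pandas as pd

n = 1_000  # even 10k rows squared is 100M pairs, hence the memory blow-up
df = pd.DataFrame({"questions": [f"question {i}" for i in range(n)]})

# Self-join via a constant key: every row pairs with every row
joined = df.assign(key=1).merge(df.assign(key=1), on="key", suffixes=("_a", "_b"))
print(len(joined))  # n * n = 1_000_000 rows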

My current approach is to iterate over the dataframe as follows:

import copy
import pandas as pd

final = pd.DataFrame()

### for each row
for i in range(len(df_sample)):

    ### select the corresponding vector to compare with
    v = df_sample.loc[i, "use_vector"]

    ### compare all cases against the selected vector (computed once and reused)
    sims = df_sample["use_vector"].apply(lambda x: cosine_similarity_numba(x, v))

    ### keep the cases with a similarity over a given threshold, in this case 0.6
    temp = df_sample[sims > 0.6].copy()
    ### filter out the base case
    temp = temp[temp.index != i]
    temp["original_question"] = copy.copy(df_sample.loc[i, "questions"])
    ### append the result
    final = pd.concat([final, temp])
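cosine_similarity_numba is referenced above but never shown; purely as an assumption, a typical numba-jitted version of it might look like this (not necessarily the asker's actual code):

import numpy as np
from numba import njit

@njit(fastmath=True)
def cosine_similarity_numba(u, v):
    # Plain dot-product cosine similarity over two 1-D float arrays,
    # JIT-compiled by numba so the Python loop runs at native speed
    dot = 0.0
    norm_u = 0.0
    norm_v = 0.0
    for i in range(u.shape[0]):
        dot += u[i] * v[i]
        norm_u += u[i] * u[i]
        norm_v += v[i] * v[i]
    return dot / np.sqrt(norm_u * norm_v)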

But this approach is not fast either. How can I improve the performance of this process?
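One direction that tends to be much faster (a sketch, assuming use_vector holds fixed-length numeric arrays such as Universal Sentence Encoder embeddings): stack the embeddings into one matrix, L2-normalise it, and get all pairwise similarities from matrix products, processed in row chunks so the full n-by-n matrix is never held in memory at once:

import numpy as np

# Stack the per-row embeddings into an (n, d) matrix
M = np.vstack(df_sample["use_vector"].to_numpy())
# L2-normalise rows so a plain dot product equals cosine similarity
M = M / np.linalg.norm(M, axis=1, keepdims=True)

threshold = 0.6
chunk = 1_000
matches = []
for start in range(0, len(M), chunk):
    # (chunk, n) block of pairwise similarities in one matrix product
    block = M[start:start + chunk] @ M.T
    rows, cols = np.where(block > threshold)
    rows += start
    keep = rows != cols          # drop self-matches
    matches.extend(zip(rows[keep], cols[keep]))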



from Calculate cosine similarity for between all cases in a dataframe fast
