I have the following code for similarity scoring:
from rapidfuzz import process, fuzz
import pandas as pd
import numpy as np
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish', 'Fish'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
names = df_test["name"]
# full pairwise similarity matrix: len(names) x len(names)
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
# row/column indices of every pair scoring above 50
x, y = np.where(scores > 50)
# group each name with every similar name, then de-duplicate the resulting sets
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")
This code is an efficient, vectorized method for similarity scoring. It works well for small data sets, but when I try a DataFrame with 1 million rows I get a MemoryError from rapidfuzz.process.cdist(...). As mentioned in the comments below, this function returns a matrix of len(queries) x len(choices) entries of size(dtype) bytes each. By default the dtype is float or int32_t depending on the scorer (for the default scorer used here it is float). So for 1 million names the result matrix would need roughly 1,000,000 x 1,000,000 x 4 bytes, about 4 terabytes of memory. My PC has 12 GB of free RAM, which is nowhere near enough. Any ideas how to avoid overloading RAM while keeping the computation vectorized?
from How to do effective matrix computation and not get memory overload for similarity scoring?
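One approach that keeps each block of the computation vectorized without ever materializing the full 1,000,000 x 1,000,000 matrix is to call process.cdist on blocks of query rows and keep only the index pairs above the cutoff. The sketch below is a minimal illustration of that idea, not a drop-in replacement: the helper name similar_pairs_chunked and the chunk size are invented for this example, and the uint8 dtype assumes a rapidfuzz version that accepts it for similarity scorers (drop the dtype argument otherwise).

import numpy as np
from rapidfuzz import process, fuzz

def similar_pairs_chunked(names, cutoff=50, chunk_size=1_000):
    # Score one block of query rows at a time so peak memory is
    # chunk_size x len(names) instead of len(names) x len(names).
    # With uint8 scores, 1,000 x 1,000,000 entries is about 1 GB per block.
    names = list(names)
    for start in range(0, len(names), chunk_size):
        block = names[start:start + chunk_size]
        scores = process.cdist(block, names, scorer=fuzz.ratio,
                               score_cutoff=cutoff,  # scores below cutoff become 0
                               dtype=np.uint8,       # assumption: uint8 output supported
                               workers=-1)
        x, y = np.where(scores > cutoff)
        # shift the row indices back into the coordinates of the full matrix
        yield from zip(x + start, y)

The yielded (row, column) pairs play the same role as the x and y arrays from np.where above, so the groupby/frozenset step can be reused on them unchanged; only the chunk size and the smaller dtype are assumptions added here.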