I have the following dataframe:
import pandas as pd

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
I want to identify similar names in the name column when those names belong to the same cluster number, and create a unique id for each group of similar names. For example, South Beach and Beach belong to cluster number 1 and their similarity score is fairly high, so both get the same unique id, say 1. The next cluster is number 2, and three entries from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score, so their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.
I wrote the following code for the logic above:
# pip install thefuzz
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test['id'] = 0  # 0 means "no id assigned yet"

i = 1  # next unique id to hand out
is_i_used = False
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
        # Only compare rows in the same cluster that have no id yet
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                df_test.loc[index_, 'id'] = i
                is_i_used = True
    # Move on to a fresh id only if the current one was actually assigned
    if is_i_used:
        i += 1
        is_i_used = False
The code generates the expected result:
          name  cluster_number  id
0  South Beach               1   1
1          Dog               2   2
2         Bird               3   3
3          Ant               3   4
4      Big Dog               2   2
5        Beach               1   1
6         Dear               4   5
7          Cat               2   6
Note that Cat got id 6, but that is fine because it is unique anyway.
While the algorithm above works for the test data, I cannot use it on my real data (about 1 million rows), so I am trying to understand how to vectorize the code and get rid of the two for-loops.
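One direction that might help, before any real vectorization, is to stop comparing rows from different clusters at all. The sketch below groups row positions by cluster first and then applies the same greedy rule as the loops above inside each cluster. The function name assign_ids and the threshold parameter are made up for illustration. If thefuzz is not installed, it falls back to a difflib-based scorer, which is only a stand-in and may not give exactly the same scores as fuzz.ratio. Ids are handed out cluster by cluster, so the numbering differs from the table above, but the grouping is the same.

```python
from collections import defaultdict
from difflib import SequenceMatcher

try:
    from thefuzz import fuzz
    score = fuzz.ratio
except ImportError:
    # Stand-in scorer (an assumption, not thefuzz's exact algorithm)
    def score(a, b):
        return 100 * SequenceMatcher(None, a, b).ratio()


def assign_ids(names, clusters, threshold=50):
    """Greedy id assignment that only compares names within one cluster."""
    # Group row positions by cluster so cross-cluster pairs are never scored
    members = defaultdict(list)
    for pos, cluster in enumerate(clusters):
        members[cluster].append(pos)

    ids = [0] * len(names)  # 0 means "no id assigned yet"
    next_id = 1
    for positions in members.values():
        for anchor in positions:
            # Unassigned rows of this cluster that are similar to the anchor
            hits = [p for p in positions
                    if ids[p] == 0 and score(names[anchor], names[p]) > threshold]
            if hits:
                for p in hits:
                    ids[p] = next_id
                next_id += 1
    return ids


names = ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat']
clusters = [1, 2, 3, 3, 2, 1, 4, 2]
ids = assign_ids(names, clusters)
```

With the dataframe from the question this could be applied as df_test['id'] = assign_ids(df_test['name'].tolist(), df_test['cluster_number'].tolist()). The work is still quadratic inside each cluster, so whether it is fast enough for 1 million rows depends on how large the clusters are.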
The thefuzz module also has a process function that can score one query against all the data at once:
from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))
But I don't see whether it can help with speeding up the code.
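A related idea is to score all pairs inside one cluster in a single batched call rather than one query at a time. The sketch below assumes the rapidfuzz package is available (as far as I know, recent thefuzz versions are built on it) and uses its process.cdist function, which returns a NumPy matrix of pairwise scores. This swaps thefuzz's process.extract for a different function, and the helper name score_matrix is made up for illustration. If rapidfuzz or NumPy is missing, it falls back to a slow difflib-based stand-in so the sketch still runs.

```python
from difflib import SequenceMatcher

try:
    import numpy  # noqa: F401  (process.cdist returns a NumPy array)
    from rapidfuzz import fuzz, process

    def score_matrix(names):
        # All pairwise scores in one call; workers=-1 uses all CPU cores
        return process.cdist(names, names, scorer=fuzz.ratio, workers=-1)
except ImportError:
    def score_matrix(names):
        # Slow stand-in (an assumption, not rapidfuzz's exact algorithm)
        return [[100 * SequenceMatcher(None, a, b).ratio() for b in names]
                for a in names]


# Names that belong to cluster number 2 in the test data
cluster_2 = ['Dog', 'Big Dog', 'Cat']
scores = score_matrix(cluster_2)

# Row 0 holds the scores of 'Dog' against every name in the cluster
similar_to_dog = [name for name, s in zip(cluster_2, scores[0]) if s > 50]
```

The matrix for a cluster of k names has k * k entries, so this only seems practical when it is computed per cluster and the clusters are reasonably small.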
from How to vectorize and speed-up double for-loop for pandas dataframe when doing text similarity scoring