Friday, 2 July 2021

Find similarities using NLP/spaCy

I have a large dataframe whose ids I need to correct by comparing it against a reference dataframe. I will illustrate the problem with this simple example.

import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

df = pd.DataFrame({'id':['nan','nan','nan'],
                    'description':['JOHN HAS 25 YEAR OLD LIVES IN At/12','STEVE has  50 OLD LIVES IN At.14','ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)

df1 = pd.DataFrame({'id':[1203,1205,1045],
                    'description':['JOHN HAS 25year OLD LIVES IN At 2','STEVE has  50year OLD LIVES IN At 14','ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)

# Register the known ages, names and references as entity patterns
age = ["50year", "25year", "10year"]
for a in age:
    ruler.add_patterns([{"label": "age", "pattern": a}])

names = ["JOHN", "STEVE", "ALICIE"]
for n in names:
    ruler.add_patterns([{"label": "name", "pattern": n}])

ref = ["AT 2", "At 13", "At 14"]
for r in ref:
    ruler.add_patterns([{"label": "ref", "pattern": r}])

# Example: check which entities are recognized in a noisy description
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
    print(ent, ent.label_)

The text in the two dataframes differs: df contains noisy descriptions, while df1 is the reference, as shown below.

I don't know how to get a 100% similarity match in this case. I tried to use spaCy, but I don't know how to handle the differences and correct the id in df.

This is my dataframe (df):

   id                           description
0  nan      STEVE has 50 OLD LIVES IN At.14
1  nan   JOHN HAS 25 YEAR OLD LIVES IN At/12
2  nan  ALICIE HAS 10 YEAR OLD LIVES IN AT15

This is my reference dataframe (df1):

     id                           description
0  1203   STEVEN HAS 25year OLD lives in At 6
1  1205     JOHN HAS 25year OLD LIVES IN At 2
2  1045  ALICIE HAS 50year OLD LIVES IN At 13
3  3045   STEVE HAS 50year OLD LIVES IN At 14
4  3465  ALICIE HAS 10year OLD LIVES IN At 13

My expected output:

     id                           description
0   3045     STEVE has 50 OLD LIVES IN At.14
1   1205     JOHN HAS 25 YEAR OLD LIVES IN At/12
2   3465     ALICIE HAS 10year OLD LIVES IN AT15

NB: The sentences are not in the same order, and the dataframes do not have the same length.
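
One possible direction, given as a minimal sketch and not a definitive answer: normalize both sets of descriptions so that "At.14", "At/12" and "AT15" all take the same shape as "At 14", then pick the reference row with the highest character-level similarity. The sketch uses only the standard library (`re` and `difflib.SequenceMatcher`) in place of spaCy. The helper names `normalize` and `best_match_id` and the 0.8 threshold are assumptions made for illustration, and the rows are copied from the tables above.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, separate letters from digits, and replace punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"([a-z])(\d)", r"\1 \2", text)  # "at15" -> "at 15"
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)  # "25year" -> "25 year"
    text = re.sub(r"[^a-z0-9]+", " ", text)        # "at.14", "at/12" -> "at 14", "at 12"
    return " ".join(text.split())


def best_match_id(description, reference, threshold=0.8):
    """Return the id of the most similar reference row, or None if no row is close enough.

    The threshold value is an assumption and would need tuning on real data.
    """
    query = normalize(description)
    best_id, best_score = None, 0.0
    for ref_id, ref_description in reference:
        score = SequenceMatcher(None, query, normalize(ref_description)).ratio()
        if score > best_score:
            best_id, best_score = ref_id, score
    return best_id if best_score >= threshold else None


# Rows copied from the dataframe to correct
descriptions = [
    "STEVE has 50 OLD LIVES IN At.14",
    "JOHN HAS 25 YEAR OLD LIVES IN At/12",
    "ALICIE HAS 10 YEAR OLD LIVES IN AT15",
]

# (id, description) pairs copied from the reference dataframe
reference = [
    (1203, "STEVEN HAS 25year OLD lives in At 6"),
    (1205, "JOHN HAS 25year OLD LIVES IN At 2"),
    (1045, "ALICIE HAS 50year OLD LIVES IN At 13"),
    (3045, "STEVE HAS 50year OLD LIVES IN At 14"),
    (3465, "ALICIE HAS 10year OLD LIVES IN At 13"),
]

matched_ids = [best_match_id(d, reference) for d in descriptions]
print(matched_ids)  # [3045, 1205, 3465]
```

With pandas, the same helper could presumably be applied as `df['id'] = df['description'].apply(lambda d: best_match_id(d, list(zip(df1['id'], df1['description']))))`. Character-level similarity is crude: it gives a wrong age or a wrong "At" number only a small penalty, so for real data it may be safer to compare the name, age and reference fields separately.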



