Monday, 26 October 2020

Wrong value in text comparison

I am having trouble finding matching texts in the dataset below. Note that Sim is my current output, generated by the code further down, and it shows the wrong matches.

         ID          Text                                                 Sim
    13   fsad        amazing  ...                                         fsd
    14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...   fdsfdgte3e
    18   gsd         wonderful                                            fast
    21   dfsfs       i love this its incredible ...                       reds
    23   gwe         wonderful end ever seen you ...                      add
    ...  ...         ...                                                  ...
    261  add         wonderful                                            gwe
    261  add         wonderful                                            gsd
    261  add         wonderful                                            fdsdf
    267  fdsfdgte3e  best match ever its a masterpiece                    fdsdf
    277  hgdfgre     terrible destroys everything ...                     tm28

As shown above, Sim does not contain the ID of the user who wrote the matching text. For example, add should match gsd and vice versa, but my output says that add matches gwe, which is not true.

The code I am using is the following:

    from fuzzywuzzy import fuzz

    def sim(nm, df):
        # Find the texts that match nm using fuzzywuzzy's partial_ratio with a
        # threshold of 100. The output should be the IDs of the matching texts.
        matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
        return [df.ID[i] for i, x in enumerate(matches) if x]

    df['L_Text'] = df['Text'].str.lower()
    df['Sim'] = df.apply(lambda row: sim(row['L_Text'], df), axis=1)
    df = df.assign(
        Sim=df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
    )

    import numpy as np

    def tr(row):
        # Assign a similarity score to each earlier text using partial_ratio.
        return (df.loc[:row.name-1, 'L_Text']
                  .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

    t = (df.loc[1:].apply(tr, axis=1)
           .reindex(index=df.index,
                    columns=df.index)
           .fillna(0)
           .add_prefix('txt')
         )
    t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
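
To show what the second part is meant to do (score each text only against the earlier rows, then mirror the result into a full symmetric matrix), here is a minimal sketch without pandas. It is not my actual code: `score` uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.partial_ratio`, and `pairwise_scores` is a hypothetical helper name. The sketch puts 100 on the diagonal (a text compared with itself), whereas the snippet above adds 1 there.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Stand-in similarity score in 0..100 (assumption: not fuzzywuzzy's partial_ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pairwise_scores(texts):
    """Build a symmetric score matrix by filling the lower triangle and mirroring it."""
    n = len(texts)
    # Start from zeros, like fillna(0) does for the cells that were never scored.
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 100  # self-comparison
        for j in range(i):  # only rows that come before row i
            matrix[i][j] = score(texts[j], texts[i])
            matrix[j][i] = matrix[i][j]  # mirror into the upper triangle
    return matrix


if __name__ == "__main__":
    for row in pairwise_scores(["wonderful", "wonderful", "amazing"]):
        print(row)
```

Rows and columns here are plain positions (0, 1, 2, ...), not the DataFrame's index labels.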

Could you please help me understand the error in my code? Unfortunately I cannot see it.

My expected output would be as follows:

         ID          Text                                                 Sim
    13   fsad        amazing  ...
    14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...
    18   gsd         wonderful                                            add
    21   dfsfs       i love this its incredible ...
    23   gwe         wonderful end ever seen you ...
    ...  ...         ...                                                  ...
    261  add         wonderful                                            gsd
    261  add         wonderful                                            gsd
    261  add         wonderful                                            gsd
    267  fdsfdgte3e  best match ever its a masterpiece
    277  hgdfgre     terrible destroys everything ...

because the sim function is set to require a perfect match (a score of 100).
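
To make the expected output concrete, here is a minimal sketch of the result I am after for a few of the rows above. It uses exact equality of the lowercased texts as a stand-in for "score of 100"; that is an assumption on my side, and `fuzz.partial_ratio` may be more lenient because it scores the best-matching substring. The name `expected_sim` is hypothetical, and the longer texts are cut to the prefix visible in the table.

```python
def expected_sim(ids, texts):
    """For each row, list the IDs of the other users whose lowercased text is identical."""
    lowered = [t.lower() for t in texts]
    result = []
    for i, text in enumerate(lowered):
        result.append([
            ids[j]
            for j, other in enumerate(lowered)
            if other == text and ids[j] != ids[i]  # drop the row's own ID
        ])
    return result


if __name__ == "__main__":
    ids = ["fsad", "gsd", "gwe", "add"]
    texts = ["amazing", "wonderful", "wonderful end ever seen you", "wonderful"]
    print(expected_sim(ids, texts))  # [[], ['add'], [], ['gsd']]
```

Here gsd and add point at each other, and gwe has no match because its text is longer than just "wonderful".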


