Sunday, 7 November 2021

Find similar expressions without knowing them in multiple texts with Python3

In a folder of text files, I would like to extract all similar groups of words, or similar expressions, that occur in more than one file. I do not know the words in advance. Each expression is between 2 and n words long.

The ideal output would be a dictionary with the frequency of the pattern and the list of files containing it.

Is it possible to do this with the difflib library? If not, could you suggest one I could look into?

Example:

text_file1 = """Lorem ipsum dolor sit amet, connectum adipiscing elit. 
Sed do eisumodus temporis incididunt ut labore dororis magnum alida."""

text_file2 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

text_file3 = """Lorem ipsum dolor sit amet, consectetor adipiscat elit, 
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""

# Desired output: each shared expression mapped to its frequency
# and to the list of files that contain it.
output = {
    "Lorem ipsum dolor sit amet": {
        "freq": 3,
        "files": ['text_file1', 'text_file2', 'text_file3'],
    },
    "adipiscing elit": {
        "freq": 2,
        "files": ['text_file1', 'text_file2'],
    },
    "sed do": {
        "freq": 2,
        "files": ['text_file1', 'text_file2'],
    },
    "tempor incididunt": {
        "freq": 2,
        "files": ['text_file2', 'text_file3'],
    },
    "incididunt ut labore": {
        "freq": 2,
        "files": ['text_file1', 'text_file2'],
    },
}
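
For reference, here is a rough sketch of the kind of result I am after. It does not use difflib: it is a plain word n-gram count built on the standard library, so it only finds exact matches, not fuzzy ones. The function names (word_ngrams, shared_expressions), the max_n default, the lowercasing, the choice to break expressions at punctuation, and the reading of "freq" as the number of files containing the expression are all my own assumptions.

```python
import re
from collections import defaultdict


def word_ngrams(text, min_n=2, max_n=8):
    """Yield lowercase word n-grams that stay inside one punctuation-delimited segment."""
    # Assumption: an expression never crosses . , ; : ! or ?
    for segment in re.split(r"[.,;:!?]", text.lower()):
        words = re.findall(r"\w+", segment)
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n])


def shared_expressions(texts, min_n=2, max_n=8):
    """Map each maximal n-gram found in at least two texts to its file count and file list.

    `texts` is a dict of {file name: file content}.
    """
    files_by_gram = defaultdict(set)
    for name, text in texts.items():
        for gram in word_ngrams(text, min_n, max_n):
            files_by_gram[gram].add(name)

    # Keep only n-grams that appear in two or more files.
    shared = {g: f for g, f in files_by_gram.items() if len(f) >= 2}

    result = {}
    for gram, files in shared.items():
        # Drop an n-gram when a longer shared n-gram contains it
        # (on word boundaries) and occurs in exactly the same files.
        covered = any(
            len(other) > len(gram)
            and other_files == files
            and f" {gram} " in f" {other} "
            for other, other_files in shared.items()
        )
        if not covered:
            result[gram] = {"freq": len(files), "files": sorted(files)}
    return result
```

Calling shared_expressions({"text_file1": text_file1, "text_file2": text_file2, "text_file3": text_file3}) on the three sample texts gives the five expressions listed above, with lowercase keys. For a real folder, the dict could be built with something like {p.name: p.read_text() for p in pathlib.Path(folder).glob("*.txt")}. The filtering step compares every shared n-gram with every other one, so it may be slow on large collections.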

