In a folder of text files, I would like to extract all groups of words or expressions that recur across files. I do not know the words in advance. The expressions are between 2 and n words long.
The ideal output would be a dictionary mapping each expression to its frequency and the list of files containing it.
Is it possible to do this with the difflib library? If not, could you suggest another one I could look into?
Example:
text_file1 = """Lorem ipsum dolor sit amet, connectum adipiscing elit.
Sed do eisumodus temporis incididunt ut labore dororis magnum alida."""
text_file2 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""
text_file3 = """Lorem ipsum dolor sit amet, consectetor adipiscat elit,
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""
output = {
    "Lorem ipsum dolor sit amet": {
        "freq": 3,
        "files": ['text_file1', 'text_file2', 'text_file3'],
    },
    "adipiscing elit": {
        "freq": 2,
        "files": ['text_file1', 'text_file2'],
    },
    "sed do": {
        "freq": 2,
        "files": ['text_file1', 'text_file2'],
    },
    "tempor incididunt": {
        "freq": 2,
        "files": ['text_file2', 'text_file3'],
    },
    "incididunt ut labore": {
        "freq": 2,
        "files": ['text_file1', 'text_file2'],
    },
}
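To make the desired behaviour concrete, here is a minimal sketch of one way to produce this kind of output. It does not use difflib; it swaps in plain word n-gram counting with the standard library. The function name `common_ngrams`, its parameters, the lowercasing, the punctuation stripping, and the reading of "freq" as the number of files containing the phrase are all my own assumptions, not something confirmed by any library.

```python
import re
from collections import defaultdict


def common_ngrams(texts, min_words=2, max_words=5):
    """Hypothetical helper: map each shared phrase to its frequency and files.

    texts is a dict of {file_name: file_content}. "freq" is assumed to mean
    the number of files that contain the phrase.
    """
    seen = defaultdict(set)
    for name, content in texts.items():
        # Assumption: compare case-insensitively and ignore punctuation.
        words = re.findall(r"\w+", content.lower())
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                seen[" ".join(words[i:i + n])].add(name)

    # Keep only phrases that occur in at least two files.
    shared = {p: sorted(f) for p, f in seen.items() if len(f) > 1}

    result = {}
    for phrase, files in shared.items():
        padded = f" {phrase} "
        # Skip a phrase when a longer shared phrase in the same files contains it.
        covered = any(
            other != phrase
            and padded in f" {other} "
            and shared[other] == files
            for other in shared
        )
        if not covered:
            result[phrase] = {"freq": len(files), "files": files}
    return result


texts = {
    "text_file1": """Lorem ipsum dolor sit amet, connectum adipiscing elit.
Sed do eisumodus temporis incididunt ut labore dororis magnum alida.""",
    "text_file2": """Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.""",
    "text_file3": """Lorem ipsum dolor sit amet, consectetor adipiscat elit,
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos.""",
}

result = common_ngrams(texts)
for phrase, info in result.items():
    print(phrase, info)
```

Because this sketch drops punctuation, a phrase can span a sentence boundary: here "adipiscing elit" and "sed do" come out merged as the single phrase "adipiscing elit sed do". Splitting each text into sentences before building the n-grams would keep them separate. Reading the files from a folder (for example with `pathlib.Path.glob`) is left out to keep the example self-contained.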
from Find similar expressions without knowing them in multiple texts with Python3