I have a list of sentences such as this:
errList = [ 'Ragu ate lunch but didnt have Water for drinks',
'Rams ate lunch but didnt have Gatorade for drinks',
'Saya ate lunch but didnt have :water for drinks',
'Raghu ate lunch but didnt have water for drinks',
'Hanu ate lunch but didnt have -water for drinks',
'Wayu ate lunch but didnt have water for drinks',
'Viru ate lunch but didnt have .water 4or drinks',
'kk ate lunch & icecream but did have Water for drinks',
'M ate lunch &and icecream but did have Gatorade for drinks',
'Parker ate lunch icecream but didnt have :water for drinks',
'Sassy ate lunch and icecream but didnt have water for drinks',
'John ate lunch and icecream but didnt have -water for drinks',
'Pokey ate lunch and icecream but didnt have Water for drinks',
'Laila ate lunch and icecream but did have water 4or drinks',
]
I want to count the longest phrases (parts of sentences) that recur across the elements of the list. For the example above, the output would look something like this (longest phrase as key, count as value):
{ 'ate lunch but didnt have': 7,
'water for drinks': 7,
'ate lunch and icecream': 4,
'didnt have water': 3,
'didnt have Water': 2 # matching is case-sensitive
}
Using the re module is out of the question, since the problem is closer to sequence matching. Perhaps nltk or scikit-learn would help? I have some familiarity with NLP and scikit-learn, but not enough to solve this. If I solve it, I will publish the solution here.
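As a possible starting point (not a full solution), here is a minimal sketch that uses only the standard library. It counts word n-grams once per sentence and keeps a phrase only if no longer phrase with the same count contains it. The names `common_phrases`, `min_words` and `min_count` are made up for illustration. It assumes phrases are runs of whitespace-separated words, so `:water` and `water` are treated as different tokens and the counts will not exactly match the sample output above.

```python
from collections import Counter


def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def contains(long_phrase, short_phrase):
    """True if short_phrase occurs as a contiguous run inside long_phrase."""
    k = len(short_phrase)
    return any(long_phrase[i:i + k] == short_phrase
               for i in range(len(long_phrase) - k + 1))


def common_phrases(sentences, min_words=3, min_count=2):
    """Count recurring word phrases (hypothetical helper, case-sensitive).

    Each phrase is counted at most once per sentence. A phrase is dropped
    when a longer phrase with the same count already contains it.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        seen = set()
        for n in range(min_words, len(words) + 1):
            seen.update(ngrams(words, n))
        counts.update(seen)

    frequent = {p: c for p, c in counts.items() if c >= min_count}

    result = {}
    for phrase, count in frequent.items():
        covered = any(len(other) > len(phrase)
                      and frequent[other] == count
                      and contains(other, phrase)
                      for other in frequent)
        if not covered:
            result[" ".join(phrase)] = count
    return result
```

Calling `common_phrases(errList)` returns a dict in the same shape as the desired output. Matching at the character level instead (so that `:water` and `-water` both count toward `water for drinks`) would need a different tokenization or a substring-based approach.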
from sequence matching algorithm in python