I need to split each line on the slash and then report the tags. The file is in Hunspell dictionary format. I tried to find a class on GitHub that would do this, but could not find one.
# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES
The code:
from collections import defaultdict
myl = defaultdict(list)
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
        except:
            pass
Output:
defaultdict(list,
            {'S': ['test', 'test', 'girl', 'house', 'wind'],
             'SE': ['girl'],
             'E': ['girl', 'house', 'man', 'man', 'wind'],
             '': ['home'],
             'SE123': ['house'],
             '1': ['house'],
             '2': ['house'],
             '3': ['house'],
             'ES': ['wind']})
The SE group should have 3 words: 'girl', 'wind' and 'house'. There should be no ES group, because it contains the same tags as "SE", and SE123 should remain as is. How do I achieve this?
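One possible way to get the grouping described above is to store each word under the set of its tags and then ask for every word whose tags include the ones you want. This is only a sketch: it assumes every flag is a single character, and the function names index_by_tagset and words_with_tags are made up for illustration. Because a frozenset ignores order, "ES" and "SE" end up under the same key, and "SE123" stays a group of its own.

```python
from collections import defaultdict

def index_by_tagset(lines):
    """Map each distinct set of tags to the words that carry exactly that set."""
    index = defaultdict(list)
    for line in lines:
        word, sep, tags = line.strip().partition('/')
        # Skip lines without a slash ("boy") and lines with no tags ("home/").
        if sep and tags:
            # Assumption: one character per flag, so "SE" means flags S and E.
            index[frozenset(tags)].append(word)
    return index

def words_with_tags(index, wanted):
    """Return the words whose tags include every character in `wanted`."""
    wanted = frozenset(wanted)
    return [word
            for tagset, words in index.items()
            if wanted <= tagset          # subset test: all wanted tags present
            for word in words]

sample = ["test/S", "boy", "girl/SE", "home/", "house/SE123",
          "man/E", "country", "wind/ES"]
index = index_by_tagset(sample)
print(words_with_tags(index, "SE"))     # ['girl', 'wind', 'house']
print(words_with_tags(index, "SE123"))  # ['house']
```

Querying on demand like this avoids storing every tag combination up front, which matters when a word carries many flags.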
Update:
I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?
from collections import defaultdict
import nltk
myl = defaultdict(list)
with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
            bigrm = list(nltk.bigrams([i for i in tags]))
            nlist = [x+y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except:
            pass
I guess it would help if I sort the tags at source:
with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')
This creates a new file, test1.txt, with sorted tags. But the problem of trigrams and longer n-grams is still not resolved.
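For groups of any size, one option is to replace the bigram step with itertools.combinations from the standard library and loop over every size. Note that this is a different technique from nltk.bigrams: combinations yields every subset of the tags, not only adjacent pairs. The sketch below assumes single-character flags; the function name group_by_tag_combinations and the max_size parameter are hypothetical. Keys are built from sorted tags, so "SE" and "ES" both land under 'ES'.

```python
from collections import defaultdict
from itertools import combinations

def group_by_tag_combinations(lines, max_size=None):
    """Group words under every combination of their tags, up to max_size tags."""
    groups = defaultdict(list)
    for line in lines:
        word, sep, tags = line.strip().partition('/')
        # Skip lines without a slash or with an empty tag field.
        if not (sep and tags):
            continue
        # Assumption: one character per flag. Sorting gives a canonical order.
        unique = sorted(set(tags))
        limit = len(unique) if max_size is None else min(max_size, len(unique))
        for size in range(1, limit + 1):
            # combinations() keeps the sorted order, so each key is sorted too.
            for combo in combinations(unique, size):
                groups[''.join(combo)].append(word)
    return groups

sample = ["test/S", "boy", "girl/SE", "home/", "house/SE123",
          "man/E", "country", "wind/ES"]
groups = group_by_tag_combinations(sample)
print(groups['ES'])     # ['girl', 'house', 'wind']
print(groups['123ES'])  # ['house']
```

A word with n distinct flags produces 2**n - 1 keys, so passing something like max_size=5 may be worth it for entries with long flag lists.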