Thursday, 8 November 2018

group and classify words as well as characters

I need to split each line on the slash and then report the tags. The file is in Hunspell dictionary format. I looked for a class on GitHub that does this, but could not find one.

# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES

The code:

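from collections import defaultdict

myl = defaultdict(list)

with open('test.txt') as f:
    for line in f:
        parts = line.rstrip().split('/')
        if len(parts) < 2:
            continue  # no slash, so no tags on this line
        word, tags = parts[0], parts[1]
        myl[tags].append(word)      # group under the full tag string
        for t in tags:
            myl[t].append(word)     # and under each single-character tag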

output:

defaultdict(list,
            {'S': ['test', 'test', 'girl', 'house', 'wind'],
             'SE': ['girl'],
             'E': ['girl', 'house', 'man', 'man', 'wind'],
             '': ['home'],
             'SE123': ['house'],
             '1': ['house'],
             '2': ['house'],
             '3': ['house'],
             'ES': ['wind']})

The SE group should have three words: 'girl', 'wind' and 'house'. There should be no ES group, because ES carries the same tags as SE and its words belong in the SE group. SE123 should remain as is. How do I achieve this?
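One way to read the requirement above is: treat each tag string as an unordered set of single-character flags, label every distinct set by the first spelling seen in the file, and put a word into a group whenever its flags include all the flags of that group. The following is a minimal sketch under those assumptions. The function name `group_by_tag_subsets` is made up for illustration, and the single-character reading of flags would not hold for a dictionary that declares long or numeric flags.

```python
from collections import defaultdict

def group_by_tag_subsets(lines):
    """Group words under every tag string seen in the input whose flags they carry."""
    entries = []   # (word, frozenset of flags) for every tagged line
    labels = {}    # frozenset of flags -> first spelling seen, e.g. "SE"
    for line in lines:
        word, sep, tags = line.strip().partition('/')
        if not sep or not tags:
            continue  # skip untagged lines such as "boy" and "home/"
        flags = frozenset(tags)
        entries.append((word, flags))
        labels.setdefault(flags, tags)  # "ES" reuses the label "SE"

    groups = defaultdict(list)
    for flags, label in labels.items():
        for word, word_flags in entries:
            if flags <= word_flags:  # the word carries every flag of the group
                groups[label].append(word)
    return dict(groups)

sample = ["test/S", "boy", "girl/SE", "home/", "house/SE123",
          "man/E", "country", "wind/ES"]
groups = group_by_tag_subsets(sample)
print(groups["SE"])     # ['girl', 'house', 'wind']
print(groups["SE123"])  # ['house']
```

With the sample file, this yields the groups S, SE, SE123 and E, and no separate ES group. The nested loop compares every group with every word, which is fine for a sketch but may be slow on a large dictionary.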


Update:

I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?

from collections import defaultdict
import nltk

myl = defaultdict(list)

with open('hi_IN.dic') as f:
    for line in f:
        parts = line.rstrip().split('/')
        if len(parts) < 2:
            continue  # no slash, so no tags on this line
        word, tags = parts[0], parts[1]
        myl[''.join(sorted(tags))].append(word)   # full tag string, sorted
        for t in tags:
            myl[t].append(word)                    # each single tag
        for x, y in nltk.bigrams(tags):            # adjacent tag pairs
            myl[''.join(sorted(x + y))].append(word)
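The bigram step can be generalized to any length by slicing the tag string, which avoids a separate code path per n. The sketch below is one possible approach, not a tested solution for real dictionaries. The names `contiguous_ngrams`, `group_by_ngrams` and `max_n` are invented for illustration. NLTK offers `nltk.ngrams(sequence, n)` for the same purpose, but plain slicing keeps the example free of dependencies.

```python
from collections import defaultdict

def contiguous_ngrams(tags, n):
    """Return every run of n adjacent characters in tags."""
    return [tags[i:i + n] for i in range(len(tags) - n + 1)]

def group_by_ngrams(lines, max_n=5):
    """Group words under each sorted n-gram of their tags, for n = 1..max_n."""
    groups = defaultdict(list)
    for line in lines:
        word, sep, tags = line.strip().partition('/')
        if not sep or not tags:
            continue  # skip lines without tags
        for n in range(1, min(max_n, len(tags)) + 1):
            for gram in contiguous_ngrams(tags, n):
                # sort the characters so "SE" and "ES" share one key
                groups[''.join(sorted(gram))].append(word)
    return dict(groups)

groups = group_by_ngrams(["girl/SE", "house/SE123", "wind/ES"])
print(contiguous_ngrams("SE123", 3))  # ['SE1', 'E12', '123']
print(groups["ES"])                   # ['girl', 'house', 'wind']
```

Because only adjacent tags are combined, a pair such as S and 1 in SE123 never forms a group. Sorting uses character order, so digits sort before uppercase letters and the merged key is spelled ES rather than SE.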


I guess it would help to sort the tags at the source:

with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l  # no slash: copy the line unchanged
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')

This creates a new file, test1.txt, with sorted tags, but the problem of trigrams and longer n-grams is still not resolved.
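Once the tags are sorted, every tag combination of a given size can be enumerated with `itertools.combinations` from the standard library, which covers pairs, triples and larger groups in one loop and also catches tags that are not adjacent. This is a rough sketch that assumes single-character flags. The names `group_by_flag_combinations` and `max_size` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def group_by_flag_combinations(lines, max_size=5):
    """Group words under every combination of up to max_size of their flags."""
    groups = defaultdict(list)
    for line in lines:
        word, sep, tags = line.strip().partition('/')
        if not sep or not tags:
            continue  # skip lines without tags
        flags = sorted(set(tags))  # canonical order, duplicates dropped
        for size in range(1, min(max_size, len(flags)) + 1):
            for combo in combinations(flags, size):
                groups[''.join(combo)].append(word)
    return dict(groups)

groups = group_by_flag_combinations(
    ["girl/ES", "house/123ES", "wind/ES", "man/E"])
print(groups["ES"])  # ['girl', 'house', 'wind']
print(groups["1S"])  # ['house']
```

The number of combinations grows quickly with the number of flags on a word, so `max_size` is worth keeping small for dictionaries with long tag strings.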


