Monday, 9 September 2019

How to efficiently use CountVectorizer to get ngram counts for all files in a directory combined?

I have around 10k .bytes files in my directory, and I want to use CountVectorizer to get n-gram counts (i.e., fit on the train set and transform the test set). Of those 10k files, 8k are train and 2k are test.

files = 
['bfiles/GhHS0zL9cgNXFK6j1dIJ.bytes',
 'bfiles/8qCPkhNr1KJaGtZ35pBc.bytes',
 'bfiles/bLGq2tnA8CuxsF4Py9RO.bytes',
 'bfiles/C0uidNjwV8lrPgzt1JSG.bytes',
 'bfiles/IHiArX1xcBZgv69o4s0a.bytes',
    ...............................
    ...............................]

print(open(files[0]).read())
    'A4 AC 4A 00 AC 4F 00 00 51 EC 48 00 57 7F 45 00 2D 4B 42 45 E9 77 51 4D 89 1D 19 40 30 01 89 45 E7 D9 F6 47 E7 59 75 49 1F ....'
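A minimal sketch of the 8k/2k split mentioned above. The question doesn't say how files are assigned to train and test, so the random split (and the glob pattern) below are assumptions for illustration only:

import glob
from sklearn.model_selection import train_test_split

# Collect all ~10k .bytes files, then hold out 2k of them as the test set.
files = sorted(glob.glob('bfiles/*.bytes'))
train_files, test_files = train_test_split(files, test_size=2000, random_state=0)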

I can't do something like the following and pass everything to CountVectorizer:

file_content = []
for file in files:
    # Read every file's contents into one big in-memory list.
    file_content.append(open(file).read())

I can't append each file's text to one big list and then pass it to CountVectorizer, because the combined size of all the files exceeds 150 GB. I don't have the resources to do that, since CountVectorizer uses a huge amount of memory.

I need a more efficient way of solving this. Is there some other way I can achieve what I want without loading everything into memory at once? Any help is much appreciated.
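One direction that might help (a sketch, not a definitive answer): scikit-learn's vectorizers accept input='filename', in which case fit/transform treat each list element as a path and read the file on demand, so only one document's raw text is held in memory at a time. The fitted 1–4-gram vocabulary still has to fit in RAM, which may or may not be feasible for this data.

from sklearn.feature_extraction.text import CountVectorizer

# With input='filename', fit_transform/transform read each file lazily
# instead of requiring all contents up front in a Python list.
cv = CountVectorizer(input='filename', ngram_range=(1, 4))

# train_files / test_files are the 8k/2k filename lists from the split sketched earlier.
X_train = cv.fit_transform(train_files)   # learn the vocabulary from the train files
X_test = cv.transform(test_files)         # reuse the same vocabulary on the test files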

All I could manage was to read a single file and run CountVectorizer on it, but I don't know how to extend that to the counts for all files combined.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 4))
# Fit and transform on the contents of a single file only.
temp = cv.fit_transform([open(files[0]).read()])
temp
<1x451500 sparse matrix of type '<class 'numpy.int64'>'
    with 335961 stored elements in Compressed Sparse Row format>
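If even the streamed approach runs out of memory because the 1–4-gram vocabulary itself becomes too large, a commonly suggested fallback (a different technique from CountVectorizer, swapped in here deliberately) is HashingVectorizer, which hashes n-grams into a fixed number of columns instead of storing a vocabulary. A hedged sketch:

from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless: features are hashed into a fixed-size
# space (2**20 columns here), which bounds memory but means columns can't
# be mapped back to specific n-grams.
hv = HashingVectorizer(input='filename', ngram_range=(1, 4),
                       n_features=2**20, alternate_sign=False, norm=None)

X_train = hv.transform(train_files)   # no fitting needed
X_test = hv.transform(test_files)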



from How to efficiently use CountVectorizer to get ngram counts for all files in a directory combined?
