I am trying to create a document-term matrix using CountVectorizer to extract bigrams and trigrams from a corpus.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lemmatized = dat_clean['lemmatized']
# Count bigrams and trigrams, keeping the original casing
c_vec = CountVectorizer(ngram_range=(2, 3), lowercase=False)
ngrams = c_vec.fit_transform(lemmatized)
# Densify the sparse matrix and total each n-gram over all documents
count_values = ngrams.toarray().sum(axis=0)
# vocabulary_ maps each n-gram to its column index
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(
    sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)
).rename(columns={0: 'frequency', 1: 'bigram/trigram'})
I keep getting the following error:
MemoryError: Unable to allocate 7.89 TiB for an array with shape (84891, 12780210) and data type int64
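If I'm doing the math right, a dense array of that shape would need 84,891 × 12,780,210 × 8 bytes ≈ 7.9 TiB, so I can see why the toarray() call fails.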
While I have some experience with Python, I am pretty new to dealing with text data, and I was wondering if there is a more memory-efficient way to do this.
I'm not sure if it is helpful to know, but the ngrams object is a scipy.sparse._csr.csr_matrix.
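Would something along these lines avoid the dense allocation? This is just a sketch of what I have in mind; it assumes scikit-learn >= 1.0, where get_feature_names_out() returns the terms in the same column order as the matrix:

import numpy as np
import pandas as pd

# Sum column-wise on the sparse matrix itself; no dense copy is created
count_values = np.asarray(ngrams.sum(axis=0)).ravel()
# Terms come back in the same column order as count_values
terms = c_vec.get_feature_names_out()

df_ngram = (pd.DataFrame({'frequency': count_values, 'bigram/trigram': terms})
            .sort_values('frequency', ascending=False)
            .reset_index(drop=True))

I also noticed that CountVectorizer takes a min_df parameter; if I understand it correctly, setting min_df=2 would drop n-grams that appear in only one document, which should shrink the ~12.8M-term vocabulary considerably.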