Saturday, 19 March 2022

Pandas on Spark 3.2 - NLP.pipe - pd.Series.__iter__() is not implemented

I'm currently trying to migrate some processes from Python to (pandas on) Spark to measure performance. Everything went well until this point:

df_info is a pyspark.pandas DataFrame

nlp is defined as: nlp = spacy.load('es_core_news_sm', disable=["tagger", "parser"])

def preprocess_pipe(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

df_info['text_2'] = preprocess_pipe(df_info['text'])

I get this error on the line for doc in nlp.pipe(texts, batch_size=20):

PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Referenced here:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-eb93782f-db7f-4f94-97bf-409a980a51f7/lib/python3.8/site-packages/spacy/language.py in pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
   1570         else:
   1571             # if n_process == 1, no processes are forked.
-> 1572             docs = (self._ensure_doc(text) for text in texts)
   1573             for pipe in pipes:

Any idea how I can solve this?
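As the error message itself hints, a pandas-on-Spark Series cannot be iterated directly, so one likely workaround is to materialize the column first (e.g. with to_numpy() or to_list()) before passing it to nlp.pipe(). A minimal sketch of that pattern, with a hypothetical fake_pipe standing in for spaCy's nlp.pipe (so the example runs without spaCy or Spark):

```python
def fake_pipe(texts, batch_size=20):
    # Stand-in for spacy's nlp.pipe: consumes a plain iterable of strings
    # and yields one processed "doc" per text (here, just lowercased text).
    for text in texts:
        yield text.lower()

def preprocess_pipe(texts):
    # Same loop as in the question: collect one result per document.
    preproc_pipe = []
    for doc in fake_pipe(texts, batch_size=20):
        preproc_pipe.append(doc)
    return preproc_pipe

# With pandas-on-Spark, the call would become something like:
#   df_info['text_2'] = preprocess_pipe(df_info['text'].to_numpy())
# because to_numpy() collects the column into a plain, iterable array.
texts = ["Hola Mundo", "Adiós"]
print(preprocess_pipe(texts))  # ['hola mundo', 'adiós']
```

Note that to_numpy() collects the whole column to the driver, so this sidesteps the error rather than distributing the spaCy work; for large data, a row-wise approach such as Series.apply would keep the computation on the executors.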


