I'm currently trying to migrate some processes from Python to (pandas on) Spark to measure performance. Everything went well until this point:
df_info is of type pyspark.pandas
nlp is defined as: nlp = spacy.load('es_core_news_sm', disable=["tagger", "parser"])
def preprocess_pipe(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

df_info['text_2'] = preprocess_pipe(df_info['text'])
I got this error on the line "for doc in nlp.pipe(texts, batch_size=20):":

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Referenced here:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-eb93782f-db7f-4f94-97bf-409a980a51f7/lib/python3.8/site-packages/spacy/language.py in pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
   1570             else:
   1571                 # if n_process == 1, no processes are forked.
-> 1572                 docs = (self._ensure_doc(text) for text in texts)
   1573             for pipe in pipes:
Any idea of how I can solve this?
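The error happens because spaCy's nlp.pipe() iterates over whatever it is given, and a pandas-on-Spark Series deliberately does not implement __iter__(); the message itself points to the usual workaround, collecting the data first with to_numpy() (at the cost of pulling it to the driver). A minimal sketch of the pattern, using a stand-in class instead of a real pyspark.pandas Series (FakePandasOnSparkSeries and the body of preprocess_pipe below are illustrative, not the real pyspark.pandas or spaCy internals):

```python
class FakePandasOnSparkSeries:
    """Stand-in mimicking pyspark.pandas.Series: direct iteration is
    disallowed, but collecting with to_numpy() works."""
    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        # pandas-on-Spark raises PandasNotImplementedError here
        raise TypeError("pd.Series.__iter__() is not implemented")

    def to_numpy(self):
        # Stand-in for collecting the distributed data to the driver
        return list(self._data)


def preprocess_pipe(texts):
    # nlp.pipe(texts) iterates over `texts`, so it needs a plain iterable;
    # lower() here stands in for the lemmatization step
    return [t.lower() for t in texts]


series = FakePandasOnSparkSeries(["Hola Mundo", "Prueba"])

# preprocess_pipe(series) would fail: it iterates the Series directly.
# Collecting first with to_numpy() gives spaCy a plain iterable:
result = preprocess_pipe(series.to_numpy())
print(result)  # ['hola mundo', 'prueba']
```

Applied to the question, that would mean calling preprocess_pipe(df_info['text'].to_numpy()). For data too large to collect to the driver, a distributed alternative such as Series.apply (or a pandas UDF on the underlying Spark DataFrame) would keep the spaCy work on the executors, though loading the model per executor has its own overhead.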
from Pandas on Spark 3.2 - NLP.pipe - pd.Series.__iter__() is not implemented