The Situation
I'm classifying the rows in a DataFrame using a certain classifier based on the values in a particular column. My goal is to append the results to one of two new columns, depending on certain conditions. The code as it stands looks something like this:
df = pd.DataFrame({'A': [list with classifier ids],  # Only 3 ids, one-word strings
                   'B': [List of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [List of the old classes]})  # Hundreds of possible classes, four-digit integers stored as strings
df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)
    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)
    for pred, score, old_class in zip(preds, scores, group.C.values):
        if old_class == pred:
            # Prediction agrees with the old class: keep the old class in E
            new_col1.append(np.nan)
            new_col2.append(old_class)
        else:
            # Disagreement: store the five highest-scoring classes in D
            new_col1.append(str(classifier.classes_[score.argsort()[-5:]]))
            new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2
The Issue
I am concerned that groupby will not iterate in a top-down, order-of-appearance manner as I expect. Iteration order when sort=False is not covered in the docs.
My Expectations
All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down, order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.
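One possible approach that would sidestep the ordering question altogether is sketched below. It is only an illustration: the tiny DataFrame, the `fake_predict` helper, and the placeholder "top-5" strings are all made up and stand in for the real data and classifier. The idea is to build each group's results as a Series that keeps the group's own index, then let pandas align on that index at assignment time, so the order in which groups are visited no longer matters. It assumes the DataFrame's index is unique.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real DataFrame (text column 'B' omitted).
df = pd.DataFrame({'A': ['x', 'y', 'x', 'z', 'y'],
                   'C': ['1', '2', '3', '4', '5']})


def fake_predict(group):
    """Hypothetical stand-in for classifier.predict: agrees with the old
    class when it is an even number, otherwise predicts '0000'."""
    return np.array([c if int(c) % 2 == 0 else '0000' for c in group['C']],
                    dtype=object)


parts_d, parts_e = [], []
for name, group in df.groupby('A', sort=False):
    preds = fake_predict(group)
    matched = preds == group['C'].to_numpy()  # boolean array, one entry per row

    # Both Series carry group.index, so each value stays tied to its row label.
    e = group['C'].where(matched)  # old class where prediction agrees, else NaN
    d = pd.Series('top-5 for ' + name, index=group.index,
                  dtype=object).where(~matched)  # placeholder where it disagrees
    parts_d.append(d)
    parts_e.append(e)

# Column assignment aligns on index labels, not on position.
df['D'] = pd.concat(parts_d)
df['E'] = pd.concat(parts_e)
print(df)
```

With this pattern the pre-sort on 'A' would not be needed for correctness either, since no step depends on row position.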
Here is the code I used to test my theory on sort=False iteration order:
from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers
df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})
print(df.A.unique())  # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
    print(name)
Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
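A somewhat stronger empirical check, still short of proof, would be to repeat the comparison over many random frames and count disagreements rather than eyeballing printed output. The sketch below assumes a seeded NumPy generator and the same three-letter key column as above. It also checks that rows inside each group keep their original relative order. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable

group_order_mismatches = 0
row_order_mismatches = 0

for _ in range(200):
    df = pd.DataFrame({'A': rng.choice(list('abc'), size=100),
                       'B': rng.integers(10, size=100)})

    groups = list(df.groupby('A', sort=False))

    # Group keys in iteration order vs. keys in order of first appearance.
    iterated = [name for name, _ in groups]
    if iterated != list(df['A'].unique()):
        group_order_mismatches += 1

    # Within each group, the original index should still be increasing.
    if not all(g.index.is_monotonic_increasing for _, g in groups):
        row_order_mismatches += 1

print(group_order_mismatches, row_order_mismatches)
```

A nonzero count on either line would be a concrete counterexample. Zero counts only show that no counterexample turned up in these trials.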
from Iteration order with pandas groupby on a pre-sorted DataFrame