I have a 900M-row dataset that I'd like to run some machine learning algorithms on using pyspark/mllib, and I'm struggling a bit with how to transform it into the correct format.
The data is a bit odd, in that multiple rows and columns belong to the same variables.
import pandas as pd
df = pd.DataFrame([['Foo','A','Green','Red','Blue'],['Foo','B','Red','Orange','Blue'],['Bar','A','Orange','Yellow','Green']],
columns = ['Type','Letter','Color1','Color2','Color3'])
df
Type Letter Color1 Color2 Color3
0 Foo A Green Red Blue
1 Foo B Red Orange Blue
2 Bar A Orange Yellow Green
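For reference, here is the same toy data as a Spark DataFrame (just a sketch: in reality the 900M rows are loaded from storage, and spark is a SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same toy data as the pandas example above
sdf = spark.createDataFrame(
    [('Foo', 'A', 'Green', 'Red', 'Blue'),
     ('Foo', 'B', 'Red', 'Orange', 'Blue'),
     ('Bar', 'A', 'Orange', 'Yellow', 'Green')],
    ['Type', 'Letter', 'Color1', 'Color2', 'Color3'])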
I need to denormalize the frequency counts of the two variables (Letter and Color in my example) so I can vectorize appropriately.
Here are the transformations I would effectively do in pandas, and the desired output:
mapDict = {'Letter':'Letter','Color1':'Color', 'Color2':'Color','Color3':'Color'}
df_out = pd.melt(df,id_vars = ['Type'],value_vars = ['Letter','Color1','Color2','Color3'])
df_out['variable'] = df_out['variable'].map(mapDict)
df_out = df_out.groupby(['variable','value','Type']).size().unstack().transpose().fillna(0)
df_out.columns = df_out.columns.droplevel(0)
print(df_out)
value Blue Green Orange Red Yellow A B
Type
Bar 0 1 1 0 1 1 0
Foo 2 1 1 2 0 1 1
Right now I have a custom rdd.flatMap function followed by a groupBy().pivot(), but this is slow and inefficient, and it doesn't account for there being two rows with the same Type.
Does anyone have suggestions for how I could write an equivalent of the pandas code above to denormalize value counts by Type in pyspark? Any guidance/insight would be appreciated!
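To show the direction I've been thinking about, here is a rough, untested sketch of a DataFrame-based equivalent on the toy data (sdf is the Spark version of the example DataFrame above, and the column names are just from the example):
from pyspark.sql import functions as F

# "Melt" the Letter/Color* columns into (Type, value) pairs. The desired
# counts don't care which of Color1..Color3 a value came from, so the
# original column name can be dropped entirely.
value_cols = ['Letter', 'Color1', 'Color2', 'Color3']
long_sdf = sdf.select(
    'Type',
    F.explode(F.array(*value_cols)).alias('value'))

# Pivot the distinct values into columns and count occurrences per Type.
# (At 900M rows it's presumably better to pass the list of distinct values
# to pivot() explicitly so Spark can skip the extra scan to find them.)
wide_sdf = (long_sdf
            .groupBy('Type')
            .pivot('value')
            .count()
            .na.fill(0))
wide_sdf.show()
On the toy data this seems to give one row per Type with columns A, B, Blue, Green, Orange, Red, Yellow, matching the pandas output above, but I don't know whether it's the right approach at 900M rows.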