I have a 900M-row dataset that I'd like to run some machine learning algorithms on using pyspark/mllib, and I'm struggling a bit with how to transform it into the correct format.
The data is a bit odd, in that multiple rows and columns belong to the same variables.
import pandas as pd
df = pd.DataFrame([['Foo','A','Green','Red','Blue'],['Foo','B','Red','Orange','Blue'],['Bar','A','Orange','Yellow','Green']],
columns = ['Type','Letter','Color1','Color2','Color3'])
df
Type Letter Color1 Color2 Color3
0 Foo A Green Red Blue
1 Foo B Red Orange Blue
2 Bar A Orange Yellow Green
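For reference, here is the same toy data as a Spark DataFrame (just a sketch: in reality the 900M rows are loaded from storage, and spark is a SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same toy data as the pandas example above
sdf = spark.createDataFrame(
    [('Foo', 'A', 'Green', 'Red', 'Blue'),
     ('Foo', 'B', 'Red', 'Orange', 'Blue'),
     ('Bar', 'A', 'Orange', 'Yellow', 'Green')],
    ['Type', 'Letter', 'Color1', 'Color2', 'Color3'])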
I need to denormalize the frequency counts of the two variables (Letter and Color in my example) so I can vectorize appropriately.
Here are the transformations I would effectively do in pandas, and the desired output:
mapDict = {'Letter':'Letter','Color1':'Color', 'Color2':'Color','Color3':'Color'}
df_out = pd.melt(df,id_vars = ['Type'],value_vars = ['Letter','Color1','Color2','Color3'])
df_out['variable'] = df_out['variable'].map(mapDict)
df_out = df_out.groupby(['variable','value','Type']).size().unstack().transpose().fillna(0)
df_out.columns = df_out.columns.droplevel(0)
print(df_out)
value Blue Green Orange Red Yellow A B
Type
Bar 0 1 1 0 1 1 0
Foo 2 1 1 2 0 1 1
Right now I have a custom rdd.flatMap function followed by a groupBy().pivot(), but this is slow and inefficient, and it doesn't account for there being two rows with the same Type.
Does anyone have suggestions for how I could write an equivalent of the pandas code above to denormalize value counts by Type in pyspark? Any guidance/insight would be appreciated!
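To show the direction I've been thinking about, here is a rough, untested sketch of a DataFrame-based equivalent on the toy data (sdf is the Spark version of the example DataFrame above, and the column names are just from the example):
from pyspark.sql import functions as F

# "Melt" the Letter/Color* columns into (Type, value) pairs. The desired
# counts don't care which of Color1..Color3 a value came from, so the
# original column name can be dropped entirely.
value_cols = ['Letter', 'Color1', 'Color2', 'Color3']
long_sdf = sdf.select(
    'Type',
    F.explode(F.array(*value_cols)).alias('value'))

# Pivot the distinct values into columns and count occurrences per Type.
# (At 900M rows it's presumably better to pass the list of distinct values
# to pivot() explicitly so Spark can skip the extra scan to find them.)
wide_sdf = (long_sdf
            .groupBy('Type')
            .pivot('value')
            .count()
            .na.fill(0))
wide_sdf.show()
On the toy data this seems to give one row per Type with columns A, B, Blue, Green, Orange, Red, Yellow, matching the pandas output above, but I don't know whether it's the right approach at 900M rows.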