I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll defined below:
import datetime
import pandas as pd
demoA = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['A', 'A', 'A'],
                      'x1': [10, 20, 30],
                      'close': [120, 133, 129]}).set_index('date', drop=True)
demoB = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['B', 'B', 'B'],
                      'x1': [18, 11, 45],
                      'close': [50, 49, 51]}).set_index('date', drop=True)
demoAll = pd.concat([demoA, demoB])
print(demoAll)
The result is:
           ticker  x1  close
date
2010-01-01      A  10    120
2010-01-02      A  20    133
2010-01-03      A  30    129
2010-01-01      B  18     50
2010-01-02      B  11     49
2010-01-03      B  45     51
I also have a dictionary mapping tickers to model objects, ticker2model = {'A': model_A, 'B': model_B, ...}, where each model has a predict(df) method that takes in an entire dataframe and returns a series of the same length.
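For concreteness, a stand-in model might look like the following. This is only a minimal sketch; DummyModel is hypothetical, standing in for whatever fitted objects model_A and model_B actually are:

class DummyModel:
    # Hypothetical stand-in for model_A, model_B, etc.
    def __init__(self, coef):
        self.coef = coef

    def predict(self, df):
        # One prediction per row, returned as a series of the same length
        return df['x1'] * self.coef

ticker2model = {'A': DummyModel(0.5), 'B': DummyModel(2.0)}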
I now would like to create a new column, demoAll['predictions'], that corresponds to these predictions. What is the cleanest/most-efficient way of doing this? A few things to note:
1.) demoAll was the concatenation of ticker-specific dataframes that were each indexed just by date. Thus the indices of demoAll are not unique. (However, the combination of date/ticker IS unique.)
2.) My thinking has been to do something like the example below, but I keep running into issues with indexing, data-type coercions, and slow run times. The real dataset is quite large (both rows and columns). One possible workaround is sketched after the snippet.
# Problematic: the result of apply() comes back indexed by group (ticker)
# rather than aligned with demoAll's non-unique date index
demoAll['predictions'] = demoAll.groupby('ticker').apply(
    lambda x: ticker2model[x.name].predict(x)
)
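One way to sidestep the alignment problem (a sketch only, assuming the DummyModel setup above) is to call each model once per group and write the predictions back through a boolean mask as a raw array, so pandas does no index alignment:

import numpy as np

demoAll['predictions'] = np.nan
for ticker, grp in demoAll.groupby('ticker'):
    preds = ticker2model[ticker].predict(grp)
    # np.asarray strips the index, so the assignment is positional;
    # groupby preserves within-group row order, so positions match
    demoAll.loc[demoAll['ticker'] == ticker, 'predictions'] = np.asarray(preds)

Alternatively, making the index unique first, e.g. demoAll.set_index('ticker', append=True), would give a (date, ticker) MultiIndex and let an index-aligned assignment work directly.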
From: Applying group-specific function that returns a single series