For a dataset DF, I would like to add temperatures taken from a secondary dataset TEMP. TEMP does not have all dates; missing dates should be filled with the latest available value. For example, if the temperature is missing for 2019-02-20 but was available for 2019-02-19, that earlier value is carried forward. This can be done with pd.DataFrame(x.asfreq('D')).ffill().reset_index(drop=False).
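As a minimal, self-contained illustration of this forward-fill step (the series s and its dates here are hypothetical, not from the datasets below):

```python
import pandas as pd

# Hypothetical series: temperature known on 2019-02-17 and 2019-02-19
s = pd.Series([2.0, 4.0], index=pd.to_datetime(['2019-02-17', '2019-02-19']))

# asfreq('D') expands to a daily index (inserting NaN for missing dates);
# ffill() then carries the latest available value into the gaps
daily = s.asfreq('D').ffill()
# 2019-02-18 now holds 2.0, copied forward from 2019-02-17
```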
When there are multiple temperature measurements per day, a weighted average should be applied. In the current code, dates are pre-selected using isin to limit the data size.
The code works, but does not scale well, particularly if temperature data from thousands of places and dates need to be filled in and merged into DF. I am looking for a solution that is better in terms of timing and memory, for example based on itertools, apply, a generator expression, or anything else.
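The weighted-average step on its own can be sketched like this (a toy frame with the same column names as TEMP; the sumn helper column mirrors the code below):

```python
import pandas as pd

# Two measurements on the same day, weighted by quantity
t = pd.DataFrame({'date': pd.to_datetime(['2019-01-20', '2019-01-20']),
                  'temp': [9, 10],
                  'quantity': [2, 1]})

# Quantity-weighted mean per day: sum(temp * quantity) / sum(quantity)
t['sumn'] = t['temp'] * t['quantity']
g = t.groupby('date')
wavg = g['sumn'].sum() / g['quantity'].sum()
# (9*2 + 10*1) / 3 = 28/3 ≈ 9.33
```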
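One direction worth considering (a sketch, not a drop-in replacement) is to compute the weighted daily averages in a single vectorized groupby and then use pd.merge_asof, which matches each DF row to the most recent earlier-or-equal temperature per place. This removes the per-place Python loop and avoids materializing a daily row for every intermediate date; it assumes both frames fit in memory and that the data below matches the fictive example further down:

```python
import pandas as pd

DF = pd.DataFrame({'date': pd.to_datetime(['2019-01-01', '2019-01-11', '2019-01-13',
                                           '2019-01-14', '2019-01-22', '2019-02-14']),
                   'place': ['A', 'A', 'A', 'A', 'B', 'C']})
TEMP = pd.DataFrame({'date': pd.to_datetime(['2019-01-10', '2019-01-14', '2019-01-20',
                                             '2019-01-20', '2019-01-22', '2019-01-23',
                                             '2019-01-24', '2019-02-14', '2019-02-14']),
                     'place': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                     'temp': [5, 7, 9, 10, 10, 2, 0, 10, 11],
                     'quantity': [3, 5, 2, 1, 1, 2, 2, 1, 3]})

# 1. Quantity-weighted mean temperature per (place, date), fully vectorized
w = TEMP.assign(sumn=TEMP['temp'] * TEMP['quantity'])
g = w.groupby(['place', 'date'])
daily = (g['sumn'].sum() / g['quantity'].sum()).rename('temp').reset_index()

# 2. For each DF row, pick the most recent daily value at or before its date;
#    merge_asof requires both frames sorted on the 'on' key
out = pd.merge_asof(DF.sort_values('date'),
                    daily.sort_values('date'),
                    on='date', by='place', direction='backward')
```

On the sample data this reproduces the result of the loop-based code below (NaN for A on 2019-01-01, since no earlier measurement exists for that place).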
Below I show a fictive small reproducible example with code.
Modules:
import numpy as np
import pandas as pd
Fictive data:
DF = pd.DataFrame({'date': ['2019-01-01', '2019-01-11', '2019-01-13', '2019-01-14', '2019-01-22', '2019-02-14'], 'place':['A', 'A','A','A','B','C']})
TEMP = pd.DataFrame({'date':['2019-01-10', '2019-01-14', '2019-01-20', '2019-01-20', '2019-01-22', '2019-01-23', '2019-01-24', '2019-02-14', '2019-02-14'], 'place':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'temp': [5, 7, 9, 10, 10, 2, 0, 10, 11], 'quantity': [3,5,2,1,1,2,2,1,3]})
DF['date'] = pd.to_datetime(DF['date'])
TEMP['date'] = pd.to_datetime(TEMP['date'])
The code:
if 'temp' in DF.columns:
    del DF['temp']
else:
    print('No variable temp found in DF')
def filltemps(dat1, dat2):
    """dat1: TEMP
    dat2: DF"""
    global totmp
    places = dat2['place'].unique()
    mx = len(places)
    totmp = pd.DataFrame(columns=['date', 'temp', 'place'])
    for i in range(mx):
        mp = []
        dd1 = []
        nsp = pd.DataFrame(dat1[dat1['place'] == places[i]])
        nsp = nsp[['date', 'quantity', 'temp']]
        prod = lambda w, z: w * z
        nsp['sumn'] = prod(nsp['temp'], nsp['quantity'])
        wavg = lambda y, x: y / x
        c3 = wavg(nsp.groupby('date')['sumn'].agg('sum'),
                  nsp.groupby('date')['quantity'].agg('sum'))
        mp = pd.DataFrame(c3.asfreq('D')).ffill().reset_index(drop=False)
        mp.columns = ['date', 'temp']
        mp['place'] = np.array([places[i]] * len(mp))
        mp['date'] = pd.to_datetime(mp['date'])
        dd1 = dat2.loc[dat2['place'] == places[i], ['date']]
        mp = mp[mp['date'].isin(list(pd.to_datetime(dd1['date'])))]
        totmp = pd.concat([totmp, mp])
    return totmp
Timing with %timeit TEMP2 = filltemps(TEMP, DF) shows 116 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each).
TEMP2 = filltemps(TEMP, DF)
In TEMP2, it can be seen that, as expected, a weighted average was computed for place C.
TEMP2 will be merged with the original DF. Note that there can be missing values if no earlier data were available.
DF = pd.merge(DF, TEMP2, how='left', on=['date', 'place'])
DF, the expected outcome, should look like this:
        date place   temp
0 2019-01-01     A    NaN
1 2019-01-11     A   5.00
2 2019-01-13     A   5.00
3 2019-01-14     A   7.00
4 2019-01-22     B  10.00
5 2019-02-14     C  10.75
Your pointers and help are very much appreciated!
from Filling-in missing data and merge with main set