Sunday, 24 February 2019

Filling-in missing data and merge with main set

For a dataset DF, I would like to add temperatures taken from a secondary dataset TEMP. TEMP does not have all dates. Missing dates need to be interpolated in such a way that the latest available value is filled-in for missing values. For ex. if temperature is missing for 2019-2-20, but it was available for 2019-2-19, it will fill-in this value. This can be done by using pd.DataFrame(x.asfreq('D')).ffill().reset_index(drop=False).

When there are multiple temperature measures per day, a weighted average should be applied. In the current code, dates are pre-selected using isin to limit datasize.

The code works, but is not optimal if data become big, paticularly if temperature data from 1000s of places and dates need to be filled-in and merged to DF. I am searching for a better solution regarding timing / memory, for ex. based on itertools, apply, generator expression, or anything else.

Below I show a fictive small reproducible example with code.

Modules:

import numpy as np 
import pandas as pd

Fictive data:

DF = pd.DataFrame({'date': ['2019-01-01', '2019-01-11', '2019-01-13', '2019-01-14', '2019-01-22', '2019-02-14'], 'place':['A', 'A','A','A','B','C']})
TEMP = pd.DataFrame({'date':['2019-01-10', '2019-01-14', '2019-01-20', '2019-01-20', '2019-01-22', '2019-01-23', '2019-01-24', '2019-02-14', '2019-02-14'], 'place':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'temp': [5, 7, 9, 10, 10, 2, 0, 10, 11], 'quantity': [3,5,2,1,1,2,2,1,3]})
DF['date'] = pd.to_datetime(DF['date'])
TEMP['date'] = pd.to_datetime(TEMP['date'])

The code:

if 'temp' in DF.columns: 
    del DF['temp']
else: print('No variable temp found in DF')


def filltemps(dat1, dat2):
    """dat1: TEMP
    dat2: DF"""

    global totmp
    places = dat2['place'].unique()
    mx=len(places)
    totmp = pd.DataFrame(columns=['date', 'temp', 'place'])

    for i in range(mx):
        mp=[]
        dd1=[]
        nsp = pd.DataFrame(dat1[ (dat1['place']==places[i]) ])
        nsp = nsp[['date', 'quantity', 'temp']]
        prod = lambda w,z: w*z
        nsp['sumn'] = prod(nsp['temp'], nsp['quantity'])
        wavg = lambda y,x: y/x
        c3 = wavg(nsp.groupby('date')['sumn'].agg('sum'), nsp.groupby('date')['quantity'].agg('sum'))
        mp = pd.DataFrame(c3.asfreq('D')).ffill().reset_index(drop=False)
        mp.columns = ['date', 'temp']
        mp['place'] = np.array([places[i]] * len(mp))
        mp['date'] = pd.to_datetime(mp['date'])
        dd1 = dat2.loc[dat2['place']==places[i], ['date']]
        mp = mp[ mp['date'].isin(list(pd.to_datetime(dd1['date']))) ]
        totmp = pd.concat([totmp, mp])
    return totmp

Timing by %timeit TEMP2 = filltemps(TEMP, DF) shows 116 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In TEMP2, it can be seen that as expected a weighted average was computed for place C:

TEMP2 = filltemps(TEMP, DF)

TEMP2 will be merged with the original DF. Note that there can be missing values if no earlier data were available.

DF = pd.merge(DF, TEMP2, how='left', on=['date', 'place'])

DF, the expected outcome, should look like this:

enter image description here

Your pointers and help are very much appreciated!



from Filling-in missing data and merge with main set

No comments:

Post a Comment