Hemant Vishwakarma: Calculating Moving Average on bi-temporal dataset

Wednesday, 27 January 2021

Calculating Moving Average on bi-temporal dataset

I'm trying to calculate moving average on a bitemporal dataset. The dataset consists of a data date and an effective date(the date at which the data became available). The data for this date could be restated several times in future(same data date but a different effective date). I need to calculate a moving average for the past 4 quarters using the data that is valid for the effective date of the row being calculated.

The dataset looks like this

id	datadate	effdate	value
1	2005-03-31	2005-04-15	10
1	2005-03-31	2005-05-30	11
1	2005-06-30	2005-07-15	9
1	2005-06-30	2005-08-20	9.5
1	2005-06-30	2005-10-15	9.6
1	2005-09-30	2005-10-15	10.5
1	2005-09-30	2005-11-10	11
1	2005-12-31	2006-02-13	12

The result should be

id	datadate	effdate	Value	MAvg
1	2005-03-31	2005-04-15	10	10
1	2005-03-31	2005-05-30	11	11
1	2005-06-30	2005-07-15	9	10
1	2005-06-30	2005-08-20	9.5	10.25
1	2005-06-30	2005-10-15	9.6	10.30
1	2005-09-30	2005-10-15	10.5	10.37
1	2005-09-30	2005-11-10	11	10.53
1	2005-12-31	2006-02-13	12	10.90

I'm doing this in python using pandas. The way I'm doing this is by joining the dataframe with itself on id and previous 4 quarters and calculating new effdates for all periods based on the effdates of the past 4 quarters, then I join once again with id, datadate and effdate and calculate the average.

keys["id"]
calc_df = df1.merge(df2, on=keys, how='left')
calc_df = calc_df.loc[
            (calc_df["datadate_x"] >= calc_df["datadate_y"])
            & (calc_df["datadate_y"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
            & (calc_df["effdate_x"] <= calc_df["thrudate_y"])
            & (calc_df["thrudate_x"] >= calc_df["effdate_y"])
        ]
calc_df = calc_df.drop_duplicates().reset_index(drop=True)
grp_keys = keys + ["datadate_x"]
calc_df["effdate"] = calc_df[["effdate_x", "effdate_y"]].max(axis=1)
calc_df = calc_df.sort_values(grp_keys + ["effdate"]).drop_duplicates(
            subset=grp_keys + ["effdate"], keep="first"
        )
calc_df = calc_df['id', 'datadate_x', 'effdate', 'value']

calc_df = calc_df.merge(df1, on=["id"], how="left")
calc_df = calc_df.loc[
            (calc_df["datadate_x"] >= calc_df["datadate"])
            & (
                calc_df["datadate"]
                >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9)
            )
            & (calc_df["effdate_x"] <= calc_df["thrudate_y"])
            & (calc_df["thrudate_x"] >= calc_df["effdate_y"])
        ]

        
calc_df["MAvg"] = calc_df.groupby(["id", "datadate_x", "effdate_x"])["value"].transform(
            lambda s: s.mean(skipna=False)
        )

This works, but wanted to see if there is a better way of doing this.

from Calculating Moving Average on bi-temporal dataset

Hemant Vishwakarma

Wednesday, 27 January 2021

Calculating Moving Average on bi-temporal dataset

No comments:

Post a Comment