Thursday 27 October 2022

Equivalent of pandas merge_asof when joining spark dataframes, with merge nearest and tolerance

I'm trying to replicate the behavior of pandas' merge_asof when joining Spark dataframes.

Let's say I have two dataframes, df1 and df2:

import pandas as pd

df1 = pd.DataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2 = pd.DataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

# use merge_asof to merge df1 and df2, matching each df1 row to the nearest
# df2 timestamp within a tolerance just under df2's mean timestamp step
merge_df = pd.merge_asof(df1, df2, on='timestamp', direction='nearest', tolerance=df2.timestamp.diff().mean() - 1e-6)

The resulting merge_df would be:

timestamp    a   b
      0.0    0   0
      0.5    2   2
      1.0    4   3
      1.5    6   5
      2.0    8   6
      ...  ...  ..
     30.5  122  92
     31.0  124  94
     31.5  126  95
     32.0  128  97
     32.5  130  98

Now, given similar dataframes in Spark:

df1_spark = spark.createDataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2_spark = spark.createDataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

How can I join the two Spark dataframes to produce a result similar to pandas, with configurable direction and tolerance? Thanks.

[Edit]
As suggested in similar posts, applying a function over a Window can reproduce the behavior of the direction parameter. However, I still don't know how to apply a function that finds the nearest row (the way direction='nearest' behaves) and restricts matches to a certain range (tolerance). The sketch below is the direction I'm currently exploring.
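For reference, here is that sketch: join on an absolute-difference condition bounded by the tolerance, then use a Window to keep only the closest candidate per left-hand row. This is only a sketch, not a confirmed solution; the tolerance is hard-coded here (0.33, roughly df2's mean timestamp step) rather than derived as in the pandas call, and I haven't checked that ties break exactly the way merge_asof does.

from pyspark.sql import Window
from pyspark.sql import functions as F

tolerance = 0.33  # assumed; stands in for df2.timestamp.diff().mean() from the pandas call

# Candidate pairs: every df2 row within `tolerance` of a df1 row.
# The left join keeps df1 rows with no candidate (b comes out null).
joined = (
    df1_spark.alias("l")
    .join(
        df2_spark.alias("r"),
        F.abs(F.col("l.timestamp") - F.col("r.timestamp")) <= tolerance,
        "left",
    )
    .withColumn("dist", F.abs(F.col("l.timestamp") - F.col("r.timestamp")))
)

# Keep the single closest candidate per df1 timestamp (direction='nearest').
# For 'backward' or 'forward', the join condition would instead bound
# l.timestamp - r.timestamp (or its negation) between 0 and tolerance.
w = Window.partitionBy(F.col("l.timestamp")).orderBy(F.col("dist").asc())
merged = (
    joined.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select(
        F.col("l.timestamp").alias("timestamp"),
        F.col("l.a").alias("a"),
        F.col("r.b").alias("b"),
    )
    .orderBy("timestamp")
)

On the toy data above this reproduces the pandas result as far as I can tell, but the absolute-difference condition is effectively a range join, so it may need tuning (e.g. bucketing the timestamps) on large data.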



