Thursday 27 October 2022

Equivalent of pandas merge_asof when joining spark dataframes, with merge nearest and tolerance

I'm trying to replicate the behavior of pandas' merge_asof when joining Spark dataframes.

Let's say I have two dataframes, df1 and df2:

import pandas as pd

df1 = pd.DataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2 = pd.DataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

# use merge_asof to merge df1 and df2, matching each df1 row to the nearest
# df2 timestamp within a tolerance just under df2's mean timestamp step
merge_df = pd.merge_asof(df1, df2, on='timestamp', direction='nearest', tolerance=df2.timestamp.diff().mean() - 1e-6)

The resulting merge_df would be:

timestamp    a   b
      0.0    0   0
      0.5    2   2
      1.0    4   3
      1.5    6   5
      2.0    8   6
      ...  ...  ..
     30.5  122  92
     31.0  124  94
     31.5  126  95
     32.0  128  97
     32.5  130  98

Now, given similar dataframes in Spark:

df1_spark = spark.createDataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2_spark = spark.createDataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

How can I join the two Spark dataframes to produce a result similar to pandas, with configurable direction and tolerance? Thanks.

[Edit]
As suggested in similar posts, applying a function over a Window can reproduce the behavior of the direction parameter. However, I still don't know how to apply a function that finds the nearest row (the way direction='nearest' behaves) and restricts matches to a certain range (tolerance). The sketch below is the direction I'm currently exploring.
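For reference, here is that sketch: join on an absolute-difference condition bounded by the tolerance, then use a Window to keep only the closest candidate per left-hand row. This is only a sketch, not a confirmed solution; the tolerance is hard-coded here (0.33, roughly df2's mean timestamp step) rather than derived as in the pandas call, and I haven't checked that ties break exactly the way merge_asof does.

from pyspark.sql import Window
from pyspark.sql import functions as F

tolerance = 0.33  # assumed; stands in for df2.timestamp.diff().mean() from the pandas call

# Candidate pairs: every df2 row within `tolerance` of a df1 row.
# The left join keeps df1 rows with no candidate (b comes out null).
joined = (
    df1_spark.alias("l")
    .join(
        df2_spark.alias("r"),
        F.abs(F.col("l.timestamp") - F.col("r.timestamp")) <= tolerance,
        "left",
    )
    .withColumn("dist", F.abs(F.col("l.timestamp") - F.col("r.timestamp")))
)

# Keep the single closest candidate per df1 timestamp (direction='nearest').
# For 'backward' or 'forward', the join condition would instead bound
# l.timestamp - r.timestamp (or its negation) between 0 and tolerance.
w = Window.partitionBy(F.col("l.timestamp")).orderBy(F.col("dist").asc())
merged = (
    joined.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select(
        F.col("l.timestamp").alias("timestamp"),
        F.col("l.a").alias("a"),
        F.col("r.b").alias("b"),
    )
    .orderBy("timestamp")
)

On the toy data above this reproduces the pandas result as far as I can tell, but the absolute-difference condition is effectively a range join, so it may need tuning (e.g. bucketing the timestamps) on large data.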



