I'm trying to replicate pandas' merge_asof behavior when joining Spark dataframes.
Say I have two dataframes, df1 and df2:
import pandas as pd

df1 = pd.DataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2 = pd.DataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

# Use merge_asof to merge df1 and df2 on the nearest timestamp,
# with a tolerance just under df1's mean timestamp step.
merge_df = pd.merge_asof(df1, df2, on='timestamp', direction='nearest',
                         tolerance=df1.timestamp.diff().mean() - 1e-6)
The merged result merge_df would be:
timestamp | a | b |
---|---|---|
0.0 | 0 | 0 |
0.5 | 2 | 2 |
1.0 | 4 | 3 |
1.5 | 6 | 5 |
2.0 | 8 | 6 |
... | ... | .. |
30.5 | 122 | 92 |
31.0 | 124 | 94 |
31.5 | 126 | 95 |
32.0 | 128 | 97 |
32.5 | 130 | 98 |
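As a side note, the tolerance above evaluates to just under df1's mean timestamp step, so any df1 row with no df2 timestamp that close is left unmatched (b becomes NaN). A quick check (an illustrative snippet, not part of the original post):

# df1's timestamps advance in steps of 0.5, so the mean diff is 0.5
# and the tolerance works out to 0.5 - 1e-6.
print(df1.timestamp.diff().mean())  # 0.5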
Now, given similar dataframes in Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1_spark = spark.createDataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2_spark = spark.createDataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])
How can I join the two Spark dataframes to produce a result similar to pandas, with configurable direction and tolerance? Thanks.
[Edit]
As suggested in similar posts, applying a function over a Window can reproduce the behavior of the direction parameter. However, I still don't know how to find the nearest row (as direction='nearest' would) within a certain range (tolerance).
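For what it's worth, here is a minimal sketch of one way to put those pieces together, assuming PySpark's functions and Window APIs; merge_asof_spark and the _left_id/_right_on/_rn names are hypothetical helpers introduced for illustration, not an existing Spark API. The idea: join each left row to every right row allowed by direction and tolerance, then keep the single closest candidate per left row using row_number over a Window ordered by absolute distance.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def merge_asof_spark(left, right, on="timestamp", tolerance=None, direction="nearest"):
    # Tag each left row so duplicate key values stay distinct after the join.
    left = left.withColumn("_left_id", F.monotonically_increasing_id())
    # Rename the right key so both timestamps survive the join.
    right = right.withColumnRenamed(on, "_right_on")

    dist = F.col(on) - F.col("_right_on")
    if direction == "backward":       # match only right keys at or before the left key
        cond = dist >= 0
    elif direction == "forward":      # match only right keys at or after the left key
        cond = dist <= 0
    else:                             # "nearest": allow either side
        cond = F.lit(True)
    if tolerance is not None:
        cond = cond & (F.abs(dist) <= tolerance)

    # Non-equi join: Spark runs this as a (broadcast) nested-loop join,
    # so it is only practical when one side is reasonably small.
    joined = left.join(right, cond, "left")

    # Per left row, rank candidates by absolute distance and keep the closest.
    # Ties are broken by the right key, which may differ from pandas on exact ties.
    w = Window.partitionBy("_left_id").orderBy(
        F.abs(F.col(on) - F.col("_right_on")).asc(), F.col("_right_on").asc()
    )
    return (
        joined.withColumn("_rn", F.row_number().over(w))
        .where(F.col("_rn") == 1)
        .drop("_rn", "_left_id", "_right_on")
    )

merged = merge_asof_spark(df1_spark, df2_spark, on="timestamp",
                          tolerance=0.5 - 1e-6, direction="nearest")
merged.orderBy("timestamp").show()

Left rows with no candidate inside the tolerance survive the left join with b as null, matching pandas' NaN behavior. The main caveat is the cross-join cost; adding a coarse bucketing column (an equi-join key derived from a truncated timestamp) could reduce it for large inputs.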