I'm trying to replicate pandas' merge_asof behavior when joining Spark DataFrames.
Say I have two DataFrames, df1 and df2:
import pandas as pd

df1 = pd.DataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2 = pd.DataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

# use merge_asof to match each df1 row with the nearest df2 row within tolerance
merge_df = pd.merge_asof(
    df1, df2, on="timestamp", direction="nearest",
    tolerance=df1.timestamp.diff().mean() - 1e-6,
)
The resulting merge_df would be:
| timestamp | a | b |
|---|---|---|
| 0.0 | 0 | 0 |
| 0.5 | 2 | 2 |
| 1.0 | 4 | 3 |
| 1.5 | 6 | 5 |
| 2.0 | 8 | 6 |
| ... | ... | ... |
| 30.5 | 122 | 92 |
| 31.0 | 124 | 94 |
| 31.5 | 126 | 95 |
| 32.0 | 128 | 97 |
| 32.5 | 130 | 98 |
Now, given equivalent DataFrames in Spark:
df1_spark = spark.createDataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2_spark = spark.createDataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])
How can I join the two Spark DataFrames to produce the same result as in pandas, with configurable direction and tolerance? Thanks.
[Edit]
As suggested in similar posts, applying a function over a Window can reproduce the behavior of the direction parameter. However, I still don't know how to pick the single nearest row (as direction='nearest' does) while restricting matches to a given range (tolerance).
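To make the question concrete, here is the kind of approach I am imagining: a range join within the tolerance, followed by row_number() over a Window ordered by the absolute time gap, keeping only the nearest match. This is an untested sketch, and the tolerance value simply hard-codes what df1.timestamp.diff().mean() - 1e-6 evaluates to in the pandas snippet (the mean step of df1 is exactly 0.5):

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df1_spark = spark.createDataFrame([{"timestamp": 0.5 * i, "a": i * 2} for i in range(66)])
df2_spark = spark.createDataFrame([{"timestamp": 0.33 * i, "b": i} for i in range(100)])

tolerance = 0.5 - 1e-6  # what df1.timestamp.diff().mean() - 1e-6 evaluates to above

# Range join: pair each df1 row with every df2 row inside the tolerance window.
# A left join keeps df1 rows that have no match (b stays null), like merge_asof.
candidates = df1_spark.alias("l").join(
    df2_spark.alias("r"),
    F.abs(F.col("l.timestamp") - F.col("r.timestamp")) <= tolerance,
    "left",
)

# Within each df1 timestamp, rank candidates by absolute gap and keep the
# nearest one -- my attempt at reproducing direction='nearest'.
w = Window.partitionBy(F.col("l.timestamp")).orderBy(
    F.abs(F.col("l.timestamp") - F.col("r.timestamp"))
)
merged = (
    candidates.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select(
        F.col("l.timestamp").alias("timestamp"),
        F.col("l.a").alias("a"),
        F.col("r.b").alias("b"),
    )
)
merged.orderBy("timestamp").show()
```

I assume direction could be made configurable by changing the join predicate, e.g. restricting l.timestamp - r.timestamp to [0, tolerance] for 'backward' and the mirror for 'forward', but I'm not sure this is the idiomatic (or efficient) way to do it in Spark, since the range join condition can degenerate into a nested-loop join.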