I have 3 pandas dataframes:

import pandas as pd

df_a = pd.DataFrame(data={
    'id': [1, 5, 3, 2],
    'ts': [3, 5, 11, 14],
    'other_cols': ['...'] * 4
})
df_b = pd.DataFrame(data={
    'id': [2, 1, 3],
    'ts': [7, 8, 15],
    'other_cols': ['...'] * 3
})
df_c = pd.DataFrame(data={
    'id': [154, 237, 726, 814, 528, 237, 248, 514],
    'ts': [1, 2, 4, 6, 9, 10, 12, 13],
    'other_cols': ['...'] * 8
})
Here is the problem I need to solve.
- for every `id` in `df_a`, find the corresponding `id` in `df_b` and their timestamps. Let's call them `ts_a` and `ts_b`.
- find all the rows in `df_c` between `min(ts_a, ts_b)` and `max(ts_a, ts_b)` and calculate some custom function on these rows. This function can be a pandas function (95% of the time), but it can be any Python function.
Here are examples of the rows for each id, as `[id, ts]` pairs:
- id 1: `[726, 4], [814, 6]`
- id 2: `[528, 9], [237, 10], [248, 12], [514, 13]`
- id 3: `[248, 12], [514, 13]`
- id 5: can be found only in A, but not in B, so nothing should be done
The output format does not really matter, so anything that can map `id` to `f(rows for that id)` would do the job.
For example, if I apply a simple `len` function to the rows, I get the following results:
id | res |
---|---|
1 | 2 |
2 | 4 |
3 | 2 |
If my function is `max(ts) - min(ts)`, the results are:
id | res |
---|---|
1 | 2 = 6 - 4 |
2 | 4 = 13 - 9 |
3 | 1 = 13 - 12 |
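To make the expected output concrete, both tables above can be reproduced with a brute-force reference like the one below. This is only a minimal sketch for checking results on small inputs, not a candidate solution at the stated sizes. The helper name `apply_between` is hypothetical, and it assumes "between" means strictly between the two timestamps (the sample data has no ties, so either reading gives the same rows).

```python
import pandas as pd

df_a = pd.DataFrame({'id': [1, 5, 3, 2], 'ts': [3, 5, 11, 14]})
df_b = pd.DataFrame({'id': [2, 1, 3], 'ts': [7, 8, 15]})
df_c = pd.DataFrame({
    'id': [154, 237, 726, 814, 528, 237, 248, 514],
    'ts': [1, 2, 4, 6, 9, 10, 12, 13],
})


def apply_between(df_a, df_b, df_c, func):
    """Map each id found in both df_a and df_b to func(rows of df_c in its window)."""
    # The inner merge drops ids that appear in only one table (id 5 here).
    pairs = df_a[['id', 'ts']].merge(
        df_b[['id', 'ts']], on='id', suffixes=('_a', '_b')
    )
    lo = pairs[['ts_a', 'ts_b']].min(axis=1)
    hi = pairs[['ts_a', 'ts_b']].max(axis=1)

    result = {}
    for id_, lo_ts, hi_ts in zip(pairs['id'], lo, hi):
        # Assumption: exclusive bounds. This scans all of df_c for every id,
        # so it is quadratic and only suitable as a correctness check.
        rows = df_c[(df_c['ts'] > lo_ts) & (df_c['ts'] < hi_ts)]
        result[id_] = func(rows)
    return result


counts = apply_between(df_a, df_b, df_c, len)
spans = apply_between(
    df_a, df_b, df_c, lambda rows: rows['ts'].max() - rows['ts'].min()
)
```

Here `counts` corresponds to the `len` table and `spans` to the `max(ts) - min(ts)` table.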
Here are the assumptions on the dataframes:
- `id`s in each corresponding table are unique
- each dataframe is sorted by `ts`
- there might exist an `id` in `df_a` which does not exist in `df_b` and vice versa (but the percentage of missing ids is less than 1%)
- tables A/B can be on the order of tens of millions of rows; table C is on the order of hundreds of millions
- although theoretically there can be any number of rows between the timestamps, empirically the median is a two-digit number and the maximum is slightly more than a thousand
My working solutions
Attempt 1
- create a dictionary `id -> ts` from `df_b`. Linear in the length of `df_b`
- create a sorted list of `ts, other_cols` from `df_c`. Linear in the length of `df_c`, as it is already sorted by `ts`
- iterate over `df_a`, and for each id look up its `ts` in the dictionary. Then do two binary searches in the sorted list to find the edges of the data to analyze, and apply the function
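Roughly, the steps of Attempt 1 look like the sketch below. It is a simplified illustration rather than my exact code: the function name `attempt_1` is made up for this post, it slices `df_c` with `iloc` instead of keeping a separate list of `ts, other_cols` tuples, and it assumes the window bounds are exclusive.

```python
from bisect import bisect_left, bisect_right

import pandas as pd

df_a = pd.DataFrame({'id': [1, 5, 3, 2], 'ts': [3, 5, 11, 14]})
df_b = pd.DataFrame({'id': [2, 1, 3], 'ts': [7, 8, 15]})
df_c = pd.DataFrame({
    'id': [154, 237, 726, 814, 528, 237, 248, 514],
    'ts': [1, 2, 4, 6, 9, 10, 12, 13],
})


def attempt_1(df_a, df_b, df_c, func):
    # Step 1: dictionary id -> ts built from df_b (one linear pass).
    ts_b_by_id = dict(zip(df_b['id'], df_b['ts']))
    # Step 2: df_c is already sorted by ts, so this list is sorted too.
    c_ts = df_c['ts'].tolist()

    result = {}
    # Step 3: one dictionary lookup and two binary searches per id in df_a.
    for id_, ts_a in zip(df_a['id'], df_a['ts']):
        ts_b = ts_b_by_id.get(id_)
        if ts_b is None:
            continue  # id exists only in df_a, nothing to do
        lo_ts, hi_ts = min(ts_a, ts_b), max(ts_a, ts_b)
        start = bisect_right(c_ts, lo_ts)  # first position with ts > lo_ts
        stop = bisect_left(c_ts, hi_ts)    # first position with ts >= hi_ts
        result[id_] = func(df_c.iloc[start:stop])
    return result


counts_1 = attempt_1(df_a, df_b, df_c, len)
```

The extra memory here is the dictionary over `df_b` plus the list of `ts` values from `df_c`.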
Attempt 2
- combine all the dataframes into one and order by `ts`:
df = pd.concat([df_a, df_b, df_c]).sort_values(by='ts').reset_index(drop=True)
- iterate over this dataframe with a sliding-window approach and maintain a dictionary `seen_ids` (`id -> index`) where you put ids from tables A/B. If you see an id that is already in this dictionary, take `df.iloc[index_1:index_2]`, filter it to only the rows from C, and apply the function
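A simplified sketch of Attempt 2 follows. The `src` column is a helper added here for illustration so that rows from C can be told apart after the concat; the function name `attempt_2` is likewise made up for this post. It assumes timestamps do not collide across tables, as in the sample data.

```python
import pandas as pd

df_a = pd.DataFrame({'id': [1, 5, 3, 2], 'ts': [3, 5, 11, 14]})
df_b = pd.DataFrame({'id': [2, 1, 3], 'ts': [7, 8, 15]})
df_c = pd.DataFrame({
    'id': [154, 237, 726, 814, 528, 237, 248, 514],
    'ts': [1, 2, 4, 6, 9, 10, 12, 13],
})


def attempt_2(df_a, df_b, df_c, func):
    # Tag every row with its table of origin, then order everything by ts.
    df = pd.concat([
        df_a.assign(src='a'),
        df_b.assign(src='b'),
        df_c.assign(src='c'),
    ]).sort_values(by='ts').reset_index(drop=True)

    seen_ids = {}  # id -> position of its first A/B row
    result = {}
    for index_2, (id_, src) in enumerate(zip(df['id'], df['src'])):
        if src == 'c':
            continue  # ids in C are unrelated to the ids in A/B
        if id_ in seen_ids:
            # Second occurrence of the id: close the window.
            index_1 = seen_ids.pop(id_)
            window = df.iloc[index_1 + 1:index_2]
            result[id_] = func(window[window['src'] == 'c'])
        else:
            seen_ids[id_] = index_2
    return result


counts_2 = attempt_2(df_a, df_b, df_c, len)
```

Ids that occur in only one of A/B (such as id 5) simply stay in `seen_ids` and never produce a result.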
Both attempts work correctly and run in loglinear time, but on my data they take ~20-30 minutes to run, which is bearable but not ideal. On top of this, both need additional memory to store the auxiliary data.
My question to pandas gurus
Can this be achieved with pure pandas and be more efficient than my custom implementation?
This question is important to me, so I am more than happy to provide a 500 bounty for a solution which can beat my current solutions (in terms of speed/memory).
from For the same ids in df_a and df_b, apply the custom function for df_c between those values