Let's say I have the following table/data frame:
import pandas as pd
d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
df = pd.DataFrame(data=d)
print(df)
  store product
0    s1       a
1    s1       c
2    s2       a
3    s2       c
I would like to find, for each pair of products, the number of times they co-occur in a store.
Since the data is very large (5M rows, about 50K distinct products and 20K distinct stores) and there are many potential co-occurrence pairs, I only need the top n (for example, 10) co-occurring products for each product, together with the co-occurrence count. An example result is below:
  product_1 product_2  cooccurrence_count
0         a         c                   2
1         c         a                   2
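To make the desired computation concrete, here is a minimal pandas sketch of one possible approach: a self-merge on store followed by a grouped count. The function name top_cooccurrences and its parameter n are hypothetical, chosen only for illustration. It assumes each (store, product) pair should be counted once and that ties are broken alphabetically by product_2. The self-merge materializes every ordered product pair per store, so on the full 5M-row data it may need to be run in chunks of stores or replaced by a sparse-matrix approach.

```python
import pandas as pd


def top_cooccurrences(df, n=10):
    """Hypothetical helper: top-n co-occurring products for each product."""
    # Count each (store, product) pair once so duplicate rows do not inflate counts.
    pairs = df[['store', 'product']].drop_duplicates()
    # Self-merge on store: every ordered pair of products found in the same store.
    merged = pairs.merge(pairs, on='store', suffixes=('_1', '_2'))
    # Drop pairs of a product with itself.
    merged = merged[merged['product_1'] != merged['product_2']]
    # Number of stores in which each ordered pair appears.
    counts = (
        merged.groupby(['product_1', 'product_2'])
        .size()
        .reset_index(name='cooccurrence_count')
    )
    # Sort so the most frequent partners come first within each product_1.
    counts = counts.sort_values(
        ['product_1', 'cooccurrence_count', 'product_2'],
        ascending=[True, False, True],
    )
    # Keep at most n rows per product_1.
    return counts.groupby('product_1').head(n).reset_index(drop=True)


d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
df = pd.DataFrame(data=d)
result = top_cooccurrences(df, n=10)
print(result)
```

On the four-row example this yields the two rows shown in the example result.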
An effective and efficient solution in SQL instead of pandas would also be acceptable.
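For the SQL variant, a rough sketch of the same idea is a self-join on store plus ROW_NUMBER() to keep the top n per product. It is run here through Python's built-in sqlite3 module only so that it is self-contained. The table name sales is an assumption, and window functions require SQLite 3.25 or newer; other databases may need small syntax changes.

```python
import sqlite3

# In-memory database holding the example data; the table name "sales" is assumed.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (store TEXT, product TEXT)')
conn.executemany(
    'INSERT INTO sales VALUES (?, ?)',
    [('s1', 'a'), ('s1', 'c'), ('s2', 'a'), ('s2', 'c')],
)

query = """
WITH pairs AS (
    -- Count each (store, product) pair once.
    SELECT DISTINCT store, product FROM sales
),
counts AS (
    -- Self-join on store: ordered pairs of different products in the same store.
    SELECT a.product AS product_1,
           b.product AS product_2,
           COUNT(*)  AS cooccurrence_count
    FROM pairs AS a
    JOIN pairs AS b
      ON a.store = b.store AND a.product <> b.product
    GROUP BY a.product, b.product
),
ranked AS (
    -- Rank the partners of each product by descending count.
    SELECT product_1, product_2, cooccurrence_count,
           ROW_NUMBER() OVER (
               PARTITION BY product_1
               ORDER BY cooccurrence_count DESC, product_2
           ) AS rn
    FROM counts
)
SELECT product_1, product_2, cooccurrence_count
FROM ranked
WHERE rn <= ?
ORDER BY product_1, rn
"""

# The single parameter is n, the number of partners to keep per product.
rows = conn.execute(query, (10,)).fetchall()
print(rows)
```

An index on the store column would likely help the self-join on a table of this size, though that has not been measured here.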