Friday, 25 June 2021

Pandas/SQL co-occurrence count

Let's say I have the following table/data frame:

import pandas as pd

d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
df = pd.DataFrame(data=d)

print(df)
  store product
0    s1       a
1    s1       c
2    s2       a
3    s2       c

I would like to find, for each pair of products, the number of stores in which they co-occur.

Since the data is very large (5M rows, about 50K distinct products and 20K distinct stores) and there are many potential co-occurrence pairs, I would only like to get the top n (for example, 10) co-occurrences for each product, together with the co-occurrence count. An example result is below:

   product_1  product_2  cooccurrence_count
0          a          c                   2
1          c          a                   2
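
To make the target concrete, here is a minimal pandas sketch of one way the result above could be produced: a self-merge on store, followed by a count and a per-product top-n cut. The function name top_cooccurrences and its parameter n are hypothetical, and the approach is only an illustration. The intermediate merge grows with the square of the number of products per store, so on the full data it would probably have to be run over batches of stores.

```python
import pandas as pd

def top_cooccurrences(df, n=10):
    # Hypothetical helper: top-n co-occurring partners for each product.
    # Keep one row per (store, product) so repeated rows do not inflate counts.
    pairs = df[['store', 'product']].drop_duplicates()
    # Self-merge on store: every ordered product pair sharing a store.
    merged = pairs.merge(pairs, on='store', suffixes=('_1', '_2'))
    # Drop pairs of a product with itself.
    merged = merged[merged['product_1'] != merged['product_2']]
    counts = (
        merged.groupby(['product_1', 'product_2'])
        .size()
        .reset_index(name='cooccurrence_count')
    )
    # Sort so the most frequent partners come first within each product_1.
    counts = counts.sort_values(
        ['product_1', 'cooccurrence_count', 'product_2'],
        ascending=[True, False, True],
    )
    return counts.groupby('product_1').head(n).reset_index(drop=True)

d = {'store': ['s1', 's1', 's2', 's2'], 'product': ['a', 'c', 'a', 'c']}
result = top_cooccurrences(pd.DataFrame(data=d), n=10)
print(result)
```

On the four-row example this gives the two rows shown above, (a, c, 2) and (c, a, 2).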

An effective and efficient solution in SQL instead of pandas would also be acceptable.
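
For the SQL variant, the query shape I have in mind is roughly a self-join plus a window function, sketched here through Python's built-in sqlite3 module so it can be run as-is. The table name sales is an assumption made for illustration, and ROW_NUMBER() requires an SQLite build with window-function support (3.25 or later); other databases may need small syntax changes.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical table name and schema mirroring the data frame above.
conn.execute('CREATE TABLE sales (store TEXT, product TEXT)')
conn.executemany(
    'INSERT INTO sales VALUES (?, ?)',
    [('s1', 'a'), ('s1', 'c'), ('s2', 'a'), ('s2', 'c')],
)

query = """
WITH pairs AS (
    -- Count the distinct stores in which each ordered product pair appears.
    SELECT a.product AS product_1,
           b.product AS product_2,
           COUNT(DISTINCT a.store) AS cooccurrence_count
    FROM sales AS a
    JOIN sales AS b
      ON a.store = b.store AND a.product <> b.product
    GROUP BY a.product, b.product
),
ranked AS (
    -- Rank partners within each product_1, most frequent first.
    SELECT product_1, product_2, cooccurrence_count,
           ROW_NUMBER() OVER (
               PARTITION BY product_1
               ORDER BY cooccurrence_count DESC, product_2
           ) AS rn
    FROM pairs
)
SELECT product_1, product_2, cooccurrence_count
FROM ranked
WHERE rn <= ?
ORDER BY product_1, rn
"""

rows = conn.execute(query, (10,)).fetchall()  # 10 is the top-n cutoff
print(rows)
```

An index on (store, product) would likely help the self-join, though I have not measured this at the stated data size.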


