Sunday, 9 September 2018

Efficient loop that counts unique values in a pandas df

I am trying to create a loop or a more efficient process that can count the amount of current values in a pandas df. At the moment I'm selecting the value I want to perform the function on.

So for the above df below I want to count how many values are left for Code ['A']. I select this via the following:

import pandas as pd

d = ({
'Code' : ['A','A','B','A','B','B','A','B','A','A'],            
'Area' : ['Home','Home','Shops','Park','Cafe','Shops','Home','Cafe','Work','Park'],  
 })

df = pd.DataFrame(data=d)

#Select code
df1 = df[df.Code == 'A'].copy()

df1['u'] = df1[::-1].groupby('Area').Area.cumcount()

ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0
df1['On'] = ids

df1 = df1.reindex(df.index).fillna(df)

So this returns ['u'], which is how many values are left. And ['On'], which is how many total values running.

The problem is if I want to perform the same function on other values in Code. For instance, if I want to do the same thing on B, I would have to do the same but change:

df2 = df1[df1.Code == 'B'].copy()

If I have numerous values in Code to run the script on it becomes very inefficient. e.g A,B,C,D,E etc. I need a loop where it finds all unique values in 'Code' and runs the script on them? Ideally, the script would look like:

df1 = df[df.Code == 'All unique values'].copy()

Intended Output:

  Code   Area    u   On
0    A   Home  2.0  1.0
1    A   Home  1.0  1.0
2    B  Shops  1.0  1.0
3    A   Park  1.0  2.0
4    B   Cafe  1.0  2.0
5    B  Shops  0.0  2.0
6    A   Home  0.0  2.0
7    B   Cafe  0.0  1.0
8    A   Work  0.0  2.0
9    A   Park  0.0  1.0



from Efficient loop that counts unique values in a pandas df

No comments:

Post a Comment