Hemant Vishwakarma: How to sample from DataFrame based on percentile of a column?

Monday, 15 February 2021

How to sample from DataFrame based on percentile of a column?

Given a dataset like this:

import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60}, 
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40}, 
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11}, 
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10}, 
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3}, 
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1}, 
{'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows)

df['percent'] = df['freq'] / sum(df['freq'])

[out]:

      key  freq   percent
0     ABC   100  0.328947
1     DEF    60  0.197368
2     GHI    50  0.164474
3     JKL    40  0.131579
4     MNO    13  0.042763
5     PQR    11  0.036184
6     STU    10  0.032895
7     VWX    10  0.032895
8     YZZ     3  0.009868
9    WHYQ     3  0.009868
10  HOWEE     2  0.006579
11    DUH     1  0.003289
12   HAHA     1  0.003289

The goal is to:

select 1 example from the top 50-100 percentile of the frequency,
select 2 examples from the 10-50 percentile, and
select 4 examples from the < 10 percentile.

In this case, the answers that fit are:

Pick 1 from ['ABC', 'DEF']
Pick 2 from ['GHI', 'JKL', 'MNO', 'PQR']
Pick 4 from ['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']
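
The three candidate lists above can be checked with a short sketch. It assumes that "percentile" here means the share of the total frequency that is still remaining when a row is reached, with rows sorted by descending frequency (ties broken by key, descending). The `remaining` column and the `top`, `mid` and `low` names are illustrative choices, not part of the original question.

```python
import pandas as pd

freq_by_key = {
    'ABC': 100, 'DEF': 60, 'GHI': 50, 'JKL': 40, 'MNO': 13, 'PQR': 11,
    'STU': 10, 'VWX': 10, 'YZZ': 3, 'WHYQ': 3, 'HOWEE': 2, 'DUH': 1, 'HAHA': 1,
}
df = pd.DataFrame({'key': list(freq_by_key), 'freq': list(freq_by_key.values())})

# Highest frequency first; ties broken by key, descending.
df = df.sort_values(['freq', 'key'], ascending=False).reset_index(drop=True)

# Assumption: a row's "percentile" is the share of the total frequency
# that has not yet been used up by the rows above it.
total = df['freq'].sum()
df['remaining'] = (total - df['freq'].cumsum().shift(fill_value=0)) / total

top = df.loc[df['remaining'] > 0.5, 'key'].tolist()
mid = df.loc[(df['remaining'] > 0.1) & (df['remaining'] <= 0.5), 'key'].tolist()
low = df.loc[df['remaining'] <= 0.1, 'key'].tolist()

print(top)  # ['ABC', 'DEF']
print(mid)  # ['GHI', 'JKL', 'MNO', 'PQR']
print(low)  # ['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']
```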

I've tried this:

import random
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60}, 
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40}, 
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11}, 
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10}, 
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3}, 
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1}, 
{'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])

bin_50_100 = []
bin_10_50 = []
bin_10 = []

total_percent = 1.0
for idx, row in df.sort_values(by=['freq', 'key'], ascending=False).iterrows():
    if total_percent > 0.5:
        bin_50_100.append(row['key'])
    elif total_percent > 0.1:  # also covers a value of exactly 0.5
        bin_10_50.append(row['key'])
    else:
        bin_10.append(row['key'])
    total_percent -= row['percent']

print(random.sample(bin_50_100, 1))
print(random.sample(bin_10_50, 2))
print(random.sample(bin_10, 4))

[out]:

['DEF']
['MNO', 'PQR']
['HOWEE', 'WHYQ', 'HAHA', 'DUH']

But is there a simpler way to solve the problem?
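
One possibly simpler route, sketched here under assumptions rather than offered as the definitive answer: compute the remaining cumulative share in a vectorised way, label the rows with `pd.cut`, and draw from each bin with `Series.sample`. The bin labels (`'low'`, `'mid'`, `'top'`) and the `n_per_bin` mapping are hypothetical names introduced for illustration, and the bin edges assume right-closed intervals, so a value of exactly 0.5 lands in the 10-50 bin.

```python
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows).sort_values(['freq', 'key'], ascending=False)

# Share of the total frequency still remaining when each row is reached.
total = df['freq'].sum()
remaining = (total - df['freq'].cumsum().shift(fill_value=0)) / total

# pd.cut uses right-closed intervals by default: (0, 0.1], (0.1, 0.5], (0.5, 1].
df['bin'] = pd.cut(remaining, bins=[0, 0.1, 0.5, 1.0],
                   labels=['low', 'mid', 'top'])

# Hypothetical mapping: how many keys to draw from each bin.
n_per_bin = {'top': 1, 'mid': 2, 'low': 4}

picks = {
    label: df.loc[df['bin'] == label, 'key'].sample(n=n).tolist()
    for label, n in n_per_bin.items()
}

for label, keys in picks.items():
    print(label, keys)
```

Passing `random_state=` to `sample` would make the draw reproducible. Changing the edges passed to `pd.cut` or the counts in `n_per_bin` adapts the sketch to other percentile splits.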

from How to sample from DataFrame based on percentile of a column?
