Given a dataset like this:
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
{'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
[out]:
      key  freq   percent
0     ABC   100  0.328947
1     DEF    60  0.197368
2     GHI    50  0.164474
3     JKL    40  0.131579
4     MNO    13  0.042763
5     PQR    11  0.036184
6     STU    10  0.032895
7     VWX    10  0.032895
8     YZZ     3  0.009868
9    WHYQ     3  0.009868
10  HOWEE     2  0.006579
11    DUH     1  0.003289
12   HAHA     1  0.003289
The goal is to:
- select 1 example from the 50-100 percentile of the frequency,
- select 2 examples from the 10-50 percentile, and
- select 4 examples from the < 10 percentile.
In this case, the answers that fit are:
- Pick 1 from ['ABC', 'DEF']
- Pick 2 from ['GHI', 'JKL', 'MNO', 'PQR']
- Pick 4 from ['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']
I've tried this:
import random
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
{'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
bin_50_100 = []
bin_10_50 = []
bin_10 = []
total_percent = 1.0
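# Walk the rows from most to least frequent; total_percent is the share
# of the total frequency that has not been consumed yet.
for idx, row in df.sort_values(by=['freq', 'key'], ascending=False).iterrows():
    if total_percent > 0.5:
        bin_50_100.append(row['key'])
    elif total_percent > 0.1:  # also covers a value of exactly 0.5
        bin_10_50.append(row['key'])
    else:
        bin_10.append(row['key'])
    total_percent -= row['percent']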
print(random.sample(bin_50_100, 1))
print(random.sample(bin_10_50, 2))
print(random.sample(bin_10, 4))
[out]:
['DEF']
['MNO', 'PQR']
['HOWEE', 'WHYQ', 'HAHA', 'DUH']
But is there a simpler way to solve the problem?
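For what it's worth, one possibly simpler direction is to replace the loop with a cumulative sum and `pd.cut`. This is only a sketch under the same binning rule as my loop above (a row's bin is decided by the share of total frequency remaining before that row); the `remaining` and `bin` column names, the bin labels, and the `n_per_bin` dict are names I made up for illustration.

```python
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows)
# Same ordering as the loop: most frequent first, ties broken by key.
df = df.sort_values(by=['freq', 'key'], ascending=False).reset_index(drop=True)
df['percent'] = df['freq'] / df['freq'].sum()

# Share of the total frequency still remaining *before* each row is consumed
# (1.0 for the first row), which is what total_percent tracks in the loop.
df['remaining'] = 1.0 - df['percent'].cumsum().shift(fill_value=0.0)

# Hypothetical labels; intervals are (-inf, 0.1], (0.1, 0.5], (0.5, inf).
df['bin'] = pd.cut(df['remaining'],
                   bins=[float('-inf'), 0.1, 0.5, float('inf')],
                   labels=['<10', '10-50', '50-100'])

# How many keys to draw from each bin (assumed from the goal stated above).
n_per_bin = {'50-100': 1, '10-50': 2, '<10': 4}
picks = {label: df.loc[df['bin'] == label, 'key'].sample(n=n).tolist()
         for label, n in n_per_bin.items()}
print(picks)
```

The draws are random, so the printed keys differ between runs; passing `random_state=` to `sample` should make them reproducible.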
from How to sample from DataFrame based on percentile of a column?