I have a dataset as follows:
Cardholder Last Name Cardholder First Initial Amount ... Transaction Date Merchant Category Code (MCC) Class
0 Gunn T 32.00 ... 07/31/2013 12:00:00 AM COMPUTER SOFTWARE STORES V
1 Claunch R 53.91 ... 07/22/2013 12:00:00 AM HOME SUPPLY WAREHOUSE STORES V
2 UNIVERSITY AMERICAN G 412.00 ... 03/05/2014 12:00:00 AM AMERICAN AIRLINES V
3 UNIVERSITY AMERICAN G 1481.60 ... 04/14/2014 12:00:00 AM UNITED AIRLINES V
4 JACKSON D 104.99 ... 01/23/2014 12:00:00 AM MENS, WOMENS AND CHILDRENS UNIFORMS AND COMMER... V
... ... ... ... ... ... ... ...
221001 MELAKAYIL M 87.00 ... 10/18/2013 12:00:00 AM HOLIDAY INNS V
221002 BAIRD N 12010.05 ... 08/27/2013 12:00:00 AM COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT, SOFT... V
221003 Tobler M 4.19 ... 03/15/2014 12:00:00 AM CHEMICALS AND ALLIED PRODUCTS NOT ELSEWHERE CL... V
221004 KING R 283.74 ... 02/25/2014 12:00:00 AM COMPUTER SOFTWARE STORES V
221005 Dodson D 19.03 ... 04/04/2014 12:00:00 AM DENTAL/LABORATORY/MEDICAL/OPHTHALMIC HOSP EQIP... V
[221006 rows x 7 columns]
Index(['Cardholder Last Name', 'Cardholder First Initial', 'Amount', 'Vendor',
'Transaction Date', 'Merchant Category Code (MCC)', 'Class'],
dtype='object')
I also have the groups as a list of lists, each holding a last name and a first initial:
['Oudin', 'A']
['LINDLEY', 'D']
['Champlin', 'B']
['JOHNSTON', 'D']
['EDDINGS', 'A']
...
['Kornegay', 'V']
['Thurman', 'T']
['Hunt', 'S']
['THOMAS', 'L']
['Skinner', 'R']
For every group, I want to create fake instances with the Class value F by sampling random rows and relabeling them as that group. The rows sampled for a group must not include that group's own instances.
I tried the code below:
import pandas as pd

group_labels = ['Cardholder Last Name', 'Cardholder First Initial']
# Select random data in the merged dataset, excluding the current group
mixed_transactions = pd.DataFrame(columns=list(df.columns))
for group in groups:
    # Get rows from df for the current group
    current_group = df[(df[group_labels[0]] == group[0]) & (df[group_labels[1]] == group[1])]
    # Skip groups that have no rows in df
    if len(current_group.index) > 0:
        # df minus the current group
        df_without_current_group = df[(df[group_labels[0]] != group[0]) | (df[group_labels[1]] != group[1])]
        # Pick as many random rows from 'df_without_current_group' as the current group has
        fake_transactions = df_without_current_group.sample(n=len(current_group.index))
        # Relabel the sampled rows as fake transactions of the current group
        fake_transactions['Class'] = 'F'
        fake_transactions[group_labels[0]] = group[0]
        fake_transactions[group_labels[1]] = group[1]
        mixed_transactions = pd.concat([mixed_transactions, current_group, fake_transactions])
But my approach is very slow: it takes about 3 seconds to process 10 groups.
How can I make it faster?
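One possible direction, offered as a sketch rather than a measured solution: the loop above builds several full-length boolean masks and re-concatenates the growing result for every group, so doing the grouping once and concatenating once may help. The function name `add_fake_transactions` and the `seed` parameter are hypothetical, introduced only for this illustration. The sketch assumes `df` has the two label columns and a `Class` column, and that no group holds more than half of the rows (otherwise sampling without replacement fails, as `sample` would in the loop above).

```python
import numpy as np
import pandas as pd

GROUP_LABELS = ["Cardholder Last Name", "Cardholder First Initial"]


def add_fake_transactions(df, groups, group_labels=GROUP_LABELS, seed=None):
    """Return each group's real rows plus equally many 'F' rows that are
    sampled from the rest of df and relabelled as that group."""
    rng = np.random.default_rng(seed)
    n_rows = len(df)
    # Positional row indices of every (last name, initial) pair, computed once.
    positions_by_group = df.groupby(group_labels, sort=False).indices

    real_pos, fake_pos, fake_labels = [], [], []
    for group in groups:
        own = positions_by_group.get(tuple(group))
        if own is None:
            continue  # this group has no rows in df
        own = np.sort(own)
        n_own = len(own)
        # Draw n_own distinct ranks among the rows that are NOT in this group...
        ranks = rng.choice(n_rows - n_own, size=n_own, replace=False)
        # ...then shift each rank past the group's own rows that precede it,
        # which turns a rank into a row position outside the group.
        shift = np.searchsorted(own - np.arange(n_own), ranks, side="right")
        real_pos.append(own)
        fake_pos.append(ranks + shift)
        fake_labels.extend([tuple(group)] * n_own)

    if not real_pos:
        return df.iloc[0:0]  # none of the groups occur in df

    real = df.iloc[np.concatenate(real_pos)]
    fake = df.iloc[np.concatenate(fake_pos)].copy()
    # Relabel the sampled rows as fake transactions of their target group.
    labels = np.array(fake_labels, dtype=object)
    for i, column in enumerate(group_labels):
        fake[column] = labels[:, i]
    fake["Class"] = "F"
    # A single concat at the end instead of one per group.
    return pd.concat([real, fake])
```

It would be called as `mixed_transactions = add_fake_transactions(df, groups)`. Unlike the loop, the result lists all real rows first and all fake rows after them, so it needs a sort if the interleaved order matters.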
from How to select random data in a dataset for all group excluding current group