Thursday, 4 February 2021

How to select random data in a dataset for every group, excluding the current group

I have a dataset as follows:

       Cardholder Last Name Cardholder First Initial    Amount  ...        Transaction Date                       Merchant Category Code (MCC) Class
0                      Gunn                        T     32.00  ...  07/31/2013 12:00:00 AM                           COMPUTER SOFTWARE STORES     V
1                   Claunch                        R     53.91  ...  07/22/2013 12:00:00 AM                       HOME SUPPLY WAREHOUSE STORES     V
2       UNIVERSITY AMERICAN                        G    412.00  ...  03/05/2014 12:00:00 AM                                  AMERICAN AIRLINES     V
3       UNIVERSITY AMERICAN                        G   1481.60  ...  04/14/2014 12:00:00 AM                                    UNITED AIRLINES     V
4                   JACKSON                        D    104.99  ...  01/23/2014 12:00:00 AM  MENS, WOMENS AND CHILDRENS UNIFORMS AND COMMER...     V
...                     ...                      ...       ...  ...                     ...                                                ...   ...
221001            MELAKAYIL                        M     87.00  ...  10/18/2013 12:00:00 AM                                       HOLIDAY INNS     V
221002                BAIRD                        N  12010.05  ...  08/27/2013 12:00:00 AM  COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT, SOFT...     V
221003               Tobler                        M      4.19  ...  03/15/2014 12:00:00 AM  CHEMICALS AND ALLIED PRODUCTS NOT ELSEWHERE CL...     V
221004                 KING                        R    283.74  ...  02/25/2014 12:00:00 AM                           COMPUTER SOFTWARE STORES     V
221005               Dodson                        D     19.03  ...  04/04/2014 12:00:00 AM  DENTAL/LABORATORY/MEDICAL/OPHTHALMIC HOSP EQIP...     V

[221006 rows x 7 columns]
Index(['Cardholder Last Name', 'Cardholder First Initial', 'Amount', 'Vendor',
       'Transaction Date', 'Merchant Category Code (MCC)', 'Class'],
      dtype='object')

I also have a list of groups, stored as a list of lists:

['Oudin',       'A']
['LINDLEY',     'D']
['Champlin',    'B']
['JOHNSTON',    'D']
['EDDINGS',     'A']
...
['Kornegay',    'V']
['Thurman',     'T']
['Hunt',        'S']
['THOMAS',      'L']
['Skinner',     'R']

I want to create fake instances with Class value F for every group. The fake instances for a group must be sampled from the rows of the other groups, never from the group's own rows.

I tried the code below:

import pandas as pd

group_labels = ['Cardholder Last Name', 'Cardholder First Initial']

# Collects each group's real rows followed by its fake rows
mixed_transactions = pd.DataFrame(columns=list(df.columns))

for group in groups:
    # Get rows from df for the current group
    current_group = df[(df[group_labels[0]] == group[0]) & (df[group_labels[1]] == group[1])]

    # Skip groups that have no rows in df (checks current_group, not df)
    if len(current_group.index) > 0:
        # df minus the current group
        df_without_current_group = df[(df[group_labels[0]] != group[0]) | (df[group_labels[1]] != group[1])]

        # Pick as many random rows from the other groups as the current group has
        fake_transactions = df_without_current_group.sample(n=len(current_group.index))

        # Mark the sampled rows as fake and relabel them with the current group
        fake_transactions['Class'] = 'F'
        fake_transactions[group_labels[0]] = group[0]
        fake_transactions[group_labels[1]] = group[1]

        mixed_transactions = pd.concat([mixed_transactions, current_group, fake_transactions])

But my approach is slow: it takes about 3 seconds to process 10 groups.

How can I do it faster?
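
One possible direction, offered as a sketch rather than a measured solution: most of the time in the loop above likely goes into comparing two string columns against every row for each group, and into calling pd.concat once per group. The version below looks up every group's row positions once with groupby(...).indices, samples positions with NumPy, and builds the result with a single concat. The function name add_fake_transactions, its seed parameter, and the toy data are hypothetical, introduced only for illustration. Unlike the loop, it returns all real rows first and all fake rows after them.

```python
import numpy as np
import pandas as pd


def add_fake_transactions(df, groups, group_labels, seed=None):
    """Return each group's real rows plus an equal number of fake rows (Class 'F')."""
    rng = np.random.default_rng(seed)
    # Positional row numbers of every group, computed in one pass over df.
    positions = df.groupby(group_labels, sort=False).indices
    n_rows = len(df)
    real_idx, fake_idx, owners, sizes = [], [], [], []

    for last_name, first_initial in groups:
        pos = positions.get((last_name, first_initial))
        if pos is None:  # group has no rows in df
            continue
        # Boolean mask of rows that do NOT belong to the current group.
        outside = np.ones(n_rows, dtype=bool)
        outside[pos] = False
        candidates = np.flatnonzero(outside)
        # Raises ValueError if the group is larger than the rest of df,
        # just as DataFrame.sample would in the original loop.
        picked = rng.choice(candidates, size=len(pos), replace=False)
        real_idx.append(pos)
        fake_idx.append(picked)
        owners.append((last_name, first_initial))
        sizes.append(len(pos))

    if not real_idx:
        return df.iloc[0:0].copy()

    real = df.iloc[np.concatenate(real_idx)]
    fake = df.iloc[np.concatenate(fake_idx)].copy()
    # Mark sampled rows as fake and relabel them with the group they were drawn for.
    fake["Class"] = "F"
    fake[group_labels[0]] = np.repeat([o[0] for o in owners], sizes)
    fake[group_labels[1]] = np.repeat([o[1] for o in owners], sizes)
    return pd.concat([real, fake], ignore_index=True)


# Toy data for illustration only (not the real dataset).
group_labels = ["Cardholder Last Name", "Cardholder First Initial"]
df = pd.DataFrame({
    "Cardholder Last Name": ["Gunn", "Gunn", "Claunch", "KING", "KING", "KING", "Tobler"],
    "Cardholder First Initial": ["T", "T", "R", "R", "R", "R", "M"],
    "Amount": [32.00, 53.91, 412.00, 1481.60, 104.99, 87.00, 4.19],
    "Class": ["V"] * 7,
})
groups = [["Gunn", "T"], ["KING", "R"]]
mixed = add_fake_transactions(df, groups, group_labels, seed=0)
print(mixed)
```

Building the mask is still linear in the number of rows for each group, but it works on a boolean array instead of string columns, so it should be considerably cheaper per group. Whether it is fast enough for the full dataset would need to be timed.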


