Friday, 14 September 2018

Python - For loop millions of rows

I have a dataframe c with lots of different columns. Also, arr is a dataframe that corresponds to a subset of c: arr = c[c['A_D'] == 'A'].

The main idea of my code is to iterate over all rows in the c-dataframe and search for all the possible cases (in the arr dataframe) where some specific conditions should happen:

  • It is only necessary to iterate over rows were c['A_D'] == D and c['Already_linked'] == 0
  • The hour in the arr dataframe must be less than the hour_aux in the c dataframe
  • The column Already_linked of the arr dataframe must be zero: arr.Already_linked == 0
  • The Terminal and the Operator needs to be the same in the c and arr dataframe

Right now, the conditions are stored using both Boolean indexing and groupby get_group:

  • Groupby the arr dataframe in order to choose the same Operator and Terminal: g = groups.get_group((row.Operator, row.Terminal))
  • Choose only the arrivals where the hour is smaller than the hour in the c dataframe and where Already_linked==0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]

For each of the rows in the c dataframe that verify all conditions, a vb dataframe is created. Naturally, this dataframe has different lengths in each iteration. After creating the vb dataframe, my goal is to choose the index of the vb dataframe that minimises the time between vb.START and c[x]. The FightID that corresponds to this index is then stored in the c dataframe on column a. Additionally, since the arrival was linked to a departure, the column Already_linked in the arr dataframe is changed from 0 to 1.

It is important to notice that the column Already_linked of the arr dataframe may change in every iteration (and arr.Already_linked == 0 is one of the conditions to create the vb dataframe). Therefore, it is not possible to parallelize this code.

I have already used c.itertuples() for efficiency, however since c has millions of rows, this code is still too time consuming.

Other option would also be to use pd.apply to every row. Nonetheless, this is not really straightforward since in each loop there are values that change in both c and arr (also, I believe that even with pd.apply it would be extremely slow).

Is there any possible way to convert this for loop in a vectorized solution (or to exponentially decrease the running time)?

Initial dataframe:

START     END       A_D     Operator     FlightID    Terminal   TROUND_ID   tot
0   2017-03-26 16:55:00 2017-10-28 16:55:00 A   QR  QR001   4   QR002       70
1   2017-03-26 09:30:00 2017-06-11 09:30:00 D   DL  DL001   3   "        "  84
2   2017-03-27 09:30:00 2017-10-28 09:30:00 D   DL  DL001   3   "        "  78
3   2017-10-08 15:15:00 2017-10-22 15:15:00 D   VS  VS001   3   "        "  45
4   2017-03-26 06:50:00 2017-06-11 06:50:00 A   DL  DL401   3   "        "  9
5   2017-03-27 06:50:00 2017-10-28 06:50:00 A   DL  DL401   3   "        "  19
6   2017-03-29 06:50:00 2017-04-19 06:50:00 A   DL  DL401   3   "        "  3
7   2017-05-03 06:50:00 2017-10-25 06:50:00 A   DL  DL401   3   "        "  32
8   2017-06-25 06:50:00 2017-10-22 06:50:00 A   DL  DL401   3   "        "  95
9   2017-03-26 07:45:00 2017-10-28 07:45:00 A   DL  DL402   3   "        "  58

Desired Output (some of the columns were excluded in the dataframe below. Only the a and Already_linked columns are relevant):

    START                    END             A_D  Operator  a   Already_linked
0   2017-03-26 16:55:00 2017-10-28 16:55:00 A   QR  0               1
1   2017-03-26 09:30:00 2017-06-11 09:30:00 D   DL  DL402           1
2   2017-03-27 09:30:00 2017-10-28 09:30:00 D   DL  DL401           1
3   2017-10-08 15:15:00 2017-10-22 15:15:00 D   VS  No_link_found   0
4   2017-03-26 06:50:00 2017-06-11 06:50:00 A   DL  0               0
5   2017-03-27 06:50:00 2017-10-28 06:50:00 A   DL  0               1
6   2017-03-29 06:50:00 2017-04-19 06:50:00 A   DL  0               0
7   2017-05-03 06:50:00 2017-10-25 06:50:00 A   DL  0               0
8   2017-06-25 06:50:00 2017-10-22 06:50:00 A   DL  0               0
9   2017-03-26 07:45:00 2017-10-28 07:45:00 A   DL  0               1

Code:

groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
    try:
        g = groups.get_group((row.Operator, row.Terminal))
        vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
        arr.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue

df['Already_linked'] = np.where((df.a != 0) & (df.a != 'No_link_found') & (df.A_D == 'D'), 1, df['Already_linked'])
df.Already_linked.loc[arr.Already_linked.index] = arr.Already_linked
df['a'] = np.where((df.Already_linked  == 0) & (df.A_D == 'D'),'No_link_found',df['a'])

Code for the initial c dataframe:

import numpy as np
import pandas as pd
import io

s = '''
 A_D     Operator     FlightID    Terminal   TROUND_ID   tot
 A   QR  QR001   4   QR002       70
 D   DL  DL001   3   "        "  84
 D   DL  DL001   3   "        "  78
 D   VS  VS001   3   "        "  45
 A   DL  DL401   3   "        "  9
 A   DL  DL401   3   "        "  19
 A   DL  DL401   3   "        "  3
 A   DL  DL401   3   "        "  32
 A   DL  DL401   3   "        "  95
 A   DL  DL402   3   "        "  58
'''

data_aux = pd.read_table(io.StringIO(s), delim_whitespace=True)
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot= data_aux.tot.astype(str)

d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00','2017-03-27 09:30:00','2017-10-08 15:15:00',
           '2017-03-26 06:50:00','2017-03-27 06:50:00','2017-03-29 06:50:00','2017-05-03 06:50:00',
           '2017-06-25 06:50:00','2017-03-26 07:45:00'], 'END': ['2017-10-28 16:55:00' ,'2017-06-11 09:30:00' ,
           '2017-10-28 09:30:00' ,'2017-10-22 15:15:00','2017-06-11 06:50:00' ,'2017-10-28 06:50:00', 
           '2017-04-19 06:50:00' ,'2017-10-25 06:50:00','2017-10-22 06:50:00' ,'2017-10-28 07:45:00']}    

aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)

c['hour'] = pd.to_datetime(c['START'], format='%H:%M').dt.time
c['hour_aux'] = pd.to_datetime(c['START'] - pd.Timedelta(15, unit='m'), 
format='%H:%M').dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START -  pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != "        ", 1 ,0)

arr = c[c['A_D'] == 'A']



from Python - For loop millions of rows

No comments:

Post a Comment