Friday, 13 August 2021

A faster way to fill nan values with average

I had the previous question closed so reposting it with more context. I was running this command on a dataset that is relatively large (59 gb). With a shape of (800,000, 10,500), I noticed that running on my aws ec2 instance df.fillna(df.mean()) was taking extraordinarily long, after 4 hours I just cancelled the cell from running. Is there a faster way of calculating the average and filling each nan value of each column?

Here is a sample set of the data

d = {'B19325_038E': {409606: 9.0, 403811: 53.0, 400166: 17.0, 402573: 105.0, 400130: 43.0, 404907: 21.0, 406751: 15.0, 403850: 39.0, 404089: 81.0, 409843: np.nan}, 'B08302_014E': {409606: 2.0, 403811: 156.0, 400166: 64.0, 402573: 211.0, 400130: 140.0, 404907: 90.0, 406751: 148.0, 403850: 71.0, 404089: 341.0, 409843: 91.0}, 'B17010I_026E': {409606: np.nan, 403811: 9.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: 21.0}, 'B17015_009E': {409606: 30.0, 403811: 18.0, 400166: 12.0, 402573: 5.0, 400130: 6.0, 404907: 11.0, 406751: 23.0, 403850: 49.0, 404089: 37.0, 409843: 60.0}, 'B06003_004E': {409606: 1552.0, 403811: 3562.0, 400166: 2536.0, 402573: 4911.0, 400130: 1913.0, 404907: 1888.0, 406751: 4264.0, 403850: 2087.0, 404089: 1443.0, 409843: 867.0}, 'B15001_038E': {409606: 46.0, 403811: 104.0, 400166: 89.0, 402573: 120.0, 400130: 61.0, 404907: 14.0, 406751: 60.0, 403850: 198.0, 404089: 97.0, 409843: 25.0}, 'B08130_006E': {409606: 280.0, 403811: 2325.0, 400166: 1381.0, 402573: 2907.0, 400130: 1300.0, 404907: 1528.0, 406751: 2502.0, 403850: 1278.0, 404089: 1986.0, 409843: 308.0}, 'B19201_002E': {409606: 80.0, 403811: 75.0, 400166: 24.0, 402573: 54.0, 400130: np.nan, 404907: np.nan, 406751: 43.0, 403850: 62.0, 404089: 32.0, 409843: 33.0}, 'B19325_087E': {409606: 35.0, 403811: 29.0, 400166: 33.0, 402573: 72.0, 400130: 20.0, 404907: np.nan, 406751: 39.0, 403850: 40.0, 404089: 40.0, 409843: 5.0}, 'B06003_008E': {409606: 106.0, 403811: 458.0, 400166: 296.0, 402573: 505.0, 400130: 277.0, 404907: 804.0, 406751: 1037.0, 403850: 726.0, 404089: 1854.0, 409843: 80.0}, 'B16006_003E': {409606: 30.0, 403811: 525.0, 400166: 160.0, 402573: 33.0, 400130: 386.0, 404907: 2.0, 406751: 55.0, 403850: 121.0, 404089: 686.0, 409843: 228.0}, 'C14007A_004E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_005E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_003E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C21001I_003E': {409606: 31.0, 403811: 287.0, 400166: 86.0, 402573: 25.0, 400130: 235.0, 404907: 35.0, 406751: 32.0, 403850: 73.0, 404089: 384.0, 409843: 84.0}, 'C21001I_006E': {409606: np.nan, 403811: 35.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: 13.0, 403850: 17.0, 404089: 19.0, 409843: 6.0}}

df = pd.DataFrame(data=d)

Here is a picture of my machine using htop to show you the status of where it is at when it is running df.fillna(df.mean()

enter image description here

As you can see it appears to be working, but I do not see the memory fluctuating at all and thus could be frozen? Hard to tell and a waste of money to let it just keep going more than 4 hours.

Is there a way of parallelizing df.fillna(df.mean()) to make it run faster?

To provide even more context here is what I am currently trying since up until now, no one seems to know.

def fill_nan(df, col):
    df[col].fillna(df[col].mean(),inplace=True)
    return df

col_list=all_data.columns.tolist()
l = Parallel(n_jobs=-1)(delayed(fill_nan)(df=all_data,col=cols) for cols in col_list)

The problem with this is I am getting this error

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}

Despite the error, would this method actually make the computation faster?



from A faster way to fill nan values with average

No comments:

Post a Comment