Friday, 21 June 2019

Speeding Up Excel Data to Pandas

I have a really simple bit of code: given a group of file names, I need to open each file and extract some data to manipulate later.

import time

import numpy as np
import pandas as pd
from openpyxl import load_workbook

for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.perf_counter()  # perf_counter() replaces the deprecated time.clock()
    try:
        wb_loop = load_workbook(file_name, read_only=True, data_only=True)
        ws_loop = wb_loop["SHEET1"]
        df = pd.DataFrame(ws_loop.values)
        print("Opening Workbook:          ", time.perf_counter() - test_time)

        # Stack rows 4-43, columns 0-13 of this sheet onto the running array
        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        print("Data Manipulation:         ", time.perf_counter() - test_time)
    except Exception:
        pass  # skip files that fail to load

So I've tried a few different modules for reading Excel files, including pandas.read_excel() directly, and this is the fastest method I've found: opening the workbook takes 1.5-2 s, and the numpy stacking takes roughly 0.03 s.
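One variation worth timing (a sketch, not tested against the real files): in read-only mode, openpyxl can stream just the rows that are needed via iter_rows, rather than materialising the whole sheet through ws.values before slicing. The demo workbook below is a throwaway stand-in for the real files, and the bounds assume the df.loc[4:43, :13] slice corresponds to sheet rows 5-44, columns A-N.

```python
import os

import numpy as np
from openpyxl import Workbook, load_workbook

# Build a small throwaway workbook so the sketch is self-contained;
# the real files would already exist under rootdir.
path = "demo.xlsx"
wb = Workbook()
ws = wb.active
ws.title = "SHEET1"
for r in range(1, 51):
    ws.append([r * 100 + c for c in range(14)])
wb.save(path)

# In read_only mode, iter_rows with explicit bounds streams only the
# cells we need, instead of pulling the entire sheet via ws.values.
wb_loop = load_workbook(path, read_only=True, data_only=True)
ws_loop = wb_loop["SHEET1"]
rows = ws_loop.iter_rows(min_row=5, max_row=44, min_col=1, max_col=14,
                         values_only=True)
block = np.array(list(rows))  # shape (40, 14)
wb_loop.close()
os.remove(path)
```

Whether this wins depends on how much of each sheet sits outside the slice; for a 40-row window in a large sheet it should cut the per-file parse time noticeably.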

I suspect allocating the data into a third dimension of the array, indexed by file, would be quicker, but I'm more focused on speeding up the time it takes to load the spreadsheets. Any suggestions?
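The third-dimension idea can be sketched with plain numpy (dummy blocks stand in for the per-file slices, and the 40 x 14 shape matches the df.loc[4:43, :13] slice above):

```python
import numpy as np

# Hypothetical stand-ins for the per-file slices: n_files blocks,
# each 40 rows x 14 columns.
n_files = 5
blocks = [np.full((40, 14), i, dtype=float) for i in range(n_files)]

# Preallocate once and assign by index: one copy per file, versus
# np.vstack, which re-copies the whole accumulated array every call.
data = np.empty((n_files, 40, 14))
for i, block in enumerate(blocks):
    data[i] = block

# data[i] is file i's block; data.reshape(-1, 14) recovers the
# stacked 2-D layout that the vstack loop produced.
```

At 0.03 s per stack this only matters for large file counts, which is consistent with the loading time dominating.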

Edit: I also tried a multithreaded pool to speed this up, but for some reason it started using 15 GB of RAM and crashed my computer.
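One plausible cause of the memory blow-up is many workbooks being open in memory at once; a pool with a small, fixed worker count bounds that. The sketch below uses a stand-in loader (the real one would call load_workbook read-only and return the 40 x 14 slice for that file) and hypothetical file names:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def load_block(file_name):
    """Stand-in loader; the real version would open the workbook
    read-only and return df.loc[4:43, :13].values for that file."""
    return np.zeros((40, 14))

file_names = [f"file_{i}.xlsx" for i in range(8)]  # hypothetical names

# max_workers caps how many workbooks are being parsed (and held in
# memory) simultaneously; executor.map keeps results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(load_block, file_names))

stacked = np.stack(results)  # shape (8, 40, 14)
```

Only the small extracted blocks are retained, not the workbooks, so peak memory stays close to a couple of open files rather than all of them.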



from Speeding Up Excel Data to Pandas
