Friday, 21 June 2019

Speeding Up Excel Data to Pandas

I have a really simple bit of code: given a group of file names, I need to open each file and extract some data to manipulate later.

import time

import numpy as np
import pandas as pd
from openpyxl import load_workbook

for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.perf_counter()  # perf_counter() replaces the deprecated time.clock()
    try:
        wb_loop = load_workbook(file_name, read_only=True, data_only=True)
        ws_loop = wb_loop["SHEET1"]
        df = pd.DataFrame(ws_loop.values)
        print("Opening Workbook:          ", time.perf_counter() - test_time)

        # Stack rows 4-43, columns 0-13 of this sheet onto the running array
        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        print("Data Manipulation:         ", time.perf_counter() - test_time)
    except Exception:
        pass  # skip files that fail to load

So I've tried a few different modules for reading Excel files, including pandas.read_excel() directly, and this is the fastest method I've found: opening the workbook takes 1.5-2 s, and the numpy stacking takes roughly 0.03 s.
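One variation worth timing (a sketch, not tested against the real files): in read-only mode, openpyxl can stream just the rows that are needed via iter_rows, rather than materialising the whole sheet through ws.values before slicing. The demo workbook below is a throwaway stand-in for the real files, and the bounds assume the df.loc[4:43, :13] slice corresponds to sheet rows 5-44, columns A-N.

```python
import os

import numpy as np
from openpyxl import Workbook, load_workbook

# Build a small throwaway workbook so the sketch is self-contained;
# the real files would already exist under rootdir.
path = "demo.xlsx"
wb = Workbook()
ws = wb.active
ws.title = "SHEET1"
for r in range(1, 51):
    ws.append([r * 100 + c for c in range(14)])
wb.save(path)

# In read_only mode, iter_rows with explicit bounds streams only the
# cells we need, instead of pulling the entire sheet via ws.values.
wb_loop = load_workbook(path, read_only=True, data_only=True)
ws_loop = wb_loop["SHEET1"]
rows = ws_loop.iter_rows(min_row=5, max_row=44, min_col=1, max_col=14,
                         values_only=True)
block = np.array(list(rows))  # shape (40, 14)
wb_loop.close()
os.remove(path)
```

Whether this wins depends on how much of each sheet sits outside the slice; for a 40-row window in a large sheet it should cut the per-file parse time noticeably.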

I suspect allocating the data into a third dimension of the array, indexed by file, would be quicker, but I'm more focused on speeding up the time it takes to load the spreadsheets. Any suggestions?
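The third-dimension idea can be sketched with plain numpy (dummy blocks stand in for the per-file slices, and the 40 x 14 shape matches the df.loc[4:43, :13] slice above):

```python
import numpy as np

# Hypothetical stand-ins for the per-file slices: n_files blocks,
# each 40 rows x 14 columns.
n_files = 5
blocks = [np.full((40, 14), i, dtype=float) for i in range(n_files)]

# Preallocate once and assign by index: one copy per file, versus
# np.vstack, which re-copies the whole accumulated array every call.
data = np.empty((n_files, 40, 14))
for i, block in enumerate(blocks):
    data[i] = block

# data[i] is file i's block; data.reshape(-1, 14) recovers the
# stacked 2-D layout that the vstack loop produced.
```

At 0.03 s per stack this only matters for large file counts, which is consistent with the loading time dominating.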

Edit: I also tried a multithreaded pool to speed this up, but for some reason it started using 15 GB of RAM and crashed my computer.
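One plausible cause of the memory blow-up is many workbooks being open in memory at once; a pool with a small, fixed worker count bounds that. The sketch below uses a stand-in loader (the real one would call load_workbook read-only and return the 40 x 14 slice for that file) and hypothetical file names:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def load_block(file_name):
    """Stand-in loader; the real version would open the workbook
    read-only and return df.loc[4:43, :13].values for that file."""
    return np.zeros((40, 14))

file_names = [f"file_{i}.xlsx" for i in range(8)]  # hypothetical names

# max_workers caps how many workbooks are being parsed (and held in
# memory) simultaneously; executor.map keeps results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(load_block, file_names))

stacked = np.stack(results)  # shape (8, 40, 14)
```

Only the small extracted blocks are retained, not the workbooks, so peak memory stays close to a couple of open files rather than all of them.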



from Speeding Up Excel Data to Pandas
