I have a really simple bit of code, where I have a group of file names and I need to open each one and extract some data to later manipulate.
for file in unique_file_names[1:]:
file_name = rootdir + "/" + str(file)
test_time = time.clock()
try:
wb_loop = load_workbook(file_name, read_only=True, data_only=True)
ws_loop = wb_loop["SHEET1"]
df = pd.DataFrame(ws_loop.values)
print("Opening Workbook: ", time.clock()-test_time)
newarray = np.vstack((newarray, df.loc[4:43,:13].values))
print("Data Manipulation: ", time.clock()-test_time)
So I've tried a few different modules to read in excel files, including directly using pandas.read_excel() and this is the optimum method, managing to get the time to open the workbook to 1.5-2s, and the numpy stacking takes 0.03 seconds ish.
I think allocating the data to a third dimension in the array based on an index would probably be quicker but I'm more focused on speeding up the time to load the spreadsheets, any suggestions?
Edit: I did also create a multithread pool to try and speed this up but for some reason it started using 15Gb ram and crashed my computer
from Speeding Up Excel Data to Pandas
No comments:
Post a Comment