I'm trying to figure out why this happens:
In [1]: import time, h5py as h5
In [2]: f = h5.File('myfile.hdf5', 'r')
In [3]: st = time.time(); data = f["data"].value[0,:,1,...]; elapsed = time.time() - st;
In [4]: elapsed
Out[4]: 11.127676010131836
In [5]: st = time.time(); data = f["data"][0,:,1,...]; elapsed2 = time.time() - st;
In [6]: elapsed2
Out[6]: 59.810582399368286
In [7]: f["data"].shape
Out[7]: (1, 4096, 6, 16, 16, 16, 16)
In [8]: f["data"].chunks
Out[8]: (1, 4096, 1, 16, 16, 16, 16)
As you can see, loading the whole dataset into memory and then slicing it with NumPy is much faster than reading that same slice directly from the dataset on disk.
The chunk shape matches the slice, so the slice should correspond to exactly one contiguous chunk on disk, right? Why, then, is reading it directly so much slower?
The dataset is compressed with gzip (compression_opts=2).
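For reference, a scaled-down reproduction of the setup looks roughly like the sketch below. The file name repro.hdf5, the reduced length of the second axis (64 instead of 4096), and the random data are placeholders of mine, and dset[()] is used for the full read because the .value property is deprecated in newer h5py versions:

import time
import numpy as np
import h5py as h5

# Same rank, chunk pattern and gzip level as above, but with the
# length-4096 axis scaled down so the file is quick to write.
# Random data, so absolute timings and compression ratios will differ.
shape  = (1, 64, 6, 16, 16, 16, 16)
chunks = (1, 64, 1, 16, 16, 16, 16)

with h5.File('repro.hdf5', 'w') as f:
    f.create_dataset('data',
                     data=np.random.random_sample(shape).astype('f4'),
                     chunks=chunks,
                     compression='gzip',
                     compression_opts=2)

with h5.File('repro.hdf5', 'r') as f:
    dset = f['data']

    # Read the whole dataset into memory, then slice in NumPy
    # (dset[()] is the current spelling of the deprecated .value).
    t0 = time.perf_counter()
    whole = dset[()]
    a = whole[0, :, 1, ...]
    t_full = time.perf_counter() - t0

    # Read the same slice directly through h5py/HDF5.
    t0 = time.perf_counter()
    b = dset[0, :, 1, ...]
    t_slice = time.perf_counter() - t0

print('full read + NumPy slice: %.3f s' % t_full)
print('direct slice read:       %.3f s' % t_slice)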
From: Why is it faster to read whole hdf5 dataset than a slice?