Tuesday, 1 December 2020

Randomly selected file take longer to load with numpy.load than sequential ones

Context

While training a neural network I realised the time spent per batch increased when I increased the size of my dataset (without changing the batch size). The important part is, I need to fetch 20 .npy files per data point, this number doesn't depend on the dataset size.

Problem

Training goes from 2s/iteration to 10s/iteration... There is no apparent reason why training would take longer. However, I managed to track down the bottleneck. It seems to have to do with the loading of the .npy files.


To reproduce this behavior, here's a small script you can run to generate 10,000 dummy .npy files:

def path(i):
    return os.sep.join(('../datasets/test', str(i)))

def create_dummy_files(N=10000):
    for i in range(N):
        x = np.random.random((100, 100))
        np.save(path(random.getrandbits(128)), x)

Then you can run the following two scripts and compare them yourself:

  • The first script where 20 .npy files are randomly selected and loaded:

    L = os.listdir('../datasets/test')
    S = random.sample(L, 20)
    for s in S:
        np.load(path(s)) # <- timed this
    
  • The second version where 20 .npy 'sequential' files are selected and loaded.

    L = os.listdir('../datasets/test')
    i = 100
    S = L[i: i + 20]
    for s in S:
        np.load(path(s)) # <- timed this
    

I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice). I wrapped the np.load(path(s)) line with time.time() calls. I'm not timing the sampling, only the loading. Here are the results:

  • Random loads (times roughly stay between 0.1s and 0.4s, average is 0.25s):

    enter image description here

  • Non random loads (times roughly stay between 0.010s and 0.014s, average is 0.01s):

    enter image description here


I'm assuming those times are related to the CPU's activity when the scripts are loaded. However it doesn't explain this gap. Why are these two results so different? Is there something to do with the way files are indexed?

Edit: I printed S in the random sample script, copied the list of 20 filenames then ran it again with S as an list literally defined. The time it took is comparable to the 'sequential' script. Which means it's not related to files not being sequential in the fs or anything. It seems the random sampling gets counted in the timer, yet timing is defined as:

t = time.time()
np.load(path(s))
print(time.time() - t)

I tried as well wrapping np.load (exclusively) with cProfile: same result.



from Randomly selected file take longer to load with numpy.load than sequential ones

No comments:

Post a Comment