Thursday, 1 July 2021

Building a Numpy array by appending data (without knowing the full size in advance)

I have many files, and each file is read as a matrix of shape (n, 1000), where n may be different from file to file.

I'd like to concatenate all of them into a single big Numpy array. I currently do this:

import glob
import numpy as np

dataset = np.zeros((0, 1000))    # start empty; zeros((100, 1000)) would leave 100 bogus zero rows at the front
for f in glob.glob('*.png'):
    x = read_as_numpyarray(f)    # custom function; x is a matrix of shape (n, 1000)
    dataset = np.vstack((dataset, x))    # copies the entire dataset on every iteration

but this is inefficient: np.vstack allocates a fresh array and copies the entire existing dataset on every iteration, so the total amount of copying grows quadratically with the number of files.

How can I do this in a better way with Numpy, so that the whole dataset is not rewritten in memory many times?
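
A sketch of one commonly suggested alternative, reusing the custom read_as_numpyarray from above: append each file's array to a plain Python list (cheap, no copying of earlier data) and concatenate once at the end, so the full dataset is allocated and copied a single time.

import glob
import numpy as np

chunks = [read_as_numpyarray(f) for f in glob.glob('*.png')]    # one (n, 1000) array per file
dataset = np.concatenate(chunks, axis=0)    # single allocation, single copy

Note that during the final np.concatenate both the chunks and the result exist at once, so peak memory is roughly twice the dataset size.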

NB: the final big Numpy array might occupy around 10 GB of memory.
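
If holding roughly twice that in RAM is not acceptable, a two-pass variant can write straight into an on-disk .npy file through Numpy's memmap support. This is only a sketch under stated assumptions: the filename dataset.npy and the float64 dtype are illustrative, and the first pass re-reads every file just to count rows, which only makes sense if reading is cheap or the row counts are obtainable some other way.

import glob
import numpy as np

files = sorted(glob.glob('*.png'))

# First pass: total row count, so the full array can be sized up front
total_rows = sum(read_as_numpyarray(f).shape[0] for f in files)

# Second pass: stream each file's rows into a memory-mapped .npy on disk
dataset = np.lib.format.open_memmap('dataset.npy', mode='w+',
                                    dtype=np.float64, shape=(total_rows, 1000))
row = 0
for f in files:
    x = read_as_numpyarray(f)
    dataset[row:row + x.shape[0]] = x
    row += x.shape[0]
dataset.flush()    # make sure everything is written to disk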



from Building a Numpy array by appending data (without knowing the full size in advance)
