I am trying to efficiently restructure a large multidimensional dataset. Let's assume I have a number of remotely sensed images acquired over time, each with several bands, with coordinates x and y for pixel location, time for the time of image acquisition, and band for the different data collected.
In my use case, let's assume the xarray coordinate lengths are roughly x (3000), y (3000), time (10), and band (40), holding floating-point data. So 100 GB+ of data.
I have been trying to work from this example but I am having trouble translating it to this case.
Small dataset example
NOTE: the actual data is much larger than this example.
import numpy as np
import dask.array as da
import xarray as xr
nrows = 100
ncols = 200
row_chunks = 50
col_chunks = 50
data = da.random.random(size=(1, nrows, ncols), chunks=(1, row_chunks, col_chunks))
def create_band(data, x, y, band_name):
    return xr.DataArray(data,
                        dims=('band', 'y', 'x'),
                        coords={'band': [band_name],
                                'y': y,
                                'x': x})
def create_coords(data, left, top, celly, cellx):
    nrows = data.shape[-2]
    ncols = data.shape[-1]
    right = left + cellx*ncols
    bottom = top - celly*nrows
    x = np.linspace(left, right, ncols) + cellx/2.0
    y = np.linspace(top, bottom, nrows) - celly/2.0
    return x, y
x, y = create_coords(data, 1000, 2000, 30, 30)
src = []
for time in ['t1', 't2', 't3']:
    src_t = xr.concat([create_band(data, x, y, band) for band in ['blue', 'green', 'red', 'nir']], dim='band')\
        .expand_dims(dim='time')\
        .assign_coords({'time': [time]})
    src.append(src_t)
src = xr.concat(src, dim='time')
print(src)
<xarray.DataArray 'random_sample-5840d8564d778d573dd403f27c3f47a5' (time: 3, band: 4, y: 100, x: 200)>
dask.array<concatenate, shape=(3, 4, 100, 200), dtype=float64, chunksize=(1, 1, 50, 50), chunktype=numpy.ndarray>
Coordinates:
* x (x) float64 1.015e+03 1.045e+03 1.075e+03 ... 6.985e+03 7.015e+03
* band (band) object 'blue' 'green' 'red' 'nir'
* y (y) float64 1.985e+03 1.955e+03 1.924e+03 ... -984.7 -1.015e+03
* time (time) object 't1' 't2' 't3'
Restructured - stacked and transposed
I need to store the output of the following:
print(src.stack(sample=('y','x','time')).T)
<xarray.DataArray 'random_sample-5840d8564d778d573dd403f27c3f47a5' (sample: 60000, band: 4)>
dask.array<transpose, shape=(60000, 4), dtype=float64, chunksize=(3600, 1), chunktype=numpy.ndarray>
Coordinates:
* band (band) object 'blue' 'green' 'red' 'nir'
* sample (sample) MultiIndex
- y (sample) float64 1.985e+03 1.985e+03 ... -1.015e+03 -1.015e+03
- x (sample) float64 1.015e+03 1.015e+03 ... 7.015e+03 7.015e+03
- time (sample) object 't1' 't2' 't3' 't1' 't2' ... 't3' 't1' 't2' 't3'
I am hoping to use dask and xarray to write the result to disk, ideally as HDF. However, HDF can't store the MultiIndex created by stacking multiple coordinates (in this case x, y, and time). Alternatively, I could write to any format accessible to open_mfdataset, since I will need lazy loading.
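To make the MultiIndex problem concrete, here is a minimal sketch of one possible workaround: flattening the stacked index into ordinary coordinates before writing, then rebuilding it after reading. This is not a tested solution for the full-size data. The array `small` is a tiny NumPy-backed stand-in for `src`, and the variable name `reflectance` and the sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for `src` (hypothetical sizes and name, NumPy-backed for brevity).
small = xr.DataArray(
    np.random.random((3, 4, 5, 6)),
    dims=("time", "band", "y", "x"),
    coords={
        "time": ["t1", "t2", "t3"],
        "band": ["blue", "green", "red", "nir"],
        "y": np.arange(5, dtype=float),
        "x": np.arange(6, dtype=float),
    },
    name="reflectance",
)

# Same restructuring as in the question: one row per (y, x, time) sample.
stacked = small.stack(sample=("y", "x", "time")).transpose("sample", "band")

# Turn the MultiIndex levels into plain 1-D coordinates along `sample`,
# so that no MultiIndex is left on the object.
flat = stacked.reset_index("sample")

# After the data has been read back, the MultiIndex can be rebuilt on demand.
rebuilt = flat.set_index(sample=["y", "x", "time"])
```

Assuming a suitable backend is installed, `flat.to_dataset()` could then be passed to `to_netcdf` or `to_zarr`. Whether this performs acceptably on a dask-backed array of the size described above is untested here.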
At this point I am just hoping for some good ideas. Thanks!
from Writing xarray multiindex data in chunks