Writing xarray datasets to AWS S3 takes a surprisingly long time, even when no data is actually written because of compute=False.
Here's an example:
import fsspec
import xarray as xr

# open a small example dataset and build (but don't execute) the write
x = xr.tutorial.open_dataset("rasm")
target = fsspec.get_mapper("s3://bucket/target.zarr")
task = x.to_zarr(target, compute=False)
Even without actually computing anything, to_zarr takes around 6 seconds from an EC2 instance in the same region as the S3 bucket.
Looking at the debug logs, there seems to be quite a bit of redirecting going on, as the default region in aiobotocore is set to us-east-2 while the bucket is in eu-central-1.
If I first manually set the default region in the environment variables with

import os
os.environ['AWS_DEFAULT_REGION'] = 'eu-central-1'

then the required time drops to around 3.5 seconds.
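To check how much the environment tweak helps, one can time the graph construction around it. A minimal sketch follows; the bucket URL is a placeholder and the actual S3 call is left commented out, since it needs s3fs and AWS credentials:

```python
import os
import time

# Assumption: setting the region before the S3 filesystem is first created
# avoids the us-east-2 -> eu-central-1 redirects seen in the debug logs.
os.environ["AWS_DEFAULT_REGION"] = "eu-central-1"

start = time.perf_counter()
# Placeholder for the real call, which needs s3fs + AWS credentials:
# task = x.to_zarr(fsspec.get_mapper("s3://bucket/target.zarr"), compute=False)
elapsed = time.perf_counter() - start
print(f"to_zarr(compute=False) took {elapsed:.2f} s")
```

Running this once with and once without the environment variable set (in fresh processes, so no cached session carries over) makes the ~6 s vs. ~3.5 s difference reproducible.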
So my questions are:

- Is there any way to pass the region to fsspec (or s3fs)? I've tried adding s3_additional_kwargs={"region": "eu-central-1"} to the get_mapper call, but that didn't do anything.
- Is there any better way to interface with zarr on S3 from xarray than the above (with fsspec)?
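One avenue worth trying for the first question: s3fs accepts a client_kwargs dict that it forwards to the underlying (aio)botocore client, so the region can be pinned per-filesystem rather than via the environment. A hedged sketch (the bucket name is a placeholder; the S3 calls are commented out because they need credentials):

```python
# Assumption: s3fs forwards client_kwargs to the aiobotocore client, so
# region_name here pins the region for this filesystem instance only.
storage_options = {"client_kwargs": {"region_name": "eu-central-1"}}

# fsspec.get_mapper passes extra keyword arguments on to the filesystem
# constructor (here: S3FileSystem). Needs s3fs + AWS credentials:
# import fsspec
# target = fsspec.get_mapper("s3://bucket/target.zarr", **storage_options)
# task = x.to_zarr(target, compute=False)

# Newer xarray versions (later than the 0.17 used here) can also take the
# URL plus a storage_options argument and build the mapper themselves:
# task = x.to_zarr("s3://bucket/target.zarr",
#                  storage_options=storage_options, compute=False)
```

Note that client_kwargs is distinct from s3_additional_kwargs: the latter adds parameters to individual S3 API calls (e.g. put_object), which is why passing the region there has no effect on where the client connects.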
versions:
xarray: 0.17.0
zarr: 2.6.1
fsspec: 0.8.4
from Zarr: improve xarray writing performance to S3