Goal: Produce a time-indexed Zarr data cube of forest-loss masks for the Amazonas region to enable scalable analysis and visualization. This example is based on the Sentinel for Amazonas project.
Classes:
- 1 — forest loss
- 0 — no data / background (treated as missing)
Study area and grid:
- CRS: EPSG:4326
- Bounds: -79.1997438, -17.27354892, -44.91217178, 9.0467434
- Resolution: 0.00017966 degrees (see the gdalwarp step; a quick grid-size estimate follows the data layout list below)
Workflow overview:
1. Convert each GeoTIFF mosaic to a per-date Zarr store with gdalwarp.
2. Build a sorted list of Zarr stores and timestamps.
3. Initialize an empty target cube (cube.zarr) with the full time axis.
4. Append each slice into the cube using region writes.
5. Tune chunking for performance and memory.
Requirements:
- GDAL built with the Zarr driver (gdalwarp -of Zarr)
- Python: xarray, rioxarray, zarr, dask[distributed], numpy, pandas; optionally hvplot.xarray for quick QA
Data layout:
- Input: mosaics/*.tif (per-date masks)
- Intermediate: mosaics/*.zarr (per-date stores)
- Output: cube.zarr (time-series cube)
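To get a sense of scale, here is a quick back-of-the-envelope estimate of the per-slice grid size implied by the bounds and resolution above (plain Python, not part of the workflow itself):
# Rough grid size implied by the Amazonas bounds and the 0.00017966-degree resolution
xmin, ymin, xmax, ymax = -79.1997438, -17.27354892, -44.91217178, 9.0467434
res = 0.00017966
ncols = round((xmax - xmin) / res)   # ~190,847 columns
nrows = round((ymax - ymin) / res)   # ~146,500 rows
print(ncols, nrows, f"~{ncols * nrows / 1e9:.0f} billion pixels per time slice")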
Discussion¶
When handling many temporal slices in Earth Observation, storing them as a single Zarr data cube is far superior to managing hundreds of COGs linked via a STAC collection because Zarr treats time as a first-class dimension within a unified array structure rather than as disconnected files. This enables efficient temporal queries and parallel access using chunked, compressed blocks that map directly to analysis patterns like pixel-wise time series or rolling statistics, all without the per-file latency, catalog management, and HTTP overhead that cripple COG-based stacks at scale.
To learn more, read the EarthCODE data best practices pages.
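For example, once the cube exists, a pixel-wise time series or a small spatial aggregate is a single lazy selection against one store instead of hundreds of per-file reads. A minimal sketch (the coordinates are illustrative and assume the cube.zarr produced below, with its forest_loss variable):
import xarray as xr
# Open the finished cube lazily (Dask-backed, chunked as stored)
ds = xr.open_zarr("cube.zarr")
# Pixel-wise time series at one (illustrative) location
ts = ds["forest_loss"].sel(x=-60.0, y=-3.0, method="nearest")
print(ts.compute())
# Loss pixels per date over a small window; only the needed chunks are read.
# Note: the y slice order must match the cube's orientation (north-up stores have descending y).
subset = ds["forest_loss"].sel(x=slice(-60.5, -60.0), y=slice(-2.5, -3.0))
print(subset.count(dim=("x", "y")).compute())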
1) Convert GeoTIFFs to Zarr with GDAL¶
Use GDAL to warp each input GeoTIFF to EPSG:4326, clip to the Amazonas extent, set the target resolution, and write per-date Zarr stores. Adjust IN_DIR, OUT_DIR, bounds (-te), resolution (-tr), and compression to match your data and hardware.
Tip: Confirm Zarr support with gdalinfo --formats | grep -i zarr.
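If you prefer to check from Python, the GDAL bindings expose the same driver registry (this assumes the osgeo Python bindings are installed):
from osgeo import gdal
# GetDriverByName returns None when this GDAL build lacks the Zarr driver
print("Zarr driver available:", gdal.GetDriverByName("Zarr") is not None)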
%%bash
#!/usr/bin/env bash
set -euo pipefail
IN_DIR="mosaics"
OUT_DIR="mosaics"
for tif in "${IN_DIR}"/*.tif; do
base="$(basename "${tif}" .tif)"
out="${OUT_DIR}/${base}.zarr"
echo "Warping ${tif} -> ${out}"
gdalwarp -overwrite \
-t_srs EPSG:4326 \
-te -79.1997438 -17.27354892 -44.91217178 9.0467434 \
-tr 0.00017966 0.00017966 \
-r near \
-dstnodata 0 \
"${tif}" \
"${out}" \
-of Zarr \
-multi -wo NUM_THREADS=ALL_CPUS \
-co COMPRESS=ZSTD -co ZSTD_LEVEL=3 -co BLOCKSIZE=512,512
done
2) Python setup and file discovery¶
Import libraries, set the input folder (mosaics/), and collect per-date Zarr stores produced in step 1. Timestamps are parsed from filenames like prefix_YYYYMMDD.zarr. If your naming differs, adjust the parsing logic accordingly.
import os
import rioxarray
import xarray as xr
import hvplot.xarray
import pandas as pd
from pathlib import Path
folder = "mosaics"
files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith(".zarr")]
# print(files)
timestamps = [
pd.to_datetime(os.path.splitext(Path(f).name)[0].split("_")[1]) for f in files
]
# timestamps
Sort files by timestamp¶
Ensure files and timestamps are aligned chronologically for consistent time indexing.
import numpy as np
ii = np.argsort(timestamps)
timestamps = np.array(timestamps)[ii]
files = np.array(files)[ii]
# files
Optional: Start a Dask client¶
Spins up a local Dask scheduler/workers to parallelize IO and computation.
from distributed.client import Client
cl = Client()
cl
3) Initialize target cube (template)¶
Define chunk sizes and build a lazy template dataset that spans the full time axis, using the first slice as the spatial grid. This pre-allocates the cube.zarr store with the desired chunking.
Note: Adjust chunk sizes (x, y, time) to balance memory, IO, and parallelism for your environment.
%%time
chunk_size = {"time": 1, "x": 512*10, "y": 512*10}
# Open the first per-date store lazily and use it as the spatial template
da = xr.open_dataset(files[0], engine="zarr", chunks={"X": 512*10, "Y": 512*10})
da = da[next(iter(da.data_vars))].rename({"X": "x", "Y": "y"})
da.name = "forest_loss"
# Expand over the full time axis so the template spans every slice
temp_data = da.expand_dims({'time': timestamps.tolist()})
temp_das = xr.Dataset({'forest_loss': temp_data})
temp_das = temp_das.chunk(chunk_size)
temp_das
Write template to cube.zarr¶
Create the target store with the full time axis and chosen chunks without writing data yet (compute=False).
# write the template that has all the timestamps (metadata only, no data yet)
temp_das.forest_loss.attrs.pop("_FillValue", None)
enc = {
    "forest_loss": {
        "_FillValue": np.nan,
        "dtype": np.float32,
    }
}
temp_das.to_zarr('cube.zarr', mode='w', compute=False, encoding=enc, write_empty_chunks=False,
                 safe_chunks=True)
Chunking strategy¶
- Choose chunk sizes that fit in memory and align with access patterns (typical: larger x/y, small time).
- You can override chunks when opening inputs, e.g.: xr.open_dataset(f, engine="zarr", chunks={"X": 512*10, "Y": 512*10}).
- Keep chunk shapes consistent across inputs and the target cube for best performance (a quick way to inspect the resulting layout is sketched below).
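One quick way to confirm that the store ended up with the intended layout is to inspect it directly with the zarr package (a QA check, not part of the pipeline):
import zarr
import xarray as xr
# Open the store read-only and compare array shape vs. chunk shape
grp = zarr.open("cube.zarr", mode="r")
arr = grp["forest_loss"]
print("shape: ", arr.shape)
print("chunks:", arr.chunks)
# The same information appears in xarray's encoding after opening the store
print(xr.open_zarr("cube.zarr")["forest_loss"].encoding.get("chunks"))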
4) Append slices to the cube¶
For each per-date Zarr store:
- Open with xarray and rename dims X/Y to x/y.
- Convert 0 to missing (NaN) so background doesn't accumulate, and skip writing empty chunks.
- Attach CRS (EPSG:4326) and set spatial dims via rioxarray.
- Expand with a singleton time coordinate and write into cube.zarr at the matching time index using region.
for i, (date, f) in enumerate(zip(timestamps, files)):
    da = xr.open_dataset(f,
                         engine="zarr", chunks={"X": 512*10, "Y": 512*10})
    da = da[next(iter(da.data_vars))]          # single mask variable
    da = da.rename({"X": "x", "Y": "y"})
    da.name = "forest_loss"
    da = da.where(da != 0)                     # 0 (background) -> NaN
    da = da.rio.write_crs("EPSG:4326")
    da = da.rio.set_spatial_dims(x_dim="x", y_dim="y")  # <-- solves MissingSpatialDimensionError
    da = da.expand_dims({'time': [date]})
    da = da.chunk(chunk_size)
    da.attrs.pop("_FillValue", None)           # avoid clashing with the store's encoding
    # Encoding (dtype, _FillValue) was fixed when the template was written,
    # so it must not be passed again for a region write into the existing store.
    da.drop_vars(['x', 'y', 'spatial_ref']).to_dataset().to_zarr(
        'cube.zarr', write_empty_chunks=False, safe_chunks=True,
        region={'time': slice(i, i + 1)})
    print(i)
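After the loop completes, a quick QA pass helps confirm that each slice landed at the right time index. A minimal sketch (counts are in pixels; hvplot is the optional dependency listed above):
import xarray as xr
import hvplot.xarray  # noqa: F401  (optional, for the quick plot)
ds = xr.open_zarr("cube.zarr")
# Non-NaN pixels per date = forest-loss pixels per date (background was set to NaN)
loss_pixels = ds["forest_loss"].count(dim=("x", "y")).compute()
print(loss_pixels.to_series())
# Coarse visual check of one date (subsampled to keep it light)
ds["forest_loss"].isel(time=0)[::100, ::100].hvplot.image(x="x", y="y")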