
Sen4Amazonas: Time-Series Forest Loss Zarr Cube

Goal: Produce a time-indexed Zarr data cube of forest-loss masks for the Amazonas region to enable scalable analysis and visualization. This example is based on the Sentinel for Amazonas project.

Classes:

  • 1 — forest loss

  • 0 — no data / background (treated as missing)

Study area and grid:

  • CRS: EPSG:4326

  • Bounds: -79.1997438, -17.27354892, -44.91217178, 9.0467434

  • Resolution: 0.00017966 degrees (see gdalwarp step)

Workflow overview:

  1. Convert each GeoTIFF mosaic to per-date Zarr with gdalwarp.

  2. Build a sorted list of Zarr stores and timestamps.

  3. Initialize an empty target cube (cube.zarr) with the full time axis.

  4. Append each slice into the cube using region writes.

  5. Tune chunking for performance and memory.

Requirements:

  • GDAL built with Zarr driver (gdalwarp -of Zarr)

  • Python: xarray, rioxarray, zarr, dask[distributed], numpy, pandas, optionally hvplot.xarray for quick QA

Data layout:

  • Input: mosaics/*.tif (per-date masks)

  • Intermediate: mosaics/*.zarr (per-date stores)

  • Output: cube.zarr (time-series cube)

Discussion

When handling many temporal slices in Earth Observation, storing them as a single Zarr data cube is far more practical than managing hundreds of COGs linked via a STAC collection: Zarr treats time as a first-class dimension within a unified array structure rather than as disconnected files. This enables efficient temporal queries and parallel access through chunked, compressed blocks that map directly onto analysis patterns such as pixel-wise time series or rolling statistics, without the per-file latency, catalog management, and HTTP overhead that slow COG-based stacks at scale.
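For example, once the cube exists, reading a pixel-wise time series is a single lazy selection against one store rather than hundreds of file opens. A minimal sketch, assuming the cube.zarr store built later in this notebook and illustrative coordinates:

import xarray as xr

cube = xr.open_zarr("cube.zarr")  # lazy: only metadata is read here
# One chunked read per time chunk, no per-file round trips
ts = cube["forest_loss"].sel(x=-60.0, y=-3.0, method="nearest")
print(ts.compute())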

To learn more, read the EarthCODE data best practices pages.

1) Convert GeoTIFFs to Zarr with GDAL

Use GDAL to warp each input GeoTIFF to EPSG:4326, clip to the Amazonas extent, set the target resolution, and write per-date Zarr stores. Adjust IN_DIR, OUT_DIR, bounds (-te), resolution (-tr), and compression to match your data and hardware.

Tip: Confirm Zarr support with gdalinfo --formats | grep -i zarr.

%%bash
set -euo pipefail

IN_DIR="mosaics"
OUT_DIR="mosaics"

for tif in "${IN_DIR}"/*.tif; do
  base="$(basename "${tif}" .tif)"
  out="${OUT_DIR}/${base}.zarr"

  echo "Warping ${tif} -> ${out}"
  gdalwarp -overwrite \
    -of Zarr \
    -t_srs EPSG:4326 \
    -te -79.1997438 -17.27354892 -44.91217178 9.0467434 \
    -tr 0.00017966 0.00017966 \
    -r near \
    -dstnodata 0 \
    -multi -wo NUM_THREADS=ALL_CPUS \
    -co COMPRESS=ZSTD -co ZSTD_LEVEL=3 -co BLOCKSIZE=512,512 \
    "${tif}" \
    "${out}"
done
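Before building the cube, it is worth opening one converted store to confirm the grid, dtype, and block size. A quick check, assuming a store produced by the loop above (the filename is illustrative):

import xarray as xr

# chunks={} picks up the on-disk 512x512 blocks as dask chunks
ds = xr.open_dataset("mosaics/forestloss_20200101.zarr", engine="zarr", chunks={})
print(ds)         # expect X/Y sizes matching the -te/-tr grid
print(ds.chunks)  # expect blocks from BLOCKSIZE=512,512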

2) Python setup and file discovery

Import libraries, set the input folder (mosaics/), and collect the per-date Zarr stores produced in step 1. Timestamps are parsed from filenames like prefix_YYYYMMDD.zarr; if your naming differs, adjust the parsing logic accordingly (a more forgiving regex variant is sketched below).

import os
import rioxarray
import xarray as xr
import hvplot.xarray
import pandas as pd
from pathlib import Path

folder = "mosaics"
files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith(".zarr")]
# print(files)

timestamps = [
    pd.to_datetime(os.path.splitext(Path(f).name)[0].split("_")[1]) for f in files
]
# timestamps
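If dates can appear anywhere in the filename, a regex is more forgiving than splitting on underscores. A hypothetical variant that extracts the first 8-digit run:

import re

def parse_date(path):
    # Hypothetical helper: grab the first YYYYMMDD run in the filename
    m = re.search(r"\d{8}", Path(path).name)
    if m is None:
        raise ValueError(f"no YYYYMMDD date found in {path}")
    return pd.to_datetime(m.group(0), format="%Y%m%d")

timestamps = [parse_date(f) for f in files]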

Sort files by timestamp

Ensure files and timestamps are aligned chronologically for consistent time indexing.

import numpy as np
ii = np.argsort(timestamps)
timestamps = np.array(timestamps)[ii]
files = np.array(files)[ii]
# files
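The region writes below address slices by integer position along the time axis, so duplicate dates would silently overwrite each other. A cheap guard:

# Duplicate dates would map two files onto the same region index
assert pd.Index(timestamps).is_unique, "duplicate timestamps found"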

Optional: Start a Dask client

Spins up a local Dask scheduler/workers to parallelize IO and computation.

from dask.distributed import Client
cl = Client()
cl
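Client() sizes itself from the local machine; on memory-constrained hosts it can help to pin workers explicitly. A sketch with illustrative values:

from dask.distributed import Client

# Illustrative sizing: 4 processes x 2 threads, 4 GB memory each
cl = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")
cl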

3) Initialize target cube (template)

Define chunk sizes and build a template dataset containing only the time coordinate. This pre-allocates the cube.zarr store with the desired chunking.

Note: Adjust chunk sizes (x, y, time) to balance memory, IO, and parallelism for your environment.

%%time
chunk_size = {"time": 1, "x": 512 * 10, "y": 512 * 10}

# Use the first slice as the spatial template for the full cube
da = xr.open_dataset(files[0], engine="zarr", chunks={"X": 512 * 10, "Y": 512 * 10})
da = da[next(iter(da.data_vars))].rename({"X": "x", "Y": "y"})
da.name = "forest_loss"

temp_data = da.expand_dims({"time": timestamps.tolist()})
temp_das = xr.Dataset({"forest_loss": temp_data})
temp_das = temp_das.chunk(chunk_size)
temp_das

Write template to cube.zarr

Create the target store with the full time axis and chosen chunks without writing data yet (compute=False).

# write template data that has all the timestamps
temp_das.forest_loss.attrs.pop("_FillValue", None)

enc = {
    "forest_loss": {
        "_FillValue": np.nan,
        "dtype": np.float32,
    }
}

temp_das.to_zarr('cube.zarr', mode='w', compute=False, encoding=enc, write_empty_chunks=False, 
                 safe_chunks=True)
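Re-opening the store confirms the full time axis exists before any slice is written; at this point the array is lazy and reads as all-NaN:

check = xr.open_zarr("cube.zarr")
print(check.time.size)  # should equal len(timestamps)
print(check.forest_loss.dtype, check.forest_loss.chunks)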

Chunking strategy

  • Choose chunk sizes that fit in memory and align with access patterns (typical: larger x/y, small time); the sketch after this list works out the memory footprint of the chunking used here.

  • You can override chunks when opening inputs, e.g.: xr.open_dataset(f, engine="zarr", chunks={"X": 512*10, "Y": 512*10}).

  • Keep chunk shapes consistent across inputs and the target cube for best performance.
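As a rule of thumb, keep each chunk in the low hundreds of megabytes. The chunking used above works out as follows:

nx = ny = 512 * 10  # spatial chunk edge used in chunk_size
bytes_per_chunk = nx * ny * np.dtype(np.float32).itemsize
print(f"{bytes_per_chunk / 1e6:.0f} MB per (time=1) chunk")  # ~105 MB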

4) Append slices to the cube

For each per-date Zarr store:

  • Open with xarray and rename dims X/Y to x/y.

  • Convert 0 to missing (NaN) so background values don't accumulate, and skip writing empty chunks.

  • Attach CRS (EPSG:4326) and set spatial dims via rioxarray.

  • Expand with a singleton time coordinate and write into cube.zarr at the matching time index using region.

for i, (date, f) in enumerate(zip(timestamps, files)):
    da = xr.open_dataset(f, engine="zarr", chunks={"X": 512 * 10, "Y": 512 * 10})
    da = da[next(iter(da.data_vars))]
    da = da.rename({"X": "x", "Y": "y"})
    da.name = "forest_loss"
    da = da.where(da != 0)       # 0 -> NaN so background stays missing
    da = da.astype(np.float32)   # match the dtype fixed by the template encoding
    da = da.rio.write_crs("EPSG:4326")
    da = da.rio.set_spatial_dims(x_dim="x", y_dim="y")   # <-- solves MissingSpatialDimensionError
    da = da.expand_dims({"time": [date]})
    da = da.chunk(chunk_size)

    ds = da.to_dataset()
    ds.forest_loss.attrs.pop("_FillValue", None)

    # Encoding is already fixed by the template write; passing it again on a
    # region write would raise, so only the data is written here.
    ds.drop_vars(["x", "y", "spatial_ref"]).to_zarr(
        "cube.zarr",
        mode="r+",
        write_empty_chunks=False,
        safe_chunks=True,
        region={"time": slice(i, i + 1)},
    )
    print(f"wrote slice {i}: {date}")
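As a quick QA pass, re-open the finished cube and summarize the loss masks per time step; hvplot (imported above) can render a single date interactively. The coarsening factors below are illustrative:

cube = xr.open_zarr("cube.zarr")

# Count of forest-loss pixels per time step: a coarse sanity check
loss_per_step = cube["forest_loss"].count(dim=("x", "y")).compute()
print(loss_per_step.to_series())

# Interactive map of the latest slice, coarsened for speed
cube["forest_loss"].isel(time=-1).coarsen(x=100, y=100, boundary="trim").max().hvplot.image(x="x", y="y")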