
ESA Project Results Repository: Generating STAC collections with multiple assets

This notebook shows how to generate a valid STAC collection, which is a requirement to upload research outcomes to the ESA Project Results Repository (PRR). It focuses on generating metadata for a project with multiple data files of different types.

Check the EarthCODE documentation and the PRR STAC introduction example for a more general introduction to STAC and the ESA PRR.

The code below demonstrates how to perform the necessary steps using real data from the ESA project Terrestrial Carbon Community Assimilation System (TCCAS). The focus of TCCAS is the combination of a diverse array of observational data streams with the D&B terrestrial biosphere model into a consistent picture of the terrestrial carbon, water, and energy cycles.

🔗 Check the project website: Terrestrial Carbon Community Assimilation System (TCCAS) – Website

🛢️ TCCAS Dataset: Terrestrial Carbon Community Assimilation System (TCCAS) – Data base: Sodankylä and Lapland region

# import libraries
import xarray as xr
from pystac import Item, Collection
import pystac
from datetime import datetime
from shapely.geometry import box, mapping
import json
import shapely
import numpy as np

1. Generate the parent collection

The root STAC Collection provides a general description of all project outputs that will be stored on the PRR. The PRR STAC Collection template enforces some required fields that you need to provide in order to build a valid description. Most of these metadata fields should already be available and can be extracted from your data.

# create the parent collection
collectionid = "tccas-sodankylae"


collection = Collection.from_dict({
  "type": "Collection",
  "id": collectionid,
  "stac_version": "1.1.0",
  "title": "Terrestrial Carbon Community Assimilation System: Database for Lapland and Sodankyla region",
  "description": "The Terrestrial Carbon Community Assimilation System (TCCAS) is built around the coupled D&B terrestrial biosphere model. D&B has been newly developed based on the well-established DALEC and BETHY models and builds on the strengths of each component model. In particular, D&B combines the dynamic simulation of the carbon pools and canopy phenology of DALEC with the dynamic simulation of water pools, and the canopy model of photosynthesis and energy balance of BETHY. D&B includes a set of observation operators for optical as well as active and passive microwave observations. The focus of TCCAS is the combination of this diverse array of observational data streams with the D&B model into a consistent picture of the terrestrial carbon, water, and energy cycles. TCCAS applies a variational assimilation approach that adjusts a combination of initial pool sizes and process parameters to match the observational data streams. This dataset includes Satelite, Field and model forcing data sets for Sodankylä and Lapland region.",
  "extent": {
    "spatial": {
      "bbox": [
        [
          18.00,
          65.00,
          32.00,
          69.00
        ]
      ]
    },
    "temporal": {
      "interval": [
        [
          "2011-01-01T00:00:00Z",
          "2021-12-31T00:00:00Z"
        ]
      ]
    }
  },
  "license": "various",
  "links": []

})

collection # visualise the metadata of your collection 
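The extent fields are a frequent source of validation errors. A quick pure-Python sanity check (a sketch, assuming the WGS84 `[west, south, east, north]` bbox convention used by STAC) can catch reversed bounds or malformed timestamps before upload:

```python
from datetime import datetime

bbox = [18.00, 65.00, 32.00, 69.00]
interval = ["2011-01-01T00:00:00Z", "2021-12-31T00:00:00Z"]

# bbox must be [west, south, east, north] with west < east and south < north
west, south, east, north = bbox
assert -180 <= west < east <= 180, "longitude bounds out of order or range"
assert -90 <= south < north <= 90, "latitude bounds out of order or range"

# the temporal interval must parse as ISO 8601 and be chronologically ordered
start, end = (datetime.fromisoformat(t.replace('Z', '+00:00')) for t in interval)
assert start < end, "temporal interval is reversed"
print("extent looks valid")
```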

2. Create STAC Items and STAC Assets from original dataset

The second step is to describe the different files as STAC Items and Assets. Take your time to decide how your data should be categorised, so that usability improves and navigation through the different items in the collection stays intuitive. There are multiple strategies for doing this, and this tutorial demonstrates one of them. Examples of how other ESA projects organise their data are available in the EarthCODE documentation.
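Item ids derived from human-readable dataset names should stay URL-safe. The code below builds ids with a simple `lower().replace(' ', '_')`; a slightly more robust helper (hypothetical, not part of the TCCAS code) could look like:

```python
import re

def slugify(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumeric characters to underscores."""
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')

print(slugify("Brightness temperature (SMOS TB)"))  # brightness_temperature_smos_tb
```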

2.1 Create STAC Item from Satellite Dataset

# define dataset names and base url. If the data is stored locally, adjust these paths accordingly.


root_url = 'https://lcc.inversion-lab.com/data/eo/'
data_files = {
    "Fraction of absorbed Photosynthetic Active Radiation Leaf Area Index (JRC-TIP)": "/jrc-tip/jrctip_fapar-lai_sodankyla_20110101-20220105.nc",
    "Brightness temperature (SMOS TB)": "smos/smos_l3tb/SMOS_L3TB__sodankyla.nc",
    "Soil moisture and Vegetation Optical Depth (SMOS SM and SMOS L-VOD)": "smos/smosL2/smosL2_1D_v700_sodankyla_trans.nc",
    "Solar Induced Chlorophyll Fluorescence (Sentinel 5P)": "sif/tropomi/Sodankyla_SIF_TROPOMI_final.nc4",
    "Slope (ASCAT Slope)": "ascat/local_slope.final/ASCAT_slope_so.nc",
    "Photochemical Reflectance Index (MODIS PRI)": "modis/final/PRI_ESTIMATE_SODANKYLA_SINUSOIDAL.nc",
    "Land Surface Temperature (MODIS LST)": "modis/final/LST_ESTIMATE_SODANKYLA_SINUSOIDAL.nc",
    "Solar Induced Chlorophyll Fluorescence (OCO-2 SIF)": "sif/oco2/Sodankyla_SIF_OCO2_final.nc4",
    "Vegetation Optical Depth (AMSR-2 VOD)": "amsr2/final/AMSR2_so.nc"
} 
# fix the same bbox and geometry for all items in the region
bbox = [18.00, 65.00, 32.00, 69.00]
geometry = mapping(box(*bbox))
# some attributes extracted from xarray are not JSON serialisable and have to be cast to other types
def convert_to_json_serialisable(attrs):
    attrs = attrs.copy()
    for attr, value in attrs.items():
        if isinstance(value, np.ndarray):
            attrs[attr] = value.tolist()
        elif isinstance(value, np.integer):
            attrs[attr] = int(value)
    return attrs
# for each dataset create an item
for dataset_name, dataset_filepath in data_files.items():

    # 1. open the netcdf file
    ds = xr.open_dataset(root_url + dataset_filepath + '#mode=bytes')

    # 2. extract the start time of the dataset
    if 'time' in ds.coords:
        # cast to second precision; .item() then yields a native datetime object
        start_time = ds['time'][0].values.astype('datetime64[s]').item()
    elif 'yymmddHH' in ds.variables:
        string_date = '-'.join(ds['yymmddHH'][0].values.astype(str)[:3])
        start_time = datetime.strptime(string_date, '%Y-%m-%d')
    else:
        string_date = '-'.join(ds['yymmddHHMMSS'][0].values.astype(int).astype(str)[:3])
        start_time = datetime.strptime(string_date, '%Y-%m-%d')

    # 3. Create a STAC item with the extracted properties
    item = Item(
        id=f"{collection.id}-{dataset_name.lower().replace(' ', '_')}",
        geometry=geometry,
        datetime=start_time,
        bbox=bbox,
        properties= {
            "license": ds.attrs['license'],
            "description": f'Dataset with variables related to {dataset_name}.',
        }
    )

    # the JRC-TIP file carries the full license text in its attributes;
    # replace it with the short licence identifier to keep the metadata compact
    if len(item.properties['license']) > 20:
        item.properties['license'] = 'TIP-FAPAR-1.7'

    # 4. add an asset (the actual link to the file)
    filename = dataset_filepath.split('/')[-1]
    item.add_asset(
        key=f'Dataset with variables related to {dataset_name}.',  # title can be arbitrary
        asset=pystac.Asset(
            href=f'/d/{collectionid}/{filename}',
            media_type="application/x-netcdf",
            roles=["data"],
        )
    )

    # 5. Extract variable information
    for v in ds.variables:
        item.properties[f"variable_{v}"] = convert_to_json_serialisable(ds.variables[v].attrs)

    item.validate()

    # 6. Add the item to the collection
    collection.add_item(item)
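The time handling above hinges on turning numpy `datetime64` values from the xarray time coordinate into native Python `datetime` objects for the Item's `datetime` field. A minimal standalone illustration of that conversion:

```python
import numpy as np

# a numpy datetime64 value, as typically found in an xarray time coordinate
t = np.datetime64('2011-01-05T12:30:00')

# cast to second precision; .item() then yields a native datetime.datetime
dt = t.astype('datetime64[s]').item()
print(type(dt).__name__, dt.isoformat())  # datetime 2011-01-05T12:30:00
```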
    

2.2 Create STAC Item from In situ Dataset

# add a single item with all the in-situ data, since it comes in a single .tgz file
item = Item(
    id=f"{collection.id}-insitu_package",
    geometry=geometry,
    datetime=start_time,  # reuse the start time extracted from the satellite datasets
    bbox=bbox,
    properties={
        "license": "CC-BY-4.0",
        "description": 'In-situ package with FloX, VOD, and miscellaneous field datasets related to the TCCAS project.',
    }
)

# add an asset (the actual link to the file)
item.add_asset(
    key='In-situ package',  # title can be arbitrary
    asset=pystac.Asset(
        href=f'/d/{collectionid}/sodankyla-insitu-package.tgz',
        media_type="application/tar+gzip",
        roles=["data"],
    )
)

item.validate()
collection.add_item(item)
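Each asset declares a media type matching its file format (`application/x-netcdf`, `application/tar+gzip`, ...). When assets are added in bulk, a small extension lookup (a hypothetical helper; the media-type values are the ones used in this notebook) avoids hard-coding the type each time:

```python
from pathlib import Path

# map file extensions to the media types used in this tutorial
MEDIA_TYPES = {
    '.nc': 'application/x-netcdf',
    '.nc4': 'application/x-netcdf',
    '.tgz': 'application/tar+gzip',
    '.pdf': 'application/pdf',
}

def guess_media_type(filename: str) -> str:
    """Return a media type for a filename, defaulting to a generic byte stream."""
    return MEDIA_TYPES.get(Path(filename).suffix.lower(), 'application/octet-stream')

print(guess_media_type('sodankyla-insitu-package.tgz'))  # application/tar+gzip
```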

2.3 Create STAC Item from Model based Dataset

# add an item with multiple model forcing assets
item = Item(
    id=f"{collectionid}-model_forcing",
    geometry=geometry,
    datetime=start_time,
    bbox=bbox,
    properties= {
        "license": "CC-BY-4.0",
        "description": ' Regional and Site-level model forcing Data Sets for Sodankylä and Lapland region, part of the TCCAS project.',
    }
)


# add the model forcing assets (the actual links to the files)
forcing_files = {
    'static-site-level': 'FI-Sod_staticforcing.nc',
    'time-dependent (ERA5) - site level': 'FI-Sod_dynforcing-era5_20090101-20211231_with-lwdown.nc',
    'time-dependent (in-situ) - site level': 'FI-Sod_dynforcing-insitu_20090101-20211231_with-insitu-lwdown.nc',
    'static-regional': 'sodankyla-region_cgls-pft-crops-redistributed_staticforcing.nc',
    'time-dependent (ERA5) - regional': 'sodankyla-region_dynforcing_era5_2009-2021.nc',
}
for asset_key, filename in forcing_files.items():
    item.add_asset(
        key=asset_key,  # title can be arbitrary
        asset=pystac.Asset(
            href=f'/d/{collectionid}/{filename}',
            media_type="application/x-netcdf",
            roles=["data"],
        )
    )

item.validate()
collection.add_item(item)

2.4 Create STAC Item for the documentation and add to the Collection

# add all the documentation under a single item
item = Item(
    id=f"{collectionid}-documentation",
    geometry=geometry,
    datetime=start_time,
    bbox=bbox,
    properties= {
        "license": "CC-BY-4.0",
        "description": 'Documentation for the TCCAS project datasets.',
    }
)

item.add_asset(
    key='TCCAS user manual',  # title can be arbitrary
    asset=pystac.Asset(
        href=f'/d/{collectionid}/TCCAS_manual.pdf',
        media_type="application/pdf",
        roles=["documentation"],
    )
)

item.add_asset(
    key="Satellite Data Uncertainty analysis Scientific Report",  # title can be arbitrary
    asset=pystac.Asset(
        href=f'/d/{collectionid}/D7.pdf',
        media_type="application/pdf",
        roles=["documentation"],
    )
)

item.add_asset(
    key="Campaign Data User Manual",  # title can be arbitrary
    asset=pystac.Asset(
        href=f'/d/{collectionid}/D11_CDUM-all_sites.pdf',
        media_type="application/pdf",
        roles=["documentation"],
    )
)

item.validate()
collection.add_item(item)

3. Save the metadata as a self-contained collection

# save the full self-contained collection
collection.normalize_and_save(
    root_href='./data/example_catalog/',
    catalog_type=pystac.CatalogType.SELF_CONTAINED
)
collection

Acknowledgments

We gratefully acknowledge the Terrestrial Carbon Community Assimilation System (TCCAS) team for providing access to the data used in this example, as well as support in creating it.