Introduction¶
This notebook shows the core steps EarthCODE users need to follow to upload their research outcomes to the ESA Project Results Repository (PRR). It focuses on generating metadata for a project with a single netCDF file.
The PRR provides access to data, workflows, experiments and documentation from ESA projects, organised into Collections and accessible via the STAC API. Each Collection contains STAC Items, with their related Assets stored within the PRR storage. Scientists and commercial companies can access the PRR via the EarthCODE and APEx projects.
The STAC Specification provides a detailed explanation of and more information on this metadata format.
In order to upload data to the ESA Project Results Repository (PRR) you have to generate a STAC Collection associated with your files. The STAC Collection provides metadata about your files and makes them searchable and machine-readable. The metadata generation process is organised in four steps:
- Generate a root STAC Collection
- Group your dataset files into STAC Items and STAC Assets
- Add the Items to the Collection
- Save the normalised Collection
The easiest way to generate all the required files is to use a STAC library, such as pystac or rio-stac. The library takes care of creating the links and formatting the files correctly. The examples below use pystac.
Have a look at the steps below and learn how to prepare your dataset to generate a valid STAC Collection. You will find all the steps described in the markdown cells, together with executable example code to make this process easier. Please adjust the information in the fields required to describe your Collection and Items according to the comments starting with "#".
NOTE: Depending on the information that you put into the Items or Assets, the code may raise an error about an object not being JSON-serialisable. If this happens, you have to transform the problem field into an object that can be described using standard JSON, for example by converting a numpy array into a list.
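As a sketch of that fix (the attribute names below are made up for illustration, not taken from the 4DATLANTIC dataset), numpy arrays and numpy scalars can be converted to plain Python objects before serialisation:

```python
import json
import numpy as np

# Hypothetical metadata as it might come out of a netCDF file's attributes
attrs = {"valid_range": np.array([0.0, 1.0]), "fill_value": np.float32(-9999.0)}

def to_json_safe(value):
    """Convert common numpy types to plain Python equivalents."""
    if isinstance(value, np.ndarray):
        return value.tolist()      # arrays become lists
    if isinstance(value, np.generic):
        return value.item()        # numpy scalars (e.g. np.float32) become Python scalars
    return value

safe_attrs = {k: to_json_safe(v) for k, v in attrs.items()}
print(json.dumps(safe_attrs))  # now serialises without a TypeError
```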
🌊 Example: 4DATLANTIC-OHC Project¶
The code below demonstrates how to perform the necessary steps using real data from the ESA Regional Initiative Project 4DATLANTIC-OHC. The project focuses on ocean heat content and provides monthly gridded Atlantic Ocean heat content change as well as OHC trends and their uncertainties.
🔗 Learn more about the project here: 4DATLANTIC-OHC – EO4Society
🔗 Check the project website: 4DATLANTIC-OHC – Website
This example is intended to help you understand the workflow and apply similar steps to your own Earth observation data analysis.
Import necessary Python libraries¶
You can create an example conda/miniconda environment to run the code below using:
conda create -n prr_stack_example pystac xarray shapely
conda activate prr_stack_example
# import libraries
from pystac import Collection
import pystac
import xarray as xr
import shapely
import json
from datetime import datetime
1. Generate a root STAC collection¶
The root STAC Collection provides a general description of the entire dataset that you would like to store in the ESA PRR. The STAC Specification defines a Collection as an extension of the STAC Catalog with additional information, such as the extents, license, keywords and providers, that describes the STAC Items that fall within the Collection.
In short: it behaves as the container to store the various Items that build up your dataset.
A STAC Collection has some required fields that you need to provide in order to build a valid description. Most of these metadata fields should be extracted from your data. Please have a look at the example below.
{
    "type": "Collection",     # Do not change
    "id": "",                 # Add a unique variation of project name + dataset name
    "stac_version": "1.1.0",  # Do not change
    "title": "",              # Meaningful title of your dataset
    "description": "",        # General description of your dataset
    "extent": {
        "spatial": {
            "bbox": [
                [-180.0, -90.0, 180.0, 90.0]
            ]
        },  # Spatial extent of your dataset. If you have multiple data files, take the minimum bounding box that covers all.
        "temporal": {
            "interval": [
                ["1982-01-01T00:00:00Z", "2022-12-31T23:59:59Z"]
            ]  # Temporal extent of your dataset. If you have multiple data files, take the minimum temporal range that covers all.
        }
    },
    "license": "",  # The license that applies to the entire dataset
    "links": []     # Do not change
}
Example | Create Collection¶
# define collection id, since it will be reused
collectionid = "4datlantic-ohc"

# create the root collection using pystac.Collection
collection = Collection.from_dict(
    {
        "type": "Collection",
        "id": collectionid,
        "stac_version": "1.1.0",
        "title": "Atlantic Ocean heat content change",
        "description": "Given the major role of the Atlantic Ocean in the climate system, it is essential to characterize the temporal and spatial variations of its heat content. The OHC product results from the space geodetic approach, also called the altimetry-gravimetry approach. This dataset contains variables as 3D grids of ocean heat content anomalies at 1x1 resolution and monthly time step. Error variance-covariance matrices of OHC at regional scale and annual resolution are also provided. See the Experimental Dataset Description for details: https://www.aviso.altimetry.fr/fileadmin/documents/data/tools/OHC-EEI/OHCATL-DT-035-MAG_EDD_V2.0.pdf. Version V2-0 of the dataset was published in 2022 by the Centre National d’Etudes Spatiales. This dataset has been produced within the framework of the 4DAtlantic-Ocean heat content Project funded by ESA.",
        "extent": {
            "spatial": {
                "bbox": [
                    [-100, -90, 25, 90]
                ]
            },
            "temporal": {
                "interval": [
                    ["2002-04-15T18:07:12Z", "2023-09-01T18:59:59Z"]
                ]
            }
        },
        "license": "Aviso License",
        "links": []
    }
)
collection
2. Group your dataset files into STAC Items and STAC Assets¶
The second step is to describe the different files as Items and Assets. This is the most time-consuming step. There are multiple strategies for doing this and it is up to you to decide how to do it. The main consideration should be usability of the data.
For example:
Microsoft Planetary Computer groups its Sentinel-2 data into Items that represent individual regions, and each Item has 13 Assets, one per band: https://stacindex.org/catalogs/microsoft-pc#/43bjKKcJQfxYaT1ir3Ep6uENfjEoQrjkzhd2?cp=1&t=5. The California Forest Observatory (on Google Earth Engine) groups its data into Items, where each Item represents a specific year, data type and resolution for the whole study area. Each Item has only one Asset (dataset) associated with it: https://stacindex.org/catalogs/forest-observatory#/4dGsSbK8F5jjmhRZYE6kjUMmgWCUKe6J2qqw?t=2. A more complex example with real data from an ESA-funded project: the ESA Project Results Repository gives researchers flexibility in how their datasets are grouped into Items and Assets. Keep in mind that the more Items your Collection has, the slower browsing becomes for users of the publicly open STAC Browser. Please have a look at one example, the Sentinel-3 AMPLI Ice Sheet Elevation Collection, with around 400 Items complemented by around 360 Assets each: https://eoresults.esa.int/browser/#/external/eoresults.esa.int/stac/collections/sentinel3-ampli-ice-sheet-elevation. More general examples of creating STAC catalogs are available at https://github.com/stac-utils/pystac/tree/main/docs/tutorials.
The easiest way to generate the required STAC Items is to copy over the metadata directly from your files.
Example | Open Dataset¶
import urllib.request
# Download the dataset locally
urllib.request.urlretrieve('https://data.aviso.altimetry.fr/aviso-gateway/data/indicators/OHC_EEI/4DAtlantic_OHC/OHC_4DATLANTIC_200204_202212_V2-0.nc', 'OHC_4DATLANTIC_200204_202212_V2-0.nc')
('OHC_4DATLANTIC_200204_202212_V2-0.nc',
<http.client.HTTPMessage at 0x17f770190>)
# open dataset
# define relative filepath within the folder structure you want to upload to the PRRs
filepath = 'OHC_4DATLANTIC_200204_202212_V2-0.nc'
ds = xr.open_dataset(filepath)
ds
# helper function to convert numpy arrays to lists
import numpy as np

def convert_to_json_serialisable(attrs):
    attrs = attrs.copy()
    for attr in attrs.keys():
        if isinstance(attrs[attr], np.ndarray):
            attrs[attr] = attrs[attr].tolist()
    return attrs
Example | Create valid STAC Item from your product (nc)¶
# Describe the first file (Item)

# 1. extract the spatial extent from the .nc file
bbox = [
    ds['longitude'].values.min(),
    ds['latitude'].values.min(),
    ds['longitude'].values.max(),
    ds['latitude'].values.max(),
]
geometry = json.loads(json.dumps(shapely.box(*bbox).__geo_interface__))

# 2. extract additional information (properties) from the .nc file and create the STAC Item
item = pystac.Item(
    id=collectionid + 'v2',
    geometry=geometry,
    datetime=datetime.strptime('2025-02-05', '%Y-%m-%d'),
    bbox=bbox,
    properties={
        "history": ds.attrs['history'],
        "source": ds.attrs['source'],
        "comment": ds.attrs['comment'],
        "references": ds.attrs['references'],
        "summary": ds.attrs['summary'],
        "version": ds.attrs['version'],
        "conventions": ds.attrs['Conventions'],
    }  # Please note that this field is not mandatory in the STAC specification
       # and depends on the information provided within your original dataset.
       # You are encouraged to provide information as complete as possible here, to give
       # your product rich metadata that facilitates discoverability and usability.
)

# 3. Extract variable properties at the Item level, since there is only one file
stac_property_prefix = 'variable_'
item.properties[f"{stac_property_prefix}ohc"] = convert_to_json_serialisable(ds.variables['ohc'].attrs)
item.properties[f"{stac_property_prefix}ohc_var_covar_matrix_local"] = convert_to_json_serialisable(ds.variables['ohc_var_covar_matrix_local'].attrs)
item.properties[f"{stac_property_prefix}ohc_mask"] = convert_to_json_serialisable(ds.variables['ohc_mask'].attrs)

# 4. Add the asset
item.add_asset(
    key='OHC Atlantic Dataset',  # the asset key can be arbitrary
    asset=pystac.Asset(
        href=f'/d/{collectionid}/{filepath}',  # keep the /d/ reference
        media_type="application/x-netcdf",
        roles=["data"],
    )
)  # Please note that an Asset can describe a satellite image band as well as a single
   # nc or tiff file, depending on the original data structure.

item  # Preview created Item
3. Add the STAC Item to the STAC Collection¶
Adding the Items to the Collection is a single function call when using a library such as pystac.
collection.add_item(item)
4. Save the Collection¶
Again, this step is a single function call.
collection.normalize_and_save(
root_href='example_4datlantic/', # path to the self-contained folder with STAC Collection
catalog_type=pystac.CatalogType.SELF_CONTAINED
)
collection
Congratulations, you have created your first STAC Collection.¶
Now your results are ready to be ingested into the ESA PRR. To request data storage in the ESA PRR, contact the EarthCODE team at earth-code@esa.int and provide the following information:
- your project name
- the total size of your dataset
- a link to the STAC Collection you created, together with its associated Items (e.g. the entire example_4datlantic folder); this can be provided as a .zip file or as a link to an online repository / public GitHub repository
- links to the datasets (access links to the final outcomes of the project or assets)
- any restrictions related to accessing your dataset
- in the email, do not forget to CC your ESA TO to acknowledge that the dataset will be imported into the PRR.
Once the email is received, the EarthCODE team will make a request to publish your product into the PRR on your behalf (in the future, self-ingestion will be supported).
Once the collection is imported, you will receive a dedicated URL for your products, which you can use to create a record in the Open Science Data Catalogue to make your data discoverable and/or to request a DOI for your dataset (at the moment this has to be done via an external service of your choice).
Acknowledgments¶
We gratefully acknowledge the 4DATLANTIC-OHC project team for providing access to the data used in this example.