EarthCODE: Data Discovery and Access#

EarthCODE provides a structured way to explore and access scientific datasets, metadata, and dependencies. This notebook walks through the key aspects of programmatic data discovery and access using the earthcode Python library.
The examples below use live metadata records from the Open Science Catalog. The Open Science Catalog (OSC) is a key component of the ESA EO Open Science framework and a publicly available web-based application designed to provide easy access to scientific datasets, geoscience products, and scientific resources developed under ESA-funded Earth Observation (EO) research projects. Using the search functionality, you can find relevant products, variables, EO missions, and projects by filtering on the metadata as well as on geographical constraints.

The search accepts natural language queries directly: you can pass it an abstract, an idea or an excerpt from your research to see what is relevant to you in the Open Science Catalog.

If you are a scientist publishing to the EarthCODE OSC you can get relevant suggestions for variables, themes or missions for your dataset.

Search for Products in the Open Science Catalog#

1. Search with a natural language query#

There are different ways of searching for data; the simplest is a natural language query. The results are returned as a list of pystac Collection objects (for products) or Catalog objects (for the other record types).

Run the cell below to discover the products that describe the concept you are looking for. In this case it will be forest fires. You can replace this with any concept / geophysical variable / region you would like to look at.

Note: In the current version of the earthcode library, the first search you run after installation prefetches a mini-LLM model, which should take approximately 1 minute.

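# Assumed import path; adjust it if your installed earthcode version exposes search elsewhere.
from earthcode.search import search
res = search("forest fires")[0]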
res
<Collection id=seasfire-cube>
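
Indexing the result with [0], as in the cell above, would raise an IndexError if a query ever returned an empty list. The following is a minimal sketch of a guard, assuming only that the search returns a list-like sequence. The helper name first_or_none is hypothetical and not part of the earthcode library, and plain strings stand in for the returned objects so the sketch runs offline.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(results: Sequence[T]) -> Optional[T]:
    """Return the top-ranked result, or None when the search found nothing."""
    return results[0] if results else None


# Plain lists stand in for search results (hypothetical data).
top_hit = first_or_none(["seasfire-cube", "another-product"])
no_hit = first_or_none([])
print(top_hit)  # seasfire-cube
print(no_hit)   # None
```

In the notebook you could wrap a real query the same way, for example first_or_none(search("forest fires")), and check for None before using the result.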

If you would like to access the collection via the Open Science Catalog web app, you can follow the link in the extra_fields attribute as in the example below:

res.extra_fields["osc_url"] 
'https://opensciencedata.esa.int/stac-browser/#/products/seasfire-cube/collection.json'

2. Search with a known collection id#

If you already use the Open Science Catalog and know how its entries are structured, you can retrieve a specific Product by its id. Pass the id as published in the Open Science Catalog Metadata Repository.
This keeps results tight and avoids scanning the whole catalog. You can pass either a single id as a str or multiple ids as a List[str]. In both cases the result is a list.

For example, here we’ve taken the id of the search result above to fetch it directly by its id instead of a semantic search:

search(collection_ids="seasfire-cube")[0]
<Collection id=seasfire-cube>
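
Because the result is a list even for a single id, it can be convenient to key the results by id when fetching several products at once. This is a small sketch of that idea. index_by_id is a hypothetical helper, and SimpleNamespace objects stand in for the pystac objects, of which only the id attribute is used.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def index_by_id(results: Iterable[Any]) -> Dict[str, Any]:
    """Map each result's id to the result object for direct lookup."""
    return {item.id: item for item in results}


# Stand-ins for pystac objects; only the `id` attribute matters here.
fake_results = [
    SimpleNamespace(id="seasfire-cube"),
    SimpleNamespace(id="snow-cover-alpglacier"),
]
by_id = index_by_id(fake_results)
print(sorted(by_id))  # ['seasfire-cube', 'snow-cover-alpglacier']
```

With the real library, the list passed in would come from a call such as search(collection_ids=["seasfire-cube", "snow-cover-alpglacier"]).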

3. Search by spatial extent#

You can also use bbox=[minx, miny, maxx, maxy] to limit the results by geography. By default, the bbox only needs to intersect the product footprint, so even a small local bbox will always intersect global datasets. In the example below, we search for snow data over the Alps.

Additionally, we can cap the number of results returned using the limit parameter, as in the example below.

Note that the result is a global dataset which intersects with our bbox for the Alps.

alps_bbox = [5.95591129, 45.81799493, 10.49229402, 47.80846475]
search("snow data", limit=1, bbox=alps_bbox)[0]
<Collection id=binary-wet-snow-s14science-snow>

If instead you need the whole footprint to sit inside the bbox (containment instead of overlap) use intersects=False.

For example, in the cell below we run the same query again but with intersects=False and get results that are only within the Alps (excluding global ones and others that overlap).

search("snow data", limit=1, bbox=alps_bbox, intersects=False)[0]
<Collection id=snow-cover-alpglacier>
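
The difference between overlap and containment can be illustrated with plain bounding-box arithmetic. This is only a sketch of the geometric idea, not the library's implementation. It ignores boxes that cross the antimeridian, and the two footprints are hypothetical values chosen for illustration.

```python
from typing import Sequence

BBox = Sequence[float]  # [minx, miny, maxx, maxy]


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True when the two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bbox_contains(outer: BBox, inner: BBox) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


alps_bbox = [5.95591129, 45.81799493, 10.49229402, 47.80846475]
global_footprint = [-180.0, -90.0, 180.0, 90.0]  # hypothetical global product
local_footprint = [7.0, 46.0, 8.0, 47.0]         # hypothetical Alpine product

print(bbox_intersects(alps_bbox, global_footprint))  # True
print(bbox_contains(alps_bbox, global_footprint))    # False
print(bbox_contains(alps_bbox, local_footprint))     # True
```

The global footprint overlaps the Alps bbox but is not contained in it, which matches why the global dataset disappears from the results once intersects=False is set.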

Search for Variables#

This section shows how to search and discover the Variables available in the Open Science Catalog. This is useful if you want to contribute and publish your product or workflow but do not know which variables to choose from a long list of entries. Variables in the Open Science Catalog are geoscience, climate and environmental variables that describe specific datasets and workflows/experiments. The full list of variables can be explored at: https://opensciencedata.esa.int/variables/catalog

Searching variables works in the same way as searching products, except that there are no bbox or eo-missions filters.

1. Search with a natural language query#

You can search for terms directly using natural language as in the example below:

chlorophyll = search("chlorophyll", type="variables", limit=2)[0]
chlorophyll
<Catalog id=chlorophyll-concentration>

2. Find appropriate variable for your product based on its description#

You can use the variable search to find which existing variables best fit the dataset you are publishing: feed the description of your product to the search, and avoid creating new variables that may already exist in the catalog.
For example, the search below looks for variables that might fit this product: https://opensciencedata.esa.int/products/ice-sheet-velocity-antarctic-2021/collection.

Example product description:
‘This dataset contains monthly gridded ice velocity maps of the Antarctic Ice Sheet derived from Sentinel-1 data acquired between 2021-01-01 and 2021-12-31. It was generated by ENVEO, as part of the ESA Antarctic Ice Sheet Climate Change Initiative project (Antarctic_Ice_Sheet_cci). The surface velocity is derived by applying feature tracking techniques using Sentinel-1 synthetic aperture radar (SAR) data acquired in the Interferometric Wide (IW) swath mode. Ice velocity is provided at 200m grid spacing in Polar Stereographic projection (EPSG: 3031). The horizontal velocity components are provided in true meters per day, towards easting and northing direction of the grid. The vertical displacement is derived from a digital elevation model. Provided is a NetCDF file with the velocity components: vx, vy, vz, along with maps showing the magnitude of the horizontal components, the valid pixel count and uncertainty. The product combines all ice velocity maps, based on 6- and 12-day repeats, acquired within a single month in a monthly averaged product.’

product_description = search(collection_ids="ice-sheet-velocity-antarctic-2021")[0].description

result = search(product_description, type="variables", limit=5)

[r.title for r in result]
['Ice sheet topography',
 'Ice Temperature',
 'Ice Velocity',
 'Sea Ice Age',
 'Glacier motion']

If your product is not published yet, you can still find an appropriate variable by passing its description to the search function as a plain string. Replace the empty placeholder below with the description of your product and search the catalog. The output shown below corresponds to the empty placeholder, so expect different results for a real description.

product_description_txt = '' # Paste description of your dataset here
result = search(product_description_txt, type="variables", limit=5)

[r.title for r in result]
['(13)CH4 delta',
 '(13)CO Delta',
 '(13)CO2 Delta',
 '(14)CH4 Delta',
 '(14)CO Number Concentration']
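
With the empty placeholder, the search still returns results, which can be mistaken for real suggestions. A minimal sketch of a guard that rejects a blank description before searching is given here. The suggest wrapper is hypothetical, the search function is passed in as search_fn, and fake_search is an offline stand-in that only mimics the type and limit keyword arguments used in this notebook.

```python
from typing import Any, Callable, List


def suggest(description: str, search_fn: Callable[..., List[Any]],
            kind: str = "variables", limit: int = 5) -> List[Any]:
    """Run a description-based search, refusing blank descriptions."""
    if not description.strip():
        raise ValueError("Paste a product description before searching.")
    return search_fn(description, type=kind, limit=limit)


def fake_search(query: str, type: str, limit: int) -> List[str]:
    """Offline stand-in for the notebook's search function."""
    return [f"{type}-match-{i}" for i in range(limit)]


print(suggest("monthly ice velocity maps", fake_search, limit=2))
try:
    suggest("   ", fake_search)
except ValueError as err:
    print(f"Rejected: {err}")
```

In the notebook, passing the real search function as search_fn would apply the same check to live queries.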

Search for EO-Mission#

The Open Science Catalog also provides a list of the Earth Observation satellite missions that contribute to the published scientific research. Each Product and Dataset published in the Open Science Catalog must include a description and a link to the EO Mission on which its results are based. Because some products are based on in-situ observations or numerical models, these sources are also included in this category.
To discover the available EO Missions, use the code cells below to filter for the appropriate one.

1. Search with a natural language query#

You can search for terms directly using natural language as in the example below:

search("sentinel-1", type="eo-missions", limit=2)[0]
<Catalog id=sentinel-1>

2. Find eo-mission based on your product description#

Finding a mission of interest works in the same way as for variables. Here we reuse the same example product to retrieve candidate missions to associate with your product.

product_description = search(collection_ids="ice-sheet-velocity-antarctic-2021")[0].description

result = search(product_description, type="eo-missions", limit=5)

[r.title for r in result]
['ICESat', 'ICESat-2', 'CryoSat', 'Sentinel-1', 'CALIPSO']

If your product is not published yet, you can still find an appropriate eo-mission by passing its description to the search function as a plain string. Replace the empty placeholder below with the description of your product and search the catalog. The output shown below corresponds to the empty placeholder, so expect different results for a real description.

product_description_txt = '' # Paste description of your dataset here
result = search(product_description_txt, type="eo-missions", limit=5)

[r.title for r in result]
['AEOLUS', 'ALOS', 'ALOS-2', 'ALOS-4', 'ALtiKa']
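
Since variable and mission suggestions use the same description, the two lookups can be gathered in one step. This sketch assumes only the call pattern shown in this notebook (a query string plus the type and limit keyword arguments). suggest_metadata is a hypothetical helper, and fake_search returns canned titles so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def suggest_metadata(description: str, search_fn: Callable[..., List[Any]],
                     limit: int = 5) -> Dict[str, List[str]]:
    """Collect suggested variable and EO mission titles for one description."""
    suggestions: Dict[str, List[str]] = {}
    for kind in ("variables", "eo-missions"):
        results = search_fn(description, type=kind, limit=limit)
        suggestions[kind] = [r.title for r in results]
    return suggestions


def fake_search(query: str, type: str, limit: int) -> List[Any]:
    """Offline stand-in returning canned titles (hypothetical data)."""
    canned = {
        "variables": ["Ice Velocity", "Glacier motion"],
        "eo-missions": ["Sentinel-1", "CryoSat"],
    }
    return [SimpleNamespace(title=t) for t in canned[type][:limit]]


print(suggest_metadata("monthly ice velocity maps", fake_search, limit=1))
# {'variables': ['Ice Velocity'], 'eo-missions': ['Sentinel-1']}
```

The suggestions are ranked candidates rather than definitive assignments, so it is still worth reviewing them against your product before publishing.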

Filtering Products by Variables, EO Missions and Themes#

Once you’ve identified a Variable record (for example, chlorophyll from the Variables section), you can use its id to filter products to only those that explicitly declare that variable in their metadata. This is useful when you want results that are semantically relevant and that model a particular variable of interest.

1. Filter Products by Variable id#

In the example below, we search for global chlorophyll datasets and constrain results to products that include the chlorophyll variable:

search("global chlorophyll dataset", variable=chlorophyll.id, type="products")
[<Collection id=monthly-global-chlorophyll-a-dataset-9-km-oc-cci-v42>,
 <Collection id=global-particulate-organic-carbon-v5-bicep>,
 <Collection id=l4-chla-eo4sibs>,
 <Collection id=l3-chla-daily-eo4sibs>,
 <Collection id=4dmed-3d-prim-prod-150>,
 <Collection id=high-res-phytoplankton-chl-a-data-at-discrete-light-stations-polarstern-cruise-ps113>,
 <Collection id=phytoplankton-group-chl-ll-using-the-mean-dac-over-the-first-optical-depth-kdmean-of-the-measured-radiometric-profile>,
 <Collection id=coastal-carbon-mapper-reflectance-water-quality-data>,
 <Collection id=4dmed-t-s-geo-a-150>,
 <Collection id=secchi-disk-depth-lpf-physioglob>]

2. Filter Products by EO-Missions and Variable id#

You can also filter by eo-missions (e.g. sentinel-1) to find products that are based on that mission. The filters are additive, so you can tighten the query above further by restricting it to a specific EO mission. This is particularly helpful when multiple missions produce similar variables (e.g., ocean colour chlorophyll from different instruments), but you only want products traceable to a given platform.

search("global chlorophyll dataset", variable=chlorophyll.id, type="products", mission="sentinel-3")
[<Collection id=monthly-global-chlorophyll-a-dataset-9-km-oc-cci-v42>,
 <Collection id=global-particulate-organic-carbon-v5-bicep>,
 <Collection id=l4-chla-eo4sibs>,
 <Collection id=l3-chla-daily-eo4sibs>]
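
Because the filters are additive, one convenient pattern is to collect only the filters you actually want and unpack them into the call. The sketch below uses the keyword argument names that appear in this notebook (variable, mission, keyword, theme). build_filters itself is a hypothetical helper, not part of the earthcode library.

```python
from typing import Dict, Optional


def build_filters(variable: Optional[str] = None, mission: Optional[str] = None,
                  keyword: Optional[str] = None,
                  theme: Optional[str] = None) -> Dict[str, str]:
    """Keep only the filters that were actually set."""
    candidates = {
        "variable": variable,
        "mission": mission,
        "keyword": keyword,
        "theme": theme,
    }
    return {name: value for name, value in candidates.items() if value is not None}


filters = build_filters(variable="chlorophyll-concentration", mission="sentinel-3")
print(filters)
# {'variable': 'chlorophyll-concentration', 'mission': 'sentinel-3'}

# In the notebook, the dictionary could then be unpacked into the query:
# search("global chlorophyll dataset", type="products", **filters)
```

Adding a further constraint then only means passing one more argument to build_filters, while unset filters are left out of the call entirely.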

3. Filter Products by EO-Missions, Variable id and Keyword(s)#

You can also add a free-text keyword filter, which finds all literal occurrences of the word or phrase you pass in the title, description or keyword entries of a product. Here we narrow to results that mention “daily”, while still keeping the variable and mission constraints:

search("global chlorophyll dataset", variable=chlorophyll.id, type="products", mission="sentinel-3", keyword="daily")
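
To make the idea of a literal keyword match concrete, here is a rough sketch of such a check over a record's title, description and keywords. It is not the library's implementation. Case-insensitive matching is an assumption, and the records are hypothetical dictionaries rather than catalog entries.

```python
from typing import Any, Dict, List


def mentions_keyword(record: Dict[str, Any], keyword: str) -> bool:
    """True when `keyword` occurs literally in title, description or keywords."""
    needle = keyword.lower()  # assumption: matching ignores letter case
    fields: List[str] = [record.get("title", ""), record.get("description", "")]
    fields.extend(record.get("keywords", []))
    return any(needle in field.lower() for field in fields)


# Hypothetical records for illustration only.
records = [
    {"id": "daily-product", "title": "Daily chlorophyll maps", "keywords": ["ocean colour"]},
    {"id": "monthly-product", "title": "Monthly chlorophyll maps", "description": "9 km grid"},
]
print([r["id"] for r in records if mentions_keyword(r, "daily")])  # ['daily-product']
```

Unlike the natural language query, a literal filter of this kind does not match synonyms, so a record that only says "every day" would be missed.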

You can also refine a variable search by applying the theme and keyword filters. This is especially useful when the catalog contains many similar parameters and you want to narrow results to a particular scientific domain or concept.

4. Filter Products by Keyword#

results = search(keyword="seasonal fire modeling", type="products", limit=5)
ids = {r.id for r in results}

print(ids)
{'seasfire-cube'}

5. Filter Variables by Free text, keyword and theme#

For example, the query below searches for a variable related to jet streams, but restricts results to the oceans theme and further filters by the keyword cyclone:

search("find me a variable that talks about jet streams", theme="oceans", type="variables", keyword="cyclone")[0].title
'East Atlantic Jet Pattern'