Accessing Lightning2EarthCARE data collections

ESA

This notebook demonstrates how to access and query Lightning2EarthCARE data collections stored in object storage. It includes examples for locating and loading data from different collections, using both GeoPandas and DuckDB.

The project provides three related datasets describing lightning activity in the context of EarthCARE observations:

1. EarthCARE-frame lightning – lightning groups collocated with individual EarthCARE MSI-like frames, including activity within the frame and a surrounding 0.5° box within ±1 hour of overpass time.

2. EarthCARE along-track lightning counts – lightning statistics referenced to EarthCARE CPR samples along the nadir track, based on counts within defined spatial and temporal windows.

3. EarthCARE lightning storm catalogue – lightning clusters sampled along the EarthCARE nadir track by CPR and ATLID, with both cluster-level properties and time-evolving lightning activity around the overpass.

All datasets are distributed as Parquet files, with MTG-LI and GOES-GLM observations provided separately where relevant.

# Imports and storage configuration
import geopandas as gpd
import pandas as pd
import duckdb

# setup bucket access
bucket = 's3://EarthCODE/'
endpoint_url = "https://s3.waw4-1.cloudferro.com"
region_name = "eu-west-2"
prefix = 'OSCAssets/storm-data/'

EarthCARE-frame lightning collection

This collection is stored in monthly parquet files.
Files follow the naming convention:

EC_lightning_<SOURCE>_<YEAR>_<MONTH>.parquet

where:

  • <SOURCE> is the lightning source (GLM or LI)

  • <YEAR> is the calendar year

  • <MONTH> is the calendar month
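As a minimal sketch of this naming convention, the hypothetical helper below assembles a file name from its three parts. It assumes the month is written without zero-padding, as in the file name EC_lightning_GLM_2026_1.parquet used later in this notebook; `frame_file_name` and `VALID_SOURCES` are illustrative names and not part of any library.

```python
# Hypothetical helper (not part of the dataset tooling): build a monthly
# file name following EC_lightning_<SOURCE>_<YEAR>_<MONTH>.parquet.
VALID_SOURCES = ("GLM", "LI")


def frame_file_name(source: str, year: int, month: int) -> str:
    """Return the monthly file name for a lightning source, year and month."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}, got {source!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumption: the month is not zero-padded (e.g. "_2026_1", not "_2026_01").
    return f"EC_lightning_{source}_{year}_{month}.parquet"


print(frame_file_name("GLM", 2026, 1))
```

Whether a given combination of source, year, and month actually exists in the bucket is not checked here; the mapping table described below is the reliable way to find out.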

These files can be accessed in two ways:

  1. Direct file access
    If you already know which source, year, and month you want, you can load the corresponding parquet file directly.

  2. Mapping-based access
    If you want data for a specific EarthCARE frame, you can use the earthcare_id mapping table to identify which parquet file or files contain that frame.
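For direct file access, every read in this notebook needs the same object URL pattern and the same anonymous-access options. The sketch below collects them in two hypothetical helpers, `object_url` and `storage_options`, which are illustrative names only. The bucket, prefix, endpoint, and region values are the ones defined in the configuration cell above.

```python
# Values copied from the storage configuration cell of this notebook.
BUCKET = "s3://EarthCODE/"
PREFIX = "OSCAssets/storm-data/"
ENDPOINT_URL = "https://s3.waw4-1.cloudferro.com"
REGION_NAME = "eu-west-2"


def object_url(file_name: str) -> str:
    """Return the full s3:// URL of a file inside the storm-data prefix."""
    return f"{BUCKET}{PREFIX}{file_name}"


def storage_options() -> dict:
    """Return the anonymous-access options passed to the Parquet readers."""
    return {
        "anon": True,
        "client_kwargs": {
            "endpoint_url": ENDPOINT_URL,
            "region_name": REGION_NAME,
        },
    }


print(object_url("EC_lightning_GLM_2026_1.parquet"))
```

With these in place, a direct read would look like `pd.read_parquet(object_url(name), storage_options=storage_options())`, which is the same call pattern used in the cells of this notebook.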

Locate a parquet file for a specific EarthCARE frame

A single EarthCARE frame may be available in more than one monthly parquet file, corresponding to different lightning sources (GLM and LI).

The cells below:

  • load the earthcare_id mapping table

  • select an example earthcare_id

  • define an optional geographic bounding box

  • choose the lightning source to load (GLM or LI)

  • identify the parquet file corresponding to the selected source

# read earthcare_id file mapping
mapping = pd.read_parquet(
    f"{bucket}{prefix}earthcare_id_mapping.parquet",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    }
).set_index('earthcare_id')
mapping
mapping.loc['09353D'].values
array([array(['EC_lightning_GLM_2026_1.parquet'], dtype=object)], dtype=object)
earthcare_id = "09353D"
example_bbox = (-62.0, -25.5, -55.5, -18.0)
# user choice: "GLM", or "LI"
source = "GLM"

matched_files = mapping.loc[earthcare_id].values[0]
print("Available files:", matched_files)

selected_file = [f for f in matched_files if f"_{source}_" in f][0]
print("Selected file:", selected_file)
Available files: ['EC_lightning_GLM_2026_1.parquet']
Selected file: EC_lightning_GLM_2026_1.parquet
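The selection above indexes the first match directly, so it raises a bare IndexError when the frame has no file for the requested source. A slightly more defensive sketch is shown below. `select_files` is a hypothetical helper, and it returns every match on the assumption that a frame could appear in more than one file per source.

```python
def select_files(matched_files, source):
    """Return the file names in matched_files that belong to the given source."""
    token = f"_{source}_"  # e.g. "_GLM_" in EC_lightning_GLM_2026_1.parquet
    candidates = [name for name in matched_files if token in name]
    if not candidates:
        raise LookupError(
            f"No {source} file for this frame; available: {list(matched_files)}"
        )
    return candidates


# Illustrative input only: the second file name is made up for the example.
files = ["EC_lightning_GLM_2026_1.parquet", "EC_lightning_LI_2026_1.parquet"]
print(select_files(files, "LI"))
```

In the notebook, `matched_files` would be the array obtained from `mapping.loc[earthcare_id].values[0]`.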

Using GeoPandas to load the data

Read the selected parquet file

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    }
)

Or read only a spatial subset of the selected parquet file

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    },
    bbox=example_bbox,
)

Or filter the data of the selected parquet file by EarthCARE ID

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    },
    filters=[('earthcare_id', "==", earthcare_id)],
)
gdf
# save the filtered subset to a local Parquet file with the same name
gdf.to_parquet(selected_file)

Using DuckDB to load the data

Read a spatial subset of the selected parquet file

# Configure DuckDB for your S3 endpoint
duckdb.sql("INSTALL httpfs; LOAD httpfs; INSTALL spatial; LOAD spatial;")
duckdb.sql(f"SET s3_endpoint='{endpoint_url.replace('https://', '')}';")
duckdb.sql(f"SET s3_region='{region_name}';")
file_path = f"{bucket}{prefix}{selected_file}"

query = f"""
SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE longitude >= {example_bbox[0]} 
      AND longitude <= {example_bbox[2]}
      AND latitude >= {example_bbox[1]} 
      AND latitude <= {example_bbox[3]} 
"""

# Materialise the result as a pyarrow Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
gdf
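The bounding box used in the spatial query is ordered as (min longitude, min latitude, max longitude, max latitude), which is easy to get wrong when the indices are written out by hand. As a sketch, the hypothetical helper `bbox_where_clause` below unpacks the tuple by name and produces an equivalent inclusive filter. It assumes the `longitude` and `latitude` column names used in the query above and a box that does not cross the antimeridian.

```python
def bbox_where_clause(bbox, lon_col="longitude", lat_col="latitude"):
    """Build an inclusive SQL filter from (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = (float(v) for v in bbox)
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError(f"bbox must be (min_lon, min_lat, max_lon, max_lat), got {bbox}")
    # BETWEEN is inclusive on both ends, matching the >= / <= comparisons.
    return (
        f"{lon_col} BETWEEN {min_lon} AND {max_lon} "
        f"AND {lat_col} BETWEEN {min_lat} AND {max_lat}"
    )


example_bbox = (-62.0, -25.5, -55.5, -18.0)
print(bbox_where_clause(example_bbox))
```

The returned string can be placed after `WHERE` in the query text. Converting each bound with `float` also keeps arbitrary strings out of the SQL.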

Filter by EarthCARE ID

file_path = f"{bucket}{prefix}{selected_file}"

query = f"""
    SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE earthcare_id = '{earthcare_id}'
"""

# Materialise the result as a pyarrow Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()
points = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
points
points['parent_cluster_id'].unique()
array([ -1, 0, 1, 2, 3, 4, 6, 5, 7, 8, 31, 30, 18, 34, 32, 37, 40, 39, 36, 38, 41, 42, 44, 46, 45, 43, 54, 52, 56, 57, 59, 61, 58, 55, 53, 49, 47, 48, 50, 51, 60, 35, 33, 28, 26, 27, 29, 11, 16, 21, 22, 15, 17, 23, 24, 25, 20, 14, 13, 12, 9, 10, 83, 102, 101, 100, 88, 92, 90, 91, 93, 96, 95, 94, 97, 98, 99, 89, 87, 86, 80, 81, 78, 64, 66, 63, 68, 69, 71, 70, 73, 74, 75, 76, 67, 65, 62, 79, 77, 82, 84, 85], dtype=int16)
points['cluster_id'].unique()
array([nan, 5., 17., -1., 15., 13., 10., 11., 12., 8., 21., 7., 3., 22., 1., 16., 9., 14., 4., 0., 19., 6., 20., 27., 28., 29., 26., 25., 18., 24., 2., 23.], dtype=float32)
# gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')

# gdf.explore()
# Load the along-track lightning counts (GLM) for the same EarthCARE frame
file_path = f"{bucket}{prefix}EC_track_lightning_GLM.parquet"

query = f"""
    SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE earthcare_id = '{earthcare_id}'
"""

# Materialise the result as a pyarrow Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()

track = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
track
track.cluster_id.unique()
array([ 3, 4, 5, 6, 8, 10, 11, 17, 23, 24, 25, 26, 27, 28, 29])
points.parent_cluster_id.unique()
array([ -1, 0, 1, 2, 3, 4, 6, 5, 7, 8, 31, 30, 18, 34, 32, 37, 40, 39, 36, 38, 41, 42, 44, 46, 45, 43, 54, 52, 56, 57, 59, 61, 58, 55, 53, 49, 47, 48, 50, 51, 60, 35, 33, 28, 26, 27, 29, 11, 16, 21, 22, 15, 17, 23, 24, 25, 20, 14, 13, 12, 9, 10, 83, 102, 101, 100, 88, 92, 90, 91, 93, 96, 95, 94, 97, 98, 99, 89, 87, 86, 80, 81, 78, 64, 66, 63, 68, 69, 71, 70, 73, 74, 75, 76, 67, 65, 62, 79, 77, 82, 84, 85], dtype=int16)
# Overlay the along-track samples (orange) and the lightning groups of cluster 29 on one map
m = track[track.cluster_id == 29].explore(color='orange')
m = points[points.cluster_id == 29].explore(m=m)
m
# Load the along-track lightning counts (LI) for the same EarthCARE frame
file_path = f"{bucket}{prefix}EC_track_lightning_LI.parquet"

query = f"""
    SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE earthcare_id = '{earthcare_id}'
"""

# Materialise the result as a pyarrow Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()

track = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
track
track['cluster_id'].unique()
array([0, 1, 2])
track.explore()
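The DuckDB cells above repeat the same query text and change only the file path. A minimal sketch of a reusable builder is given below; `frame_query` is a hypothetical helper, not part of DuckDB. It doubles any single quotes so that the inserted values cannot terminate the SQL string literals early.

```python
def frame_query(file_path: str, earthcare_id: str) -> str:
    """Build the per-frame query used in this notebook for a given Parquet file."""
    # Double single quotes so the values stay inside their SQL string literals.
    safe_path = file_path.replace("'", "''")
    safe_id = earthcare_id.replace("'", "''")
    return (
        "SELECT * EXCLUDE (geometry), ST_AsWKB(geometry) AS geometry "
        f"FROM read_parquet('{safe_path}') "
        f"WHERE earthcare_id = '{safe_id}'"
    )


query = frame_query(
    "s3://EarthCODE/OSCAssets/storm-data/EC_track_lightning_LI.parquet",
    "09353D",
)
print(query)
```

The resulting string can be passed to `duckdb.sql(...)` in place of the inline query text, once the httpfs and spatial extensions are loaded as shown earlier.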

EarthCARE along-track lightning counts & lightning storm catalogue collections

These collections can be accessed directly by loading the corresponding parquet files.

For the EarthCARE along-track lightning counts collection, there are two files, one for each lightning source:

  • EC_track_lightning_GLM.parquet

  • EC_track_lightning_LI.parquet

For the storm catalogue, there are two complementary files:

  • EC_lightning_clusters.parquet — cluster summary information

  • EC_lightning_cluster_evolution.parquet — time-evolution information for each cluster
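To keep the four file names in one place, they can be gathered in a small lookup table, sketched below. The dictionary keys (`along_track_counts`, `storm_catalogue`, `clusters`, `evolution`) and the `collection_url` helper are hypothetical labels chosen for this example. The file names, bucket, and prefix are the ones given in this notebook.

```python
# File names as listed above; the dictionary keys are illustrative labels.
COLLECTION_FILES = {
    "along_track_counts": {
        "GLM": "EC_track_lightning_GLM.parquet",
        "LI": "EC_track_lightning_LI.parquet",
    },
    "storm_catalogue": {
        "clusters": "EC_lightning_clusters.parquet",
        "evolution": "EC_lightning_cluster_evolution.parquet",
    },
}


def collection_url(collection, key,
                   bucket="s3://EarthCODE/", prefix="OSCAssets/storm-data/"):
    """Return the s3:// URL of one file of a collection."""
    return f"{bucket}{prefix}{COLLECTION_FILES[collection][key]}"


print(collection_url("storm_catalogue", "evolution"))
```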

selected_file = 'EC_lightning_clusters.parquet'
earthcare_id = "01101E"

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    },
    # optional filtering
    filters=[('earthcare_id', "==", earthcare_id)],
)
gdf