Accessing Lightning2EarthCARE data collections

This notebook demonstrates how to access and query Lightning2EarthCARE data collections stored in object storage. It includes examples for locating and loading data from different collections, using both GeoPandas and DuckDB.

The project provides three related datasets describing lightning activity in the context of EarthCARE observations:

1. EarthCARE-frame lightning – lightning groups collocated with individual EarthCARE MSI-like frames, including activity within the frame and a surrounding 0.5° box within ±1 hour of overpass time.

2. EarthCARE along-track lightning counts – lightning statistics referenced to EarthCARE CPR samples along the nadir track, based on counts within defined spatial and temporal windows.

3. EarthCARE lightning storm catalogue – lightning clusters sampled along the EarthCARE nadir track by CPR and ATLID, with both cluster-level properties and time-evolving lightning activity around the overpass.

All datasets are distributed as Parquet files, with MTG-LI and GOES-GLM observations provided separately where relevant.

# Imports and storage configuration
import geopandas as gpd
import pandas as pd
import duckdb

# setup bucket access
bucket = 's3://EarthCODE/'
endpoint_url = "https://s3.waw4-1.cloudferro.com"
region_name = "eu-west-2"
prefix = 'OSCAssets/storm-data/'
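Every read in this notebook passes the same `storage_options` dictionary. As a minimal sketch, it could be defined once and reused. The name `STORAGE_OPTIONS` is a hypothetical helper, not part of the original notebook; the values are copied from the configuration cell above.

```python
# Values copied from the configuration cell above.
endpoint_url = "https://s3.waw4-1.cloudferro.com"
region_name = "eu-west-2"

# Hypothetical helper name: one shared dictionary for anonymous
# (unauthenticated) access, as used by every read in this notebook.
STORAGE_OPTIONS = {
    "anon": True,
    "client_kwargs": {
        "endpoint_url": endpoint_url,
        "region_name": region_name,
    },
}

# Usage (requires network access), for example:
# gdf = gpd.read_parquet(f"{bucket}{prefix}{selected_file}",
#                        storage_options=STORAGE_OPTIONS)
```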

EarthCARE-frame lightning collection

This collection is stored in monthly parquet files.
Files follow the naming convention:

EC_lightning_<SOURCE>_<YEAR>_<MONTH>.parquet

where:

<SOURCE> is the lightning source (GLM or LI)
<YEAR> is the calendar year
<MONTH> is the calendar month

These files can be accessed in two ways:

1. Direct file access – if you already know which source, year, and month you want, you can load the corresponding parquet file directly.

2. Mapping-based access – if you want data for a specific EarthCARE frame, you can use the earthcare_id mapping table to identify which parquet file or files contain that frame.
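For direct file access, the file name can be assembled from the naming convention above. The sketch below is illustrative only: `build_filename` and `parse_filename` are hypothetical helpers, and the unpadded month is an assumption based on the single file name that appears in the mapping output later in this notebook (`EC_lightning_GLM_2026_1.parquet`).

```python
import re

# Pattern for EC_lightning_<SOURCE>_<YEAR>_<MONTH>.parquet.
# Assumption: the month is written without zero padding.
_PATTERN = re.compile(r"^EC_lightning_(GLM|LI)_(\d{4})_(\d{1,2})\.parquet$")


def build_filename(source: str, year: int, month: int) -> str:
    """Assemble a monthly file name from source, year, and month."""
    if source not in ("GLM", "LI"):
        raise ValueError(f"source must be 'GLM' or 'LI', got {source!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"EC_lightning_{source}_{year}_{month}.parquet"


def parse_filename(name: str):
    """Split a monthly file name into (source, year, month)."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a monthly EC_lightning file name: {name!r}")
    source, year, month = match.groups()
    return source, int(year), int(month)


print(build_filename("GLM", 2026, 1))
```

The resulting name can be appended to `bucket` and `prefix` in the same way as `selected_file` in the cells that follow.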

Locate a parquet file for a specific EarthCARE frame

A single EarthCARE frame may be available in more than one monthly parquet file, corresponding to different lightning sources (GLM and LI).

The cells below:

1. load the earthcare_id mapping table
2. select an example earthcare_id
3. define an optional geographic bounding box
4. choose the lightning source to load (GLM or LI)
5. identify the parquet file corresponding to the selected source

# read earthcare_id file mapping
mapping = pd.read_parquet(
    f"{bucket}{prefix}earthcare_id_mapping.parquet",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    }
).set_index('earthcare_id')
mapping

mapping.loc['09353D'].values

array([array(['EC_lightning_GLM_2026_1.parquet'], dtype=object)],
      dtype=object)

earthcare_id = "09353D"
example_bbox = (-62.0, -25.5, -55.5, -18.0)
# user choice: "GLM" or "LI"
source = "GLM"

matched_files = mapping.loc[earthcare_id].values[0]
print("Available files:", matched_files)

selected_file = [f for f in matched_files if f"_{source}_" in f][0]
print("Selected file:", selected_file)

Available files: ['EC_lightning_GLM_2026_1.parquet']
Selected file: EC_lightning_GLM_2026_1.parquet
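The list comprehension above raises a bare `IndexError` when the frame has no file for the requested source (here the frame only has a GLM file, so `source = "LI"` would fail). A minimal, more defensive sketch follows; `select_file` is a hypothetical helper name, not part of the original notebook.

```python
def select_file(matched_files, source):
    """Return the first file in matched_files that belongs to the given source."""
    candidates = [f for f in matched_files if f"_{source}_" in f]
    if not candidates:
        # Report which files were available instead of failing with IndexError.
        raise LookupError(
            f"No {source} file for this frame; available: {list(matched_files)}"
        )
    return candidates[0]


# File list copied from the output above.
files = ["EC_lightning_GLM_2026_1.parquet"]
print(select_file(files, "GLM"))
```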

Using GeoPandas to load the data

Read the selected parquet file

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    }
)

Or read only a spatial subset of the selected parquet file

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    },
    bbox=example_bbox,
)
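The bounding box is a plain tuple, so a swapped pair of values silently selects the wrong region or nothing at all. The sketch below checks the ordering before the read. It assumes the order (min longitude, min latitude, max longitude, max latitude) in degrees, which is how the DuckDB query later in this notebook indexes `example_bbox`, and it assumes the box does not cross the antimeridian. `validate_bbox` is a hypothetical helper.

```python
def validate_bbox(bbox):
    """Check a (min_lon, min_lat, max_lon, max_lat) tuple given in degrees."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (-180 <= min_lon <= max_lon <= 180):
        raise ValueError(f"longitudes out of order or out of range: {bbox}")
    if not (-90 <= min_lat <= max_lat <= 90):
        raise ValueError(f"latitudes out of order or out of range: {bbox}")
    return bbox


# Bounding box copied from the cell that defines example_bbox.
example_bbox = validate_bbox((-62.0, -25.5, -55.5, -18.0))
print(example_bbox)
```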

Or filter the selected parquet file by EarthCARE ID

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    },
    filters=[('earthcare_id', "==", earthcare_id)],
)
gdf

# Save the filtered subset locally, under the same file name as the remote file
gdf.to_parquet(selected_file)

Using DuckDB to load the data

Read a spatial subset of the selected parquet file

# Configure DuckDB for your S3 endpoint
duckdb.sql("INSTALL httpfs; LOAD httpfs; INSTALL spatial; LOAD spatial;")
duckdb.sql(f"SET s3_endpoint='{endpoint_url.replace('https://', '')}';")
duckdb.sql(f"SET s3_region='{region_name}';")

file_path = f"s3://{bucket.replace('s3://', '')}{prefix}{selected_file}"

query = f"""
SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE longitude >= {example_bbox[0]} 
      AND longitude <= {example_bbox[2]}
      AND latitude >= {example_bbox[1]} 
      AND latitude <= {example_bbox[3]} 
"""

# Fetch as a pyarrow.Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
gdf

Filter by EarthCARE ID

file_path = f"s3://{bucket.replace('s3://', '')}{prefix}{selected_file}"

query = f"""
    SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE earthcare_id = '{earthcare_id}'
"""

# Fetch as a pyarrow.Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()
points = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
points
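The DuckDB cells in this notebook differ only in the file path and the `WHERE` clause. As a sketch, the query text could be generated by one function; `build_query` is a hypothetical helper, and the column names `earthcare_id`, `longitude`, `latitude`, and `geometry` are taken from the queries above.

```python
def build_query(file_path, earthcare_id=None, bbox=None):
    """Build the SELECT statement used above, with optional ID and bbox filters."""
    conditions = []
    if earthcare_id is not None:
        # Double any single quotes so the value stays inside the SQL literal.
        safe_id = str(earthcare_id).replace("'", "''")
        conditions.append(f"earthcare_id = '{safe_id}'")
    if bbox is not None:
        # bbox order: (min_lon, min_lat, max_lon, max_lat); float() rejects non-numbers.
        min_lon, min_lat, max_lon, max_lat = (float(v) for v in bbox)
        conditions.append(f"longitude >= {min_lon} AND longitude <= {max_lon}")
        conditions.append(f"latitude >= {min_lat} AND latitude <= {max_lat}")

    query = (
        "SELECT * EXCLUDE (geometry), ST_AsWKB(geometry) AS geometry\n"
        f"FROM read_parquet('{file_path}')"
    )
    if conditions:
        query += "\nWHERE " + " AND ".join(conditions)
    return query


print(build_query(
    "s3://EarthCODE/OSCAssets/storm-data/EC_lightning_GLM_2026_1.parquet",
    earthcare_id="09353D",
    bbox=(-62.0, -25.5, -55.5, -18.0),
))
```

The returned string can be passed to `duckdb.sql(...)` exactly like the hand-written `query` variables in the surrounding cells.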

points['parent_cluster_id'].unique()

array([ -1,   0,   1,   2,   3,   4,   6,   5,   7,   8,  31,  30,  18,
        34,  32,  37,  40,  39,  36,  38,  41,  42,  44,  46,  45,  43,
        54,  52,  56,  57,  59,  61,  58,  55,  53,  49,  47,  48,  50,
        51,  60,  35,  33,  28,  26,  27,  29,  11,  16,  21,  22,  15,
        17,  23,  24,  25,  20,  14,  13,  12,   9,  10,  83, 102, 101,
       100,  88,  92,  90,  91,  93,  96,  95,  94,  97,  98,  99,  89,
        87,  86,  80,  81,  78,  64,  66,  63,  68,  69,  71,  70,  73,
        74,  75,  76,  67,  65,  62,  79,  77,  82,  84,  85], dtype=int16)

points['cluster_id'].unique()

array([nan,  5., 17., -1., 15., 13., 10., 11., 12.,  8., 21.,  7.,  3.,
       22.,  1., 16.,  9., 14.,  4.,  0., 19.,  6., 20., 27., 28., 29.,
       26., 25., 18., 24.,  2., 23.], dtype=float32)

# gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')

# gdf.explore()

file_path = f"s3://{bucket.replace('s3://', '')}OSCAssets/storm-data/EC_track_lightning_GLM.parquet"

query = f"""
    SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE earthcare_id = '{earthcare_id}'
"""

# Fetch as a pyarrow.Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()

track = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
track

track.cluster_id.unique()

array([ 3, 4, 5, 6, 8, 10, 11, 17, 23, 24, 25, 26, 27, 28, 29])

points.parent_cluster_id.unique()

array([ -1,   0,   1,   2,   3,   4,   6,   5,   7,   8,  31,  30,  18,
        34,  32,  37,  40,  39,  36,  38,  41,  42,  44,  46,  45,  43,
        54,  52,  56,  57,  59,  61,  58,  55,  53,  49,  47,  48,  50,
        51,  60,  35,  33,  28,  26,  27,  29,  11,  16,  21,  22,  15,
        17,  23,  24,  25,  20,  14,  13,  12,   9,  10,  83, 102, 101,
       100,  88,  92,  90,  91,  93,  96,  95,  94,  97,  98,  99,  89,
        87,  86,  80,  81,  78,  64,  66,  63,  68,  69,  71,  70,  73,
        74,  75,  76,  67,  65,  62,  79,  77,  82,  84,  85], dtype=int16)

m = track[track.cluster_id == 29].explore(color='orange')
m = points[points.cluster_id == 29].explore(m=m)
m
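Cluster 29 in the map above appears in both tables. To list every cluster ID shared between the along-track counts and the frame lightning points, the unique values can be intersected as sets. The sketch below uses the values printed in the outputs above and assumes that `-1` and `NaN` mark lightning groups without a cluster; the notebook does not define them, so treat that as an assumption. `valid_ids` is a hypothetical helper.

```python
import math

# Values copied from the cell outputs above (GLM, earthcare_id 09353D).
track_cluster_ids = [3, 4, 5, 6, 8, 10, 11, 17, 23, 24, 25, 26, 27, 28, 29]
points_cluster_ids = [
    math.nan, 5.0, 17.0, -1.0, 15.0, 13.0, 10.0, 11.0, 12.0, 8.0, 21.0,
    7.0, 3.0, 22.0, 1.0, 16.0, 9.0, 14.0, 4.0, 0.0, 19.0, 6.0, 20.0,
    27.0, 28.0, 29.0, 26.0, 25.0, 18.0, 24.0, 2.0, 23.0,
]


def valid_ids(values):
    """Drop NaN and -1 (assumed to mean 'no cluster') and return integer IDs."""
    return {int(v) for v in values if not math.isnan(v) and v != -1}


shared = sorted(valid_ids(track_cluster_ids) & valid_ids(points_cluster_ids))
print(shared)
```

With the GeoDataFrames above, the same inputs come from `track.cluster_id.unique()` and `points['cluster_id'].unique()`.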

file_path = f"s3://{bucket.replace('s3://', '')}OSCAssets/storm-data/EC_track_lightning_LI.parquet"

query = f"""
    SELECT 
    * EXCLUDE (geometry), 
    ST_AsWKB(geometry) AS geometry
    FROM read_parquet('{file_path}') 
    WHERE earthcare_id = '{earthcare_id}'
"""

# Fetch as a pyarrow.Table, then convert to GeoPandas
arrow_table = duckdb.sql(query).to_arrow_table()
df = arrow_table.to_pandas()

track = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
track

track['cluster_id'].unique()

array([0, 1, 2])

track.explore()

EarthCARE along-track lightning counts & lightning storm catalogue collections

These collections can be accessed directly by loading the corresponding parquet files.

For the EarthCARE along-track lightning counts collection, there are two files, one for each lightning source:

EC_track_lightning_GLM.parquet
EC_track_lightning_LI.parquet

For the storm catalogue, there are two complementary files:

EC_lightning_clusters.parquet – cluster summary information
EC_lightning_cluster_evolution.parquet – time-evolution information for each cluster
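Because these four files have fixed names, their locations can be kept in a small lookup table. This is a sketch only: `COLLECTION_FILES`, its keys, and `collection_uri` are hypothetical names, while the file names, bucket, and prefix are the ones given in this notebook.

```python
# Bucket and prefix copied from the configuration cell at the top of the notebook.
bucket = "s3://EarthCODE/"
prefix = "OSCAssets/storm-data/"

# Hypothetical lookup table: the keys are illustrative labels,
# the file names come from the text above.
COLLECTION_FILES = {
    "track_counts_GLM": "EC_track_lightning_GLM.parquet",
    "track_counts_LI": "EC_track_lightning_LI.parquet",
    "clusters": "EC_lightning_clusters.parquet",
    "cluster_evolution": "EC_lightning_cluster_evolution.parquet",
}


def collection_uri(key):
    """Return the full S3 URI for one of the fixed-name collection files."""
    return f"{bucket}{prefix}{COLLECTION_FILES[key]}"


for key in COLLECTION_FILES:
    print(key, "->", collection_uri(key))
```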

selected_file = 'EC_lightning_clusters.parquet'
earthcare_id = "01101E"

gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={ "anon": True, 
                    "client_kwargs": {
                        "endpoint_url": endpoint_url,
                        "region_name": region_name
                    }
    },
    # optional filtering
    filters=[('earthcare_id', "==", earthcare_id)],
)
gdf