This notebook demonstrates how to access and query Lightning2EarthCARE data collections stored in object storage. It includes examples for locating and loading data from different collections, using both GeoPandas and DuckDB.
The project provides three related datasets describing lightning activity in the context of EarthCARE observations:
1. EarthCARE-frame lightning – lightning groups collocated with individual EarthCARE MSI-like frames, including activity within the frame and a surrounding 0.5° box within ±1 hour of overpass time.
2. EarthCARE along-track lightning counts – lightning statistics referenced to EarthCARE CPR samples along the nadir track, based on counts within defined spatial and temporal windows.
3. EarthCARE lightning storm catalogue – lightning clusters sampled along the EarthCARE nadir track by CPR and ATLID, with both cluster-level properties and time-evolving lightning activity around the overpass.
All datasets are distributed as Parquet files, with MTG-LI and GOES-GLM observations provided separately where relevant.
# Imports and storage configuration
import geopandas as gpd
import pandas as pd
import duckdb
# setup bucket access
bucket = 's3://EarthCODE/'
endpoint_url = "https://s3.waw4-1.cloudferro.com"
region_name = "eu-west-2"
prefix = 'OSCAssets/storm-data/'
EarthCARE-frame lightning collection
This collection is stored in monthly parquet files.
Files follow the naming convention:
EC_lightning_<SOURCE>_<YEAR>_<MONTH>.parquet
where:
- <SOURCE> is the lightning source (GLM or LI)
- <YEAR> is the calendar year
- <MONTH> is the calendar month
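The naming convention can be captured in two small helpers. This is a minimal sketch: the names `build_frame_filename` and `parse_frame_filename` are hypothetical, and the assumption that the month is written without zero-padding is inferred only from the example file name `EC_lightning_GLM_2026_1.parquet` that appears in this notebook.

```python
import re

VALID_SOURCES = ("GLM", "LI")

# Assumption: one or two digits for the month, i.e. no zero-padding is required.
_PATTERN = re.compile(r"^EC_lightning_(GLM|LI)_(\d{4})_(\d{1,2})\.parquet$")


def build_frame_filename(source: str, year: int, month: int) -> str:
    """Return the monthly file name for a lightning source, year and month."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}, got {source!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"EC_lightning_{source}_{year}_{month}.parquet"


def parse_frame_filename(name: str) -> tuple[str, int, int]:
    """Split a monthly file name into (source, year, month)."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a monthly EarthCARE-frame file name: {name!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


print(build_frame_filename("GLM", 2026, 1))  # EC_lightning_GLM_2026_1.parquet
```

The resulting name can be appended to `bucket` and `prefix` to form the full object path.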
These files can be accessed in two ways:
1. Direct file access
   If you already know which source, year, and month you want, you can load the corresponding parquet file directly.
2. Mapping-based access
   If you want data for a specific EarthCARE frame, you can use the earthcare_id mapping table to identify which parquet file or files contain that frame.
Locate a parquet file for a specific EarthCARE frame
A single EarthCARE frame may be available in more than one monthly parquet file, corresponding to different lightning sources (GLM and LI).
The cells below:
- load the earthcare_id mapping table
- select an example earthcare_id
- define an optional geographic bounding box
- choose the lightning source to load (GLM or LI)
- identify the parquet file corresponding to the selected source
# read earthcare_id file mapping
mapping = pd.read_parquet(
    f"{bucket}{prefix}earthcare_id_mapping.parquet",
    storage_options={
        "anon": True,
        "client_kwargs": {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
        },
    },
).set_index('earthcare_id')
mapping
mapping.loc['09353D'].values
array([array(['EC_lightning_GLM_2026_1.parquet'], dtype=object)],
      dtype=object)
earthcare_id = "09353D"
example_bbox = (-62.0, -25.5, -55.5, -18.0)
# user choice: "GLM", or "LI"
source = "GLM"
matched_files = mapping.loc[earthcare_id].values[0]
print("Available files:", matched_files)
selected_file = [f for f in matched_files if f"_{source}_" in f][0]
print("Selected file:", selected_file)
Available files: ['EC_lightning_GLM_2026_1.parquet']
Selected file: EC_lightning_GLM_2026_1.parquet
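Note that the list comprehension above raises a bare IndexError when the requested source has no file for the chosen frame. A minimal sketch of a more explicit selection follows; `select_file` is a hypothetical helper, not part of the project.

```python
def select_file(matched_files, source):
    """Return the first file name in matched_files that belongs to source."""
    candidates = [f for f in matched_files if f"_{source}_" in f]
    if not candidates:
        # Fail with a message that names the source and the files actually available.
        raise LookupError(f"no {source} file among {list(matched_files)}")
    return candidates[0]


files = ["EC_lightning_GLM_2026_1.parquet"]
print(select_file(files, "GLM"))  # EC_lightning_GLM_2026_1.parquet
```

With the mapping table loaded, the call would be `select_file(matched_files, source)`.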
gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={
        "anon": True,
        "client_kwargs": {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
        },
    },
)
Or read only a spatial subset of the selected parquet file
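For reference, GeoPandas interprets `bbox` as (minx, miny, maxx, maxy), which for geographic coordinates means (west, south, east, north). The sketch below is purely illustrative: `BBox` is a hypothetical helper that names the four values of `example_bbox` and checks whether a point falls inside.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Bounding box in (minx, miny, maxx, maxy) order, here in degrees."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon: float, lat: float) -> bool:
        """Return True if the point lies inside or on the edge of the box."""
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


example_bbox = BBox(-62.0, -25.5, -55.5, -18.0)
print(example_bbox.contains(-60.0, -20.0))  # True: inside the box
print(example_bbox.contains(-50.0, -20.0))  # False: east of the box
```

Because a NamedTuple is still a tuple, a value like this can be passed wherever a plain 4-tuple is expected.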
gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={
        "anon": True,
        "client_kwargs": {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
        },
    },
    bbox=example_bbox,
)
Or filter the data of the selected parquet file by EarthCARE ID
gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={
        "anon": True,
        "client_kwargs": {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
        },
    },
    filters=[('earthcare_id', "==", earthcare_id)],
)
gdf
# save subset of parquet file (written to the current working directory)
gdf.to_parquet(selected_file)
# Configure DuckDB for your S3 endpoint
duckdb.sql("INSTALL httpfs; LOAD httpfs; INSTALL spatial; LOAD spatial;")
duckdb.sql(f"SET s3_endpoint='{endpoint_url.replace('https://', '')}';")
duckdb.sql(f"SET s3_region='{region_name}';")
file_path = f"s3://{bucket.replace('s3://', '')}{prefix}{selected_file}"
query = f"""
SELECT
* EXCLUDE (geometry),
ST_AsWKB(geometry) AS geometry
FROM read_parquet('{file_path}')
WHERE longitude >= {example_bbox[0]}
AND longitude <= {example_bbox[2]}
AND latitude >= {example_bbox[1]}
AND latitude <= {example_bbox[3]}
"""
# Fetch as an Arrow table, then convert to GeoPandas
# fetch_arrow_table() returns a pyarrow.Table
arrow_table = duckdb.sql(query).fetch_arrow_table()
df = arrow_table.to_pandas()
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
gdf
Column filtering
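The query in this section interpolates `earthcare_id` directly into the SQL text. A minimal sketch of validating the value first is shown here. The helper name `frame_query` and the ID pattern (five digits followed by one uppercase letter, inferred only from the examples '09353D' and '01101E') are assumptions, not a documented format.

```python
import re

# Assumption: IDs look like '09353D' (five digits plus one uppercase letter).
_ID_PATTERN = re.compile(r"^\d{5}[A-Z]$")


def frame_query(file_path: str, earthcare_id: str) -> str:
    """Build the per-frame SELECT statement after checking its inputs."""
    if not _ID_PATTERN.match(earthcare_id):
        raise ValueError(f"unexpected earthcare_id format: {earthcare_id!r}")
    if "'" in file_path:
        raise ValueError("file_path must not contain a single quote")
    return (
        "SELECT * EXCLUDE (geometry), ST_AsWKB(geometry) AS geometry "
        f"FROM read_parquet('{file_path}') "
        f"WHERE earthcare_id = '{earthcare_id}'"
    )


print(frame_query("s3://EarthCODE/OSCAssets/storm-data/EC_lightning_GLM_2026_1.parquet", "09353D"))
```

DuckDB's Python API also accepts prepared-statement parameters, for example `duckdb.execute(query, [earthcare_id])` with a `?` placeholder in the query, which avoids string interpolation of the value altogether.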
file_path = f"s3://{bucket.replace('s3://', '')}{prefix}{selected_file}"
query = f"""
SELECT
* EXCLUDE (geometry),
ST_AsWKB(geometry) AS geometry
FROM read_parquet('{file_path}')
WHERE earthcare_id = '{earthcare_id}'
"""
# Fetch as an Arrow table, then convert to GeoPandas
# fetch_arrow_table() returns a pyarrow.Table
arrow_table = duckdb.sql(query).fetch_arrow_table()
df = arrow_table.to_pandas()
points = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
points
points['parent_cluster_id'].unique()
array([ -1, 0, 1, 2, 3, 4, 6, 5, 7, 8, 31, 30, 18,
34, 32, 37, 40, 39, 36, 38, 41, 42, 44, 46, 45, 43,
54, 52, 56, 57, 59, 61, 58, 55, 53, 49, 47, 48, 50,
51, 60, 35, 33, 28, 26, 27, 29, 11, 16, 21, 22, 15,
17, 23, 24, 25, 20, 14, 13, 12, 9, 10, 83, 102, 101,
100, 88, 92, 90, 91, 93, 96, 95, 94, 97, 98, 99, 89,
87, 86, 80, 81, 78, 64, 66, 63, 68, 69, 71, 70, 73,
74, 75, 76, 67, 65, 62, 79, 77, 82, 84, 85], dtype=int16)
points['cluster_id'].unique()
array([nan, 5., 17., -1., 15., 13., 10., 11., 12., 8., 21., 7., 3.,
22., 1., 16., 9., 14., 4., 0., 19., 6., 20., 27., 28., 29.,
26., 25., 18., 24., 2., 23.], dtype=float32)
# gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
# gdf.explore()
file_path = f"s3://{bucket.replace('s3://', '')}OSCAssets/storm-data/EC_track_lightning_GLM.parquet"
query = f"""
SELECT
* EXCLUDE (geometry),
ST_AsWKB(geometry) AS geometry
FROM read_parquet('{file_path}')
WHERE earthcare_id = '{earthcare_id}'
"""
# Fetch as an Arrow table, then convert to GeoPandas
# fetch_arrow_table() returns a pyarrow.Table
arrow_table = duckdb.sql(query).fetch_arrow_table()
df = arrow_table.to_pandas()
track = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
track
track.cluster_id.unique()
array([ 3, 4, 5, 6, 8, 10, 11, 17, 23, 24, 25, 26, 27, 28, 29])
points.parent_cluster_id.unique()
array([ -1, 0, 1, 2, 3, 4, 6, 5, 7, 8, 31, 30, 18,
34, 32, 37, 40, 39, 36, 38, 41, 42, 44, 46, 45, 43,
54, 52, 56, 57, 59, 61, 58, 55, 53, 49, 47, 48, 50,
51, 60, 35, 33, 28, 26, 27, 29, 11, 16, 21, 22, 15,
17, 23, 24, 25, 20, 14, 13, 12, 9, 10, 83, 102, 101,
100, 88, 92, 90, 91, 93, 96, 95, 94, 97, 98, 99, 89,
87, 86, 80, 81, 78, 64, 66, 63, 68, 69, 71, 70, 73,
74, 75, 76, 67, 65, 62, 79, 77, 82, 84, 85], dtype=int16)
# overlay the along-track samples (orange) and the lightning points of cluster 29 on one map
m = track[track.cluster_id == 29].explore(color='orange')
m = points[points.cluster_id == 29].explore(m=m)
m
file_path = f"s3://{bucket.replace('s3://', '')}OSCAssets/storm-data/EC_track_lightning_LI.parquet"
query = f"""
SELECT
* EXCLUDE (geometry),
ST_AsWKB(geometry) AS geometry
FROM read_parquet('{file_path}')
WHERE earthcare_id = '{earthcare_id}'
"""
# Fetch as an Arrow table, then convert to GeoPandas
# fetch_arrow_table() returns a pyarrow.Table
arrow_table = duckdb.sql(query).fetch_arrow_table()
df = arrow_table.to_pandas()
track = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df['geometry']), crs='epsg:4326')
track
track['cluster_id'].unique()
array([0, 1, 2])
track.explore()
EarthCARE along-track lightning counts & lightning storm catalogue collections
These collections can be accessed directly by loading the corresponding parquet files.
For the EarthCARE along-track lightning counts collection, there are two files, one for each lightning source:
- EC_track_lightning_GLM.parquet
- EC_track_lightning_LI.parquet
For the storm catalogue, there are two complementary files:
- EC_lightning_clusters.parquet: cluster summary information
- EC_lightning_cluster_evolution.parquet: time-evolution information for each cluster
selected_file = 'EC_lightning_clusters.parquet'
earthcare_id = "01101E"
gdf = gpd.read_parquet(
    f"{bucket}{prefix}{selected_file}",
    storage_options={
        "anon": True,
        "client_kwargs": {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
        },
    },
    # optional filtering
    filters=[('earthcare_id', "==", earthcare_id)],
)
gdf
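The same storage_options dictionary is repeated in every read call in this notebook. A small helper can factor it out. This is only a sketch: `make_storage_options` is a hypothetical name, and the endpoint and region values are the ones defined in the configuration cell.

```python
def make_storage_options(endpoint_url, region_name, anon=True):
    """Build the storage_options dict used for anonymous S3 access."""
    return {
        "anon": anon,
        "client_kwargs": {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
        },
    }


storage_options = make_storage_options(
    "https://s3.waw4-1.cloudferro.com",  # endpoint_url from the configuration cell
    "eu-west-2",                         # region_name from the configuration cell
)
print(storage_options)
```

Each read then shortens to a call such as `gpd.read_parquet(f"{bucket}{prefix}{selected_file}", storage_options=storage_options)`.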