Data exploration of the Climate TRACE dataset


In this chapter, we look at the dataset and we answer a few questions:

which countries are the biggest emitters?
what sectors are the most responsible for emissions?

In a second part, we compare two different views on emissions:

the source view, which looks from a bottom up view at all the sources (factories, mines, farms, …)
the country view, which is a top-down approach and derives many values from aggregate economic activity, for example how many tons of coal were burned annually. This is the official method followed by countries to report their emissions to the United Nations as part of the Paris Agreement.

The data has already been prepared from the original Climate TRACE dataset. If you want to understand the preprocessing, read the chapter Ingestion.

%load_ext autoreload
%autoreload 2

import logging
logging.basicConfig(level=logging.INFO)

We import all the libraries that we will use in this notebook:

the Polars library, a very fast package with a clear interface.
the Plotly Express visualization library
the ctrace package (included in this repository), that contains tools to read and understand the Climate TRACE data.

import polars as pl
import plotly.io
plotly.io.templates.default = "plotly_white"
import plotly.express as px

from ctrace.constants import * # We import many useful constants
import ctrace as ct

Country emissions

The country emissions are available through the read_country_emissions() function. By default, this function downloads the latest set of reports (currently V3, published in November 2024). If the data has already been downloaded, it uses the local copy.

The function returns a Polars DataFrame. This format will be very familiar to people used to working with Pandas, Spark or R dataframes. The GAS_LIST argument indicates that we want to load all the gases available.

cedf = ct.read_country_emissions(GAS_LIST)

cedf.head(3)

shape: (3, 11)

iso3_country	start_time	end_time	gas	sector	subsector	emissions_quantity	emissions_quantity_units	temporal_granularity	created_date	modified_date
enum	datetime[ms, UTC]	datetime[ms, UTC]	enum	enum	enum	f64	cat	enum	datetime[ms, UTC]	datetime[ms, UTC]
"ABW"	2015-01-01 00:00:00 UTC	2015-12-31 00:00:00 UTC	"co2"	"fossil-fuel-operations"	"other-fossil-fuel-operations"	0.0	null	"annual"	null	null
"ABW"	2015-01-01 00:00:00 UTC	2015-12-31 00:00:00 UTC	"co2"	"mineral-extraction"	"bauxite-mining"	0.0	null	"annual"	null	null
"ABW"	2015-01-01 00:00:00 UTC	2015-12-31 00:00:00 UTC	"co2"	"transportation"	"domestic-shipping"	90613.375994	null	"annual"	null	null

Understanding the data format

These rows are a lot to digest. Here is how the emissions data is structured.

This data is highly structured, which is reflected in the schema itself. This schema has a lot of enumerations (iso3_country, gas, …), where only a few values are expected. For example, all countries are represented by their official ISO 3166 3-letter codes. Internally, Polars can assign small integers to represent them all, which consumes less memory and speeds up manipulations. This has a further advantage: since we know the values to expect, we can give them names in the code such as CH4, CO2, …. We can then use all the usual software tools to find and update references to the various gases as we work with them.

TODO

move all this part in its own notebook, this is about technical details

cedf.schema

Schema([('iso3_country',
         Enum(categories=['ABW', 'AFG', 'AGO', 'AIA', 'ALA', 'ALB', 'AND', 'ARE', 'ARG', 'ARM', 'ASM', 'ATA', 'ATF', 'ATG', 'AUS', 'AUT', 'AZE', 'BDI', 'BEL', 'BEN', 'BES', 'BFA', 'BGD', 'BGR', 'BHR', 'BHS', 'BIH', 'BLM', 'BLR', 'BLZ', 'BMU', 'BOL', 'BRA', 'BRB', 'BRN', 'BTN', 'BVT', 'BWA', 'CAF', 'CAN', 'CCK', 'CHE', 'CHL', 'CHN', 'CIV', 'CMR', 'COD', 'COG', 'COK', 'COL', 'COM', 'CPV', 'CRI', 'CUB', 'CUW', 'CXR', 'CYM', 'CYP', 'CZE', 'DEU', 'DJI', 'DMA', 'DNK', 'DOM', 'DZA', 'ECU', 'EGY', 'ERI', 'ESH', 'ESP', 'EST', 'ETH', 'FIN', 'FJI', 'FLK', 'FRA', 'FRO', 'FSM', 'GAB', 'GBR', 'GEO', 'GGY', 'GHA', 'GIB', 'GIN', 'GLP', 'GMB', 'GNB', 'GNQ', 'GRC', 'GRD', 'GRL', 'GTM', 'GUF', 'GUM', 'GUY', 'HKG', 'HMD', 'HND', 'HRV', 'HTI', 'HUN', 'IDN', 'IMN', 'IND', 'IOT', 'IRL', 'IRN', 'IRQ', 'ISL', 'ISR', 'ITA', 'JAM', 'JEY', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ', 'KHM', 'KIR', 'KNA', 'KOR', 'KWT', 'LAO', 'LBN', 'LBR', 'LBY', 'LCA', 'LIE', 'LKA', 'LSO', 'LTU', 'LUX', 'LVA', 'MAC', 'MAF', 'MAR', 'MCO', 'MDA', 'MDG', 'MDV', 'MEX', 'MHL', 'MKD', 'MLI', 'MLT', 'MMR', 'MNE', 'MNG', 'MNP', 'MOZ', 'MRT', 'MSR', 'MTQ', 'MUS', 'MWI', 'MYS', 'MYT', 'NAM', 'NCL', 'NER', 'NFK', 'NGA', 'NIC', 'NIU', 'NLD', 'NOR', 'NPL', 'NRU', 'NZL', 'OMN', 'PAK', 'PAN', 'PCN', 'PER', 'PHL', 'PLW', 'PNG', 'POL', 'PRI', 'PRK', 'PRT', 'PRY', 'PSE', 'PYF', 'QAT', 'REU', 'ROU', 'RUS', 'RWA', 'SAU', 'SDN', 'SEN', 'SGP', 'SGS', 'SHN', 'SJM', 'SLB', 'SLE', 'SLV', 'SMR', 'SOM', 'SPM', 'SRB', 'SSD', 'STP', 'SUR', 'SVK', 'SVN', 'SWE', 'SWZ', 'SXM', 'SYC', 'SYR', 'TCA', 'TCD', 'TGO', 'THA', 'TJK', 'TKL', 'TKM', 'TLS', 'TON', 'TTO', 'TUN', 'TUR', 'TUV', 'TWN', 'TZA', 'UGA', 'UKR', 'UMI', 'URY', 'USA', 'UZB', 'VAT', 'VCT', 'VEN', 'VGB', 'VIR', 'VNM', 'VUT', 'WLF', 'WSM', 'XKX', 'YEM', 'ZAF', 'ZMB', 'ZWE', 'ZNC', 'UNK', 'SCG', 'XAD'])),
        ('start_time', Datetime(time_unit='ms', time_zone='UTC')),
        ('end_time', Datetime(time_unit='ms', time_zone='UTC')),
        ('gas', Enum(categories=['co2', 'ch4', 'n2o', 'co2e_100yr'])),
        ('sector',
         Enum(categories=['agriculture', 'buildings', 'fluorinated-gases', 'forestry-and-land-use', 'fossil-fuel-operations', 'manufacturing', 'mineral-extraction', 'power', 'transportation', 'waste'])),
        ('subsector',
         Enum(categories=['aluminum', 'bauxite-mining', 'biological-treatment-of-solid-waste-and-biogenic', 'cement', 'chemicals', 'coal-mining', 'copper-mining', 'crop-residues', 'cropland-fires', 'domestic-aviation', 'domestic-shipping', 'domestic-shipping-ship', 'domestic-wastewater-treatment-and-discharge', 'electricity-generation', 'enteric-fermentation-cattle-operation', 'enteric-fermentation-cattle-pasture', 'enteric-fermentation-other', 'fluorinated-gases', 'food-beverage-tobacco', 'forest-land-clearing', 'forest-land-degradation', 'forest-land-fires', 'glass', 'heat-plants', 'incineration-and-open-burning-of-waste', 'industrial-wastewater-treatment-and-discharge', 'international-aviation', 'international-shipping', 'international-shipping-ship', 'iron-and-steel', 'iron-mining', 'lime', 'manure-applied-to-soils', 'manure-left-on-pasture-cattle', 'manure-management-cattle-operation', 'manure-management-other', 'net-forest-land', 'net-shrubgrass', 'net-wetland', 'non-residential-onsite-fuel-usage', 'oil-and-gas-production', 'oil-and-gas-refining', 'oil-and-gas-transport', 'other-agricultural-soil-emissions', 'other-chemicals', 'other-energy-use', 'other-fossil-fuel-operations', 'other-manufacturing', 'other-metals', 'other-mining-quarrying', 'other-onsite-fuel-usage', 'other-transport', 'petrochemical-steam-cracking', 'pulp-and-paper', 'railways', 'removals', 'residential-onsite-fuel-usage', 'rice-cultivation', 'road-transportation', 'road-transportation-road-segment', 'rock-quarrying', 'sand-quarrying', 'shrubgrass-fires', 'soil-organic-carbon', 'solid-fuel-transformation', 'solid-waste-disposal', 'synthetic-fertilizer-application', 'textiles-leather-apparel', 'water-reservoirs', 'wetland-fires', 'wood-and-wood-products'])),
        ('emissions_quantity', Float64),
        ('emissions_quantity_units', Categorical(ordering='physical')),
        ('temporal_granularity',
         Enum(categories=['annual', 'other', 'month', 'week', 'day', 'hour'])),
        ('created_date', Datetime(time_unit='ms', time_zone='UTC')),
        ('modified_date', Datetime(time_unit='ms', time_zone='UTC'))])

Another trick is to define all the DataFrame columns that we are going to manipulate. For example, there is a column called gas that identifies the gas being tabulated. Coming from the Pandas world, it would be normal to refer to this column by its name as a string:

cedf_pdf[cedf_pdf["gas"]=="ch4"]

Polars offers a more succinct syntax by considering an abstract column, for example for the gas:

pl.col("gas")

col("gas")

That col("gas") refers to any dataframe column called gas. The select method of a Polars dataframe takes such expressions as instructions to select specific columns.

cedf.select(pl.col("gas"));

The ctrace library predefines all the columns defined in the Climate TRACE datasets, all prefixed by the c_ prefix:

ct.constants.c_gas

col("gas")

This reduces the chance of making a typo, allows us to use the autocompletion features of editors, and it will catch any mistake before executing the code. Selecting all the data related to CO2 is simply:

# cedf.filter(c_gas == CO2);

Using a different data processing framework

This notebook uses Polars. Are you more familiar with other frameworks such as pandas, PySpark, Modin, DuckDB or R's dataframes? No problem. Polars can convert directly to most of these other representations. Here is an example for pandas below. You can also directly point to the underlying Parquet representation on the Hugging Face Hub of the project.

# Use pandas instead:
cedf_pandas = ct.read_country_emissions(GAS_LIST).to_pandas()
cedf_pandas.head(3)

	iso3_country	start_time	end_time	gas	sector	subsector	emissions_quantity	emissions_quantity_units	temporal_granularity	created_date	modified_date
0	ABW	2015-01-01 00:00:00+00:00	2015-12-31 00:00:00+00:00	co2	fossil-fuel-operations	other-fossil-fuel-operations	0.000000	NaN	annual	NaT	NaT
1	ABW	2015-01-01 00:00:00+00:00	2015-12-31 00:00:00+00:00	co2	mineral-extraction	bauxite-mining	0.000000	NaN	annual	NaT	NaT
2	ABW	2015-01-01 00:00:00+00:00	2015-12-31 00:00:00+00:00	co2	transportation	domestic-shipping	90613.375994	NaN	annual	NaT	NaT

Country checks

In this section, we identify the biggest emitters and sort out a few unexpected insights.

We focus this analysis on the year 2023 and on the aggregated emissions of carbon dioxide (CO2). You can look at how the results change when you focus on another specific gas (co2, ch4, …) or on the global warming potential (co2e_100yr).

gas = CO2 #gas = CO2E_100YR
year = 2023

cedf_gy = cedf.filter(c_gas == gas).filter(c_start_time.dt.year()==year)

First question: how much do we emit? About 50 GT of CO2 (gigatonnes of CO2). Note the first wrinkle already: this figure takes into account absorption by vegetation. If we did not take into account the beneficial action of our friends the trees, our emissions would be even higher. Trees will come back later as a complicated topic.

Here are the net emissions, taking into account all the contributions of the forestry and land use sector: about 50.6 GT of CO2.

cedf_gy.select(c_emissions_quantity.sum())

shape: (1, 1)

emissions_quantity
f64
5.0609e10

Looking at the increase by year, we see that the emissions of CO2 are still increasing.

px.line(
    cedf
        .filter(c_emissions_quantity > 0)
        .group_by(c_start_time.dt.year(), c_gas)
        .agg(c_emissions_quantity.sum())
        .sort(by=c_start_time),
    x="start_time", y="emissions_quantity",
    color="gas")

Most reports do not account for forestry and land use, because it is very hard to measure and because there are still debates on what exactly to report in that category. Excluding this category, we get a total of about 46 GT CO2 equivalent, which is what Climate TRACE reports on its website:

(cedf_gy
 .filter(c_sector != FORESTRY_AND_LAND_USE)
 .select(c_emissions_quantity.sum()))

shape: (1, 1)

emissions_quantity
f64
4.6174e10

For now, we will still include forestry and land use. You will see how much this can skew the figures and why it is at the center of many discussions.

Here is the same total, split between sinks and sources:

(
    cedf_gy
    .select([c_emissions_quantity])
    .with_columns((c_emissions_quantity > 0).alias("is_source"))
    .group_by("is_source")
    .agg(c_emissions_quantity.sum())
    .drop_nulls()
)

shape: (2, 2)

is_source	emissions_quantity
bool	f64
false	-1.5872e10
true	6.6481e10
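As a quick sanity check, the two rows above should add up to the net total computed earlier. The sketch below copies the three values from the outputs and assumes they are expressed in tonnes, which is what makes the division by 1e9 yield gigatonnes.

```python
import math

# Values copied from the outputs shown above (assumed to be in tonnes).
sources = 6.6481e10
sinks = -1.5872e10
net_reported = 5.0609e10

TONNES_PER_GT = 1e9

net = sources + sinks
consistent = math.isclose(net, net_reported, rel_tol=1e-4)

print(f"net = {net / TONNES_PER_GT:.1f} GT, consistent with the total: {consistent}")
```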

Look at the top emitters for all the emissions sources. The usual suspects come at the top (China, USA, Russia). However, it is very important to keep the orders of magnitude in mind: China has more emissions than the next 3 countries (USA, Russia, India) combined.

px.bar(cedf_gy
    .filter(c_emissions_quantity > 0)
    .group_by(c_iso3_country)
    .agg(c_emissions_quantity.sum())
    .sort([c_emissions_quantity], descending=True)
    .head(20)
,x=ISO3_COUNTRY, y=EMISSIONS_QUANTITY,log_y=False)

Looking at each sector quickly gives a more nuanced picture and shows the differences between the various countries.

the Chinese electricity sector emits more than Western Europe combined. This is not to say that Europeans should not make efforts: they should! But these efforts can take many forms, such as incentivizing China to reduce the role of coal in its domestic electricity supply.
the emissions from the oil and gas sector in Russia are bigger than all the emissions of Germany. Cutting these emissions is typically a quick win (fixing leaks on old pipes).
Zimbabwe and Mozambique together have as many gross emissions as Japan! Clearing forested areas is very bad for the climate and has only short-term economic benefits. Finding fairer ways to preserve natural resources would also go a long way towards helping developing countries achieve their goals.
France's economy (GDP) is about twice as big as Russia's, yet it contributes much less to emissions. This view emphasizes direct emissions (where gases are emitted), not the indirect emissions from consumption.

CTODO

Check these conclusions with experts

px.bar(cedf_gy
    .group_by(c_iso3_country, c_subsector)
    .agg(c_emissions_quantity.sum())
    .sort([c_emissions_quantity], descending=True)
    .filter(c_iso3_country.is_in([
        "CHN", "USA", "IND", "RUS","MOZ","FRA", "NLD"]))
,x=ISO3_COUNTRY, y=EMISSIONS_QUANTITY,color=SUBSECTOR,log_y=False)

Source checks

Let’s look now at all the sources tracked by Climate TRACE.

We load the data first.

Note

Technical

We load the data using scan_parquet. This instructs Polars to delay the actual loading in memory until requested. The amount of memory necessary will then be minimal, even if the dataset itself is 4GB.
We reset the schema, in particular we add all the information about enumerations. This provides better type checks and helps Polars with optimizing its queries.

You can see that Polars has not done any processing if you attempt to display the data:

sdf_gy = ct.read_source_emissions(gas=gas, year=year)
sdf_gy

NAIVE QUERY PLAN

run LazyFrame.show_graph() to see the optimized version

How many emissions are released according to the source tracking? This number is significantly higher than what was estimated per country (57 GT CO2 instead of 46 GT CO2).

The forestry and land use category is notoriously hard to estimate, so we leave it aside for now.

(sdf_gy
 .filter(c_sector != FORESTRY_AND_LAND_USE)
 .select(c_emissions_quantity.sum())
 .collect()
)

shape: (1, 1)

emissions_quantity
f64
5.7449e10
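To put the gap between the two views in perspective, the sketch below compares this total with the country-level total obtained earlier with the same exclusion of forestry and land use. Both values are copied from the outputs of this notebook and are assumed to be in tonnes.

```python
# Totals copied from the outputs above, forestry and land use excluded
# (assumed to be in tonnes).
source_view = 5.7449e10
country_view = 4.6174e10

gap_gt = (source_view - country_view) / 1e9
ratio = source_view / country_view

print(f"gap: {gap_gt:.1f} GT, ratio: {ratio:.2f}")
```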

The biggest sources

What are the biggest sources of emissions? Fossil fuel operations (the Permian Basin in Texas, oil fields in Russia) are amongst the largest sources. Agriculture also plays a huge role, especially in Brazil and India.

(sdf_gy
 .filter(c_emissions_quantity > 0)
 .filter(c_sector != FORESTRY_AND_LAND_USE)
 .group_by(c_source_id)
 .agg(c_iso3_country.first(), c_sector.first(), c_subsector.first(), c_emissions_quantity.sum(), c_source_name.first())
 .top_k(50, by=c_emissions_quantity)
 .collect(streaming=True))

shape: (50, 6)

source_id	iso3_country	sector	subsector	emissions_quantity	source_name
u64	enum	enum	enum	f64	str
1241948	"CHN"	"buildings"	"residential-onsite-fuel-usage"	5.9804e8	"China"
1242200	"USA"	"buildings"	"residential-onsite-fuel-usage"	4.6791e8	"United States"
3588448	"RUS"	"fossil-fuel-operations"	"oil-and-gas-transport"	2.5986e8	"Russian Federation_West Siberi…
10720469	"BRA"	"agriculture"	"cropland-fires"	2.2666e8	"Brazil"
10720302	"IND"	"agriculture"	"cropland-fires"	2.1886e8	"India"
…	…	…	…	…	…
3597677	"BRA"	"transportation"	"road-transportation"	4.6394e7	"São Paulo"
3588663	"IRQ"	"fossil-fuel-operations"	"oil-and-gas-production"	4.6277e7	"Iraq_Widyan - North Arabian Gu…
3600747	"USA"	"transportation"	"road-transportation"	4.5956e7	"New York"
3588498	"USA"	"fossil-fuel-operations"	"oil-and-gas-transport"	4.5931e7	"United States_Appalachian_Shal…
3600726	"USA"	"transportation"	"road-transportation"	4.5479e7	"Illinois"

Number of tracked sources

How many sources were tracked last year?

We get more than 1.2 million if the forestry and land use changes are included.

Technical:

Notice that we did not need to ingest the data in a database, and yet this query takes less than a second. Pretty good for a decently large dataset! Try to reproduce this analysis in Pandas and see what happens.

CTODO

The CT map seems to report 1.8M.

(sdf_gy.select(c_source_id.unique_counts())
 .count()
 .collect()
 .item() # Using .item() to have a single number
)

Excluding them, we get the number reported officially by Climate TRACE (748k).

(sdf_gy.filter(c_sector != FORESTRY_AND_LAND_USE)
.select(c_source_id.unique_counts())
 .count()
 .collect()
 .item()
)

Which categories do these records come from?

This shows the full diversity of the sources to consider. The most numerous records concern ships, followed by various treatment plants for cities, forest statistics, etc.

px.bar(sdf_gy
 .select(c_subsector.value_counts(sort=True))
 .collect()
 .unnest(SUBSECTOR),
      x=SUBSECTOR, y="count", log_y=True
)

Looking at a few examples of cement plants, you see how the data is organized for each source:

an identifier, along with country, UNFCCC sectoral categorization, time of data collection
the gas considered
the main methodology to get the emission: typically a certain quantity of interest, called activity (here, tonnes of cement produced), multiplied by an emission factor
more details about the source itself: name, type, position
some extra tabular information (always strings in this dataset)
qualitative confidence values for each of the values

(sdf_gy
 .filter(c_sector == MANUFACTURING)
 .filter(c_subsector == CEMENT)
 .head(3)
 .collect()
)

shape: (3, 55)

source_id	iso3_country	sector	subsector	original_inventory_sector	start_time	end_time	temporal_granularity	gas	emissions_quantity	emissions_factor	emissions_factor_units	capacity	capacity_units	capacity_factor	activity	activity_units	created_date	modified_date	source_name	source_type	lat	lon	other1	other2	other3	other4	other5	other6	other7	other8	other9	other10	other11	other12	other1_def	other2_def	other3_def	other4_def	other5_def	other6_def	other7_def	other8_def	other9_def	other10_def	other11_def	other12_def	geometry_ref	conf_source_type	conf_capacity	conf_capacity_factor	conf_activity	conf_emissions_factor	conf_emissions_quantity	year
u64	enum	enum	enum	str	datetime[μs, UTC]	datetime[μs, UTC]	enum	enum	f64	f64	str	f64	str	f64	f64	str	datetime[μs, UTC]	datetime[μs, UTC]	str	str	f64	f64	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	str	enum	enum	enum	enum	enum	enum	i64
1895560	"CHN"	"manufacturing"	"cement"	null	2023-05-01 00:00:00 UTC	2023-05-31 00:00:00 UTC	"month"	"co2"	44912.138625	0.54	"t of CO2 per t of cement"	212917.0	"t of cement"	0.390625	83170.627083	"t of cement"	2024-08-21 00:00:00 UTC	null	"Anyang Hubo Clinker Co Ltd Hen…	"integrated dry"	36.085841	114.068585	"0.5952506500000001"	"49507.36983219091"	"0.34"	"28278.013208292858"	"0.2"	"16634.125416642855"	"0.091"	"7568.527064572498"	"0.60715"	"satellite"	null	null	"Direct and Indirect emissions …	"Direct and Indirect emissions:…	"Calcination emissions factor (…	"Calcination emissions (t of CO…	"Fuel emissions factor (t of CO…	"Fuel emissions (t of CO2)"	"electricity_use_factor (MWh pe…	"electricity_use (MWh)"	"grid_emissions_intensity (t CO…	"model_methodology (e.g satelli…	null	null	"trace_1895560"	"high"	"high"	"low"	"low"	"low"	"low"	2023
1895560	"CHN"	"manufacturing"	"cement"	null	2023-06-01 00:00:00 UTC	2023-06-30 00:00:00 UTC	"month"	"co2"	105667.939179	0.54	"t of CO2 per t of cement"	212917.0	"t of cement"	0.91905	195681.36885	"t of cement"	2024-08-21 00:00:00 UTC	null	"Anyang Hubo Clinker Co Ltd Hen…	"integrated dry"	36.085841	114.068585	"0.5973491099999999"	"116890.0915261292"	"0.34"	"66531.665409"	"0.1999999999999999"	"39136.27376999999"	"0.091"	"17807.00456535"	"0.63021"	"satellite"	null	null	"Direct and Indirect emissions …	"Direct and Indirect emissions:…	"Calcination emissions factor (…	"Calcination emissions (t of CO…	"Fuel emissions factor (t of CO…	"Fuel emissions (t of CO2)"	"electricity_use_factor (MWh pe…	"electricity_use (MWh)"	"grid_emissions_intensity (t CO…	"model_methodology (e.g satelli…	null	null	"trace_1895560"	"high"	"high"	"low"	"low"	"low"	"low"	2023
1895560	"CHN"	"manufacturing"	"cement"	null	2023-07-01 00:00:00 UTC	2023-07-31 00:00:00 UTC	"month"	"co2"	17822.384777	0.54	"t of CO2 per t of cement"	212917.0	"t of cement"	0.155011	33004.416254	"t of cement"	2024-08-21 00:00:00 UTC	null	"Anyang Hubo Clinker Co Ltd Hen…	"integrated dry"	36.085841	114.068585	"0.59744466"	"19718.312247113456"	"0.34"	"11221.501526214286"	"0.2"	"6600.883250714286"	"0.091"	"3003.401879075"	"0.63126"	"satellite"	null	null	"Direct and Indirect emissions …	"Direct and Indirect emissions:…	"Calcination emissions factor (…	"Calcination emissions (t of CO…	"Fuel emissions factor (t of CO…	"Fuel emissions (t of CO2)"	"electricity_use_factor (MWh pe…	"electricity_use (MWh)"	"grid_emissions_intensity (t CO…	"model_methodology (e.g satelli…	null	null	"trace_1895560"	"high"	"high"	"low"	"low"	"low"	"low"	2023
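The three rows above make it possible to check the relationship between activity, emission factor and emissions. The sketch below copies the displayed values by hand; since they are rounded in the output, the comparison uses a small relative tolerance.

```python
import math

# (activity in t of cement, emissions factor in t of CO2 per t of cement,
#  reported emissions_quantity), copied from the three rows above.
rows = [
    (83170.627083, 0.54, 44912.138625),
    (195681.36885, 0.54, 105667.939179),
    (33004.416254, 0.54, 17822.384777),
]

matches = []
for activity, factor, reported in rows:
    estimated = activity * factor
    matches.append(math.isclose(estimated, reported, rel_tol=1e-6))
    print(f"estimated {estimated:.3f} vs reported {reported:.3f}")
```

For these records, the reported quantity matches the product of the activity and the emissions factor within the rounding of the displayed values.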

Looking now at emissions and gathering by country, we get a plot relatively similar to what we saw in the section above.

Again, France and the Netherlands barely appear in this graph, and Mozambique’s forest management practices have an enormous impact.

px.bar(
    (sdf_gy
     .group_by(c_iso3_country, c_subsector)
     .agg(c_emissions_quantity.sum())
     .sort([c_emissions_quantity], descending=True)
     .filter(c_iso3_country.is_in(["CHN", "USA", "IND", "RUS", "MOZ", "FRA", "NLD"]))
     .collect()
    ),
    x=ISO3_COUNTRY, y=EMISSIONS_QUANTITY, color=SUBSECTOR, log_y=False)

Looking by sector, it is easy to see why the forestry and land use sector is complex: it dominates both removals and emissions.

px.bar(
    (sdf_gy
     .group_by(c_sector, c_subsector)
     .agg(c_emissions_quantity.sum())
     .sort([c_emissions_quantity], descending=True)
     .collect()
    ),
    x=SECTOR, y=EMISSIONS_QUANTITY, color=SUBSECTOR, log_y=False)

Conclusion

In this section, we saw:

how to access the Climate TRACE dataset in a modern data processing tool (Polars)
how to produce high-level statistics per country and per sector

From the data, it should be clear that tackling emissions is a global issue in which a few countries dominate. Some countries with large economic outputs (France, the Netherlands) can have minimal emissions, in part because they have low-emission means of producing electricity and have outsourced their industries. Others with comparatively small economic outputs or populations (e.g. Mozambique) have a disproportionate impact because of the changes occurring in their ecosystems.
