Comparing source and country inventories#

The Climate TRACE dataset provides two sets of inventories:

  • tracking every individual source of emissions

  • using economic output statistics to derive country-level emissions

The second approach is described in more detail in (TODO: ref to the implicit doc).

This chapter compares the two approaches and identifies gaps between their results. We conclude with a Bayesian approach that combines the two estimates into a single, more precise estimate.

%load_ext autoreload
%autoreload 2
import os
os.environ["POLARS_ALLOW_FORKING_THREAD"] = "1" # Suppresses a warning from Polars.
import polars as pl
import plotly.express as px
import plotly.graph_objects as go # Will use more advanced visualization functions

from ctrace.constants import *
import ctrace as ct

We first load the data for the year 2023, focusing only on CO2-equivalent emissions. sdf_gy is the dataframe of source emissions for this gas and year, and cedf_gy is its country-level counterpart.

year = 2023
gas = CO2E_100YR

cedf = ct.read_country_emissions(gas=gas)
sdf = ct.read_source_emissions(gas=gas, year=year)
sdf_gy = sdf.filter(c_gas == gas)
cedf_gy = cedf.filter(c_gas == gas).filter(c_start_time.dt.year() == year)

Comparing the two datasets visually#

Let us first check whether the numbers roughly match for each country and each sector. Are the source-level totals and the country-level aggregations in the same ballpark?

First aggregate the emissions by country, sector and sub-sector:

cedf_gy_agg = (cedf_gy
    .group_by(c_iso3_country, c_sector, c_subsector)
    .agg(c_emissions_quantity.sum())
)

sdf_gy_agg = (sdf_gy
    .group_by(c_iso3_country, c_sector, c_subsector)
    .agg(c_emissions_quantity.sum())
    .collect()
)

The jdf dataframe is the joined dataframe holding both estimates for each country, sector and subsector. Some sectors and subsectors will be missing from one dataframe or the other, depending on what countries have reported.

Technical note: we use a full (outer) join to capture both sides of the data, even when one side is missing.

EMISSIONS_QUANTITY_S = EMISSIONS_QUANTITY + "_s"
EMISSIONS_QUANTITY_CE = EMISSIONS_QUANTITY + "_ce"

jdf = (cedf_gy_agg
       .join(sdf_gy_agg, on=[ISO3_COUNTRY, SECTOR, SUBSECTOR], suffix="_s", how="full")
.with_columns(
    c_iso3_country.fill_null(C(ISO3_COUNTRY+"_s")),
    c_sector.fill_null(C(SECTOR+"_s")),
    c_subsector.fill_null(C(SUBSECTOR+"_s")),
    c_emissions_quantity.alias(EMISSIONS_QUANTITY_CE),
).select(
    c_iso3_country,
    c_sector,
    c_subsector,
    EMISSIONS_QUANTITY_S,
    EMISSIONS_QUANTITY_CE
))
jdf
shape: (16_819, 5)
iso3_country | sector | subsector | emissions_quantity_s | emissions_quantity_ce
enum | enum | enum | f64 | f64
"CHL" | "forestry-and-land-use" | "net-wetland" | -93619.168213 | -32513.23067
"BTN" | "fossil-fuel-operations" | "solid-fuel-transformation" | null | 87816.680736
"BIH" | "agriculture" | "enteric-fermentation-cattle-op… | null | 534531.68
"LTU" | "buildings" | "non-residential-onsite-fuel-us… | 933810.879325 | 6834.952
"SOM" | "forestry-and-land-use" | "removals" | 5.6865e7 | 1.8949e7
"ZNC" | "transportation" | "road-transportation" | 1.4739e6 | null
"ZNC" | "agriculture" | "rice-cultivation" | 0.0 | null
"ZNC" | "buildings" | "residential-onsite-fuel-usage" | 334255.36239 | null
"ZNC" | "agriculture" | "synthetic-fertilizer-applicati… | 2710.094284 | null
"ZNC" | "buildings" | "non-residential-onsite-fuel-us… | 63732.35214 | null

A quick check on where the two datasets differ in missing data. So far so good: apart from a handful of combinations (listed right after the table below), everything accounted for in the source data is also reported at the country level.

(jdf.group_by(
    C(EMISSIONS_QUANTITY_S).is_not_null().alias("present_s"),
    C(EMISSIONS_QUANTITY_CE).is_not_null().alias("present_ce"))
 .agg(c_iso3_country.count())
)
shape: (3, 3)
present_s | present_ce | iso3_country
bool | bool | u32
true | false | 5
false | true | 10109
true | true | 6705
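For reference, here is one way (a quick sketch reusing the column helpers from above) to list the handful of combinations that are present in the source data but absent from the country inventory:

# The few source-only combinations (their country-level value is null).
jdf.filter(
    C(EMISSIONS_QUANTITY_S).is_not_null() & C(EMISSIONS_QUANTITY_CE).is_null()
)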

Another gap: as the totals below show, the data reported at the country level accounts for much less retention (forestry removals) than the source data.

(jdf
 .with_columns(((C(EMISSIONS_QUANTITY_S) > 0) | (C(EMISSIONS_QUANTITY_CE) > 0)).alias("is_emission"))
 .group_by("is_emission")
 .agg(C(EMISSIONS_QUANTITY_S).sum(), C(EMISSIONS_QUANTITY_CE).sum()))
shape: (3, 3)
is_emission | emissions_quantity_s | emissions_quantity_ce
bool | f64 | f64
null | 0.0 | -26.0
true | 1.3367e11 | 8.1590e10
false | -4.7599e10 | -1.5872e10

To compare further, we also add the ratio ("found" in the code below) between the source-derived emissions and the country-derived emissions. If we knew the emissions perfectly, this ratio would be one. However:

  • the sources only capture an observed fraction of all emissions, so the ratio should generally be expected to be below one

  • the emissions are only known to some degree of accuracy; measurement noise makes the ratio fluctuate around one

  • some sectors are not observed at all in some countries, so infinite values are expected

jdf_cmp = (jdf
    .with_columns((C(EMISSIONS_QUANTITY_S).fill_null(0.0)/C(EMISSIONS_QUANTITY_CE).fill_null(0.0)).alias("found"))
)
jdf_cmp.filter(pl.col("found") > 0).filter(pl.col("found") < 1e3)["found"].describe()
shape: (9, 2)
statistic | value
str | f64
"count" | 5675.0
"null_count" | 0.0
"mean" | 2.005608
"std" | 3.611759
"min" | 0.000022
"25%" | 0.999987
"50%" | 2.0
"75%" | 3.003054
"max" | 213.337823

Plotting one set of emissions against the other, what do we see?

First of all, the differences between sources are massive: about six orders of magnitude separate China’s power generation from the UK’s peatland fires.

Also, the two datasets are well aligned for most values. They have been compiled together and should not deviate too much. The country-level dataset may contain slightly more data derived from country statistics, and the source data takes sinks into account.

fig1 = px.scatter(jdf_cmp, x=EMISSIONS_QUANTITY_CE, y=EMISSIONS_QUANTITY_S,
           color=SECTOR,
           hover_name=ISO3_COUNTRY,
           hover_data=[SECTOR, SUBSECTOR])

# Reference identity line (y = x): points on it have equal source and country estimates.
fig2 = px.line(jdf_cmp, x=EMISSIONS_QUANTITY_S, y=EMISSIONS_QUANTITY_S).update_traces(
    line=dict(color="red")
)

fig = go.Figure(data = fig1.data + fig2.data)
(fig
 .update_xaxes(type="log")
 .update_yaxes(type="log")
 .update_layout(xaxis_title=EMISSIONS_QUANTITY_CE, yaxis_title=EMISSIONS_QUANTITY_S))

Combining both estimates with Bayesian analysis#

So far, we have been taking the figures for granted. Can we check if they are consistent with each other? Can we put some uncertainty margins around these numbers?

In this section, I will show how Bayesian analysis can help, with a small example. Our goal: for a given sector, given the report from the sources and the report from the country inventories, can we make a better guess at the true value of the emissions?

We will take values from the manufacturing sector, but the same approach applies to any sector.

import arviz as az
import pymc as pm
import numpy as np
import scipy
import logging
logging.getLogger("matplotlib").setLevel("WARNING")
logging.getLogger("filelock").setLevel("WARNING")
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
subs_df = (jdf_cmp
 .filter(c_sector == "manufacturing")
 .filter((C("found") > 0.1) & (C("found") < 10))
 .drop_nulls()
 .sort(by=EMISSIONS_QUANTITY_S))
subs_df
shape: (585, 6)
iso3_country | sector | subsector | emissions_quantity_s | emissions_quantity_ce | found
enum | enum | enum | f64 | f64 | f64
"NPL" | "manufacturing" | "textiles-leather-apparel" | 8.461971 | 8.461971 | 1.0
"ARE" | "manufacturing" | "textiles-leather-apparel" | 50.013662 | 50.013662 | 1.0
"TCD" | "manufacturing" | "food-beverage-tobacco" | 160.722639 | 482.167917 | 0.333333
"LAO" | "manufacturing" | "textiles-leather-apparel" | 308.114381 | 308.114383 | 1.0
"REU" | "manufacturing" | "food-beverage-tobacco" | 586.494898 | 586.494898 | 1.0
"JPN" | "manufacturing" | "iron-and-steel" | 1.5333e8 | 1.5991e8 | 0.958902
"CHN" | "manufacturing" | "chemicals" | 2.8828e8 | 2.8828e8 | 1.0
"CHN" | "manufacturing" | "lime" | 3.6817e8 | 3.6817e8 | 1.0
"CHN" | "manufacturing" | "cement" | 7.8082e8 | 1.1502e9 | 0.678852
"CHN" | "manufacturing" | "iron-and-steel" | 1.7387e9 | 2.2414e9 | 0.775705

We focus first on finding a good estimate of the true emissions based on what is reported by the sources and by the country inventories. The model is built as follows. Suppose that the true emissions of a given sector have a normalized value of 1 GT CO2e. Emissions are hard to measure in general, so reporting is noisy: a country may measure and report a figure like 0.95 GT. Also, countries may not report all their emissions, either because they do not have the means to collect all the information, or because they have an incentive to under-report the true value, since a higher figure may lead to stronger commitments to reduce emissions in the future. Under-reporting is therefore more likely than over-reporting: a lower number such as 0.8 GT is more likely to be reported than 1.1 GT. It then makes sense to use a skewed statistical distribution that leans towards lower values, that is, an asymmetrical distribution. Finally, emissions from industry should be positive; negative values are unlikely, so the distribution should concentrate on positive values.

Given all these constraints, we can use a Bayesian modeling framework to express such a distribution and compute the posterior numerically, without having to derive exact solutions. A first example of a model is the shifted-flipped Gamma distribution below. We may argue about the exact parameters and the margins of error, but it should give reasonable results as a start.

Let’s assume that this is also the expected shape for the source emissions, for similar reasons:

  • we do not know precisely how much we missed

  • for what we are accounting, there is also a margin of error

This model can be refined based on expert knowledge (we may have further information about how much was missed for a sector).

Here is our definition of the shifted-flipped Gamma distribution. If the true value were 1GT CO2e, we would expect the following distribution of values for source and country emissions:

def shiftedGamma(mean, size):
    # Picking some sensible parameters for the underlying Gamma.
    alpha = 3.0
    beta = 2.0
    # mode = (alpha - 1) / beta  # Alternative: put the observed value on the mode of the distribution.
    mean_ = alpha / beta
    # Flip and shift the Gamma so that the result has mean `mean`, is bounded above by (4/3) * mean,
    # and has a left-skewed tail (under-reporting is more likely than over-reporting).
    return mean * (4 / 3.0 - (1.0 / 3) * (1.0 / mean_) * pm.Gamma.dist(alpha=alpha, beta=beta, size=size))

x = shiftedGamma(mean=1.0, size=1)
xs = np.linspace(0,2,100)
ys = pm.logp(x, xs).eval()
px.line(x=xs,y=np.exp(ys))
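As a quick sanity check (a sketch; pm.draw is PyMC’s helper for drawing samples from a distribution object), we can verify numerically that this distribution has the intended mean, never exceeds 4/3 of it, and is skewed towards lower values:

# Draw samples from the shifted-flipped Gamma with mean 1.0 and inspect a few statistics.
samples = pm.draw(shiftedGamma(mean=1.0, size=100_000), random_seed=0)
print(samples.mean())        # close to 1.0 (the intended mean)
print(samples.max())         # below the upper bound of 4/3
print((samples < 0).mean())  # only a tiny fraction of draws are negative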

Armed with this first model, we define our statistical inference problem: given two observations of the emissions (per country and per source), what is the likely distribution of true emission values?

Here is an implementation of this model in PyMC, a popular library for Bayesian analysis.

basic_model = pm.Model()

# The values of the emissions per source and per country
em_s = np.array(subs_df[EMISSIONS_QUANTITY_S])
em_ce = np.array(subs_df[EMISSIONS_QUANTITY_CE])

# Given the wide range of scales, normalizing them around the source emissions
norm_factor = em_s
em_s_normed = em_s / norm_factor
em_ce_normed = em_ce / norm_factor
start_pt_normed = scipy.stats.gmean([em_s_normed, em_ce_normed], axis=0)


with basic_model:
    # The initial value: the geometric mean of both estimates
    init_est = start_pt_normed
    # The flipped Gamma assigns non-zero probability only below (4/3) * epsi, so we bound the latent
    # estimate from below to keep PyMC from proposing values where the observations have zero probability.
    max_bound = np.maximum(em_s_normed, em_ce_normed)
    epsi = pm.TruncatedNormal("epsi", mu=init_est, sigma=1.5 * init_est, lower=max_bound * 0.75, upper=max_bound * 5.0)
    # TODO: maybe just use gamma distributions directly?
    cp = pm.CustomDist("cp", epsi, dist=shiftedGamma, observed=em_ce_normed)
    ip = pm.CustomDist("ip", epsi, dist=shiftedGamma, observed=em_s_normed)

We run the sampler with 4 chains. The default parameters should work out of the box for such a simple model.

num_chains = 4
with basic_model:
    idata = pm.sample(chains=num_chains)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [epsi]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
# Not plotting, too many variables to plot
# az.plot_trace(idata, combined=True);
summ = az.summary(idata, round_to=2, stat_funcs={"median": lambda z: np.percentile(z, 50)})
summ
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat median
epsi[0] 0.96 0.14 0.79 1.20 0.00 0.00 8367.19 2169.43 1.0 0.93
epsi[1] 0.96 0.13 0.79 1.20 0.00 0.00 10810.64 2664.18 1.0 0.93
epsi[2] 2.81 0.43 2.28 3.54 0.01 0.01 8284.20 2479.78 1.0 2.70
epsi[3] 0.96 0.13 0.78 1.18 0.00 0.00 10373.10 2449.38 1.0 0.93
epsi[4] 0.96 0.13 0.79 1.19 0.00 0.00 8440.14 2113.29 1.0 0.93
... ... ... ... ... ... ... ... ... ... ...
epsi[580] 0.99 0.14 0.81 1.21 0.00 0.00 9936.50 2599.38 1.0 0.96
epsi[581] 0.96 0.14 0.79 1.20 0.00 0.00 11139.89 2420.94 1.0 0.93
epsi[582] 0.96 0.13 0.79 1.19 0.00 0.00 9262.24 2497.90 1.0 0.93
epsi[583] 1.35 0.19 1.12 1.66 0.00 0.00 10163.62 2270.04 1.0 1.31
epsi[584] 1.19 0.17 0.98 1.46 0.00 0.00 9985.11 2698.44 1.0 1.14

585 rows × 10 columns

All the inferred values are then merged back into our dataset.

The Bayesian analysis can also tell us whether the observations are likely to be close to the true value, or whether they are inconsistent with the inferred range of plausible emission values. If an observed value, from the sources or from the country inventory, is flagged as invalid, we can conclude that the combination is inconsistent and that these values should be refined with further analysis.

summ = summ.reset_index()
HDI_LOW = "hdi_3%"
HDI_HIGH = "hdi_97%"
summ["mean"] *= norm_factor
summ["median"] *= norm_factor
summ["sd"] *= norm_factor
summ[HDI_LOW] *= norm_factor
summ[HDI_HIGH] *= norm_factor
summ["s_valid"] = (summ[HDI_LOW] < em_s) & (em_s < summ[HDI_HIGH])
summ["ce_valid"] = (summ[HDI_LOW] < em_ce) & (em_ce < summ[HDI_HIGH])
summ["valid"] = summ["s_valid"] & summ["ce_valid"]
summ["s"] = em_s
summ["ce"] = em_ce
summ[ISO3_COUNTRY] = subs_df[ISO3_COUNTRY]
summ[SECTOR] = subs_df[SECTOR]
summ[SUBSECTOR] = subs_df[SUBSECTOR]
summ[EMISSIONS_QUANTITY_S]=subs_df[EMISSIONS_QUANTITY_S]
summ[EMISSIONS_QUANTITY_CE]=subs_df[EMISSIONS_QUANTITY_CE]
summ
index mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat ... s_valid ce_valid valid s ce iso3_country sector subsector emissions_quantity_s emissions_quantity_ce
0 epsi[0] 8.123492e+00 1.184676e+00 6.684957e+00 1.015437e+01 0.00 0.00 8367.19 2169.43 1.0 ... True True True 8.461971e+00 8.461971e+00 NPL manufacturing textiles-leather-apparel 8.461971e+00 8.461971e+00
1 epsi[1] 4.801312e+01 6.501776e+00 3.951079e+01 6.001639e+01 0.00 0.00 10810.64 2664.18 1.0 ... True True True 5.001366e+01 5.001366e+01 ARE manufacturing textiles-leather-apparel 5.001366e+01 5.001366e+01
2 epsi[2] 4.516306e+02 6.911073e+01 3.664476e+02 5.689581e+02 0.01 0.01 8284.20 2479.78 1.0 ... False True False 1.607226e+02 4.821679e+02 TCD manufacturing food-beverage-tobacco 1.607226e+02 4.821679e+02
3 epsi[3] 2.957898e+02 4.005487e+01 2.403292e+02 3.635750e+02 0.00 0.00 10373.10 2449.38 1.0 ... True True True 3.081144e+02 3.081144e+02 LAO manufacturing textiles-leather-apparel 3.081144e+02 3.081144e+02
4 epsi[4] 5.630351e+02 7.624434e+01 4.633310e+02 6.979289e+02 0.00 0.00 8440.14 2113.29 1.0 ... True True True 5.864949e+02 5.864949e+02 REU manufacturing food-beverage-tobacco 5.864949e+02 5.864949e+02
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
580 epsi[580] 1.518015e+08 2.146688e+07 1.242012e+08 1.855352e+08 0.00 0.00 9936.50 2599.38 1.0 ... True True True 1.533348e+08 1.599067e+08 JPN manufacturing iron-and-steel 1.533348e+08 1.599067e+08
581 epsi[581] 2.767517e+08 4.035962e+07 2.277436e+08 3.459396e+08 0.00 0.00 11139.89 2420.94 1.0 ... True True True 2.882830e+08 2.882830e+08 CHN manufacturing chemicals 2.882830e+08 2.882830e+08
582 epsi[582] 3.534442e+08 4.786223e+07 2.908551e+08 4.381235e+08 0.00 0.00 9262.24 2497.90 1.0 ... True True True 3.681710e+08 3.681710e+08 CHN manufacturing lime 3.681710e+08 3.681710e+08
583 epsi[583] 1.054101e+09 1.483550e+08 8.745137e+08 1.296154e+09 0.00 0.00 10163.62 2270.04 1.0 ... False True False 7.808158e+08 1.150200e+09 CHN manufacturing cement 7.808158e+08 1.150200e+09
584 epsi[584] 2.069005e+09 2.955722e+08 1.703887e+09 2.538443e+09 0.00 0.00 9985.11 2698.44 1.0 ... True True True 1.738660e+09 2.241392e+09 CHN manufacturing iron-and-steel 1.738660e+09 2.241392e+09

585 rows × 21 columns

Here are all the estimated ranges of emission values for the considered sectors.

Looking at the top sector (Chinese manufacturing), our Bayesian analysis infers a median emission of 1.14BT CO2e, with the true value very likely between 0.96BT and 1.4BT (the 94% credible interval). However, the reported source emission is 0.9BT. Based on our model, this value is likely to be inaccurate, and we should explore further whether it can be refined.

Zoom and hover over the data points to see how much the values diverge.

px.scatter(
    summ,y="median",
    color="valid",
    log_y=True,
    error_y = summ[HDI_HIGH]-summ["median"],
    error_y_minus=summ["median"]-summ[HDI_LOW],
    hover_data=[SECTOR,ISO3_COUNTRY,SUBSECTOR,EMISSIONS_QUANTITY_S,EMISSIONS_QUANTITY_CE]
)
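To follow up on these inconsistencies, here is a small sketch that lists the combinations whose observed values fall outside the inferred 94% interval, sorted by the country-reported amount:

# Combinations flagged as inconsistent: candidates for a closer look.
inconsistent = summ[~summ["valid"]][[ISO3_COUNTRY, SUBSECTOR, "s", "ce", HDI_LOW, HDI_HIGH]]
inconsistent.sort_values("ce", ascending=False).head(10)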

Here is another visualization that puts all the data in the same plot. You can see how this simple model shines at producing conservative figures: between two wildly different values it leans towards the higher one, and when they are close it averages them out.

fig = go.Figure()
xs = list(range(len(summ)))
fig.add_trace(
    go.Scatter(
        x=xs,
        y=summ["median"],
        name="estimate",
        mode="markers",
        error_y=dict(
            type="data",
            symmetric=False,
            array=summ[HDI_HIGH] - summ["median"],
            arrayminus=summ["median"] - summ[HDI_LOW],
        ),
    )
)
fig.add_trace(
    go.Scatter(
        x=xs,
        y=summ["s"],
        name="source",
        mode="markers",
    )
)
fig.add_trace(
    go.Scatter(
        x=xs,
        y=summ["ce"],
        name="countries",
        mode="markers",
    )
)
fig.update_yaxes(type="log")

fig.show()

Conclusion#

In this section, we first saw how to visually inspect the differences between emissions estimates, both at the source level and at the country-report level. We identified a couple of sectors in which emissions diverge between the two reporting methods.

Finally, we saw how to use Bayesian analysis to start combining multiple data sources. This example only considers two estimates. As an exercise, see what happens with different distributions, such as Gaussian distributions.
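As a starting point for that exercise, here is a minimal sketch that swaps the custom shifted-flipped Gamma for Gaussian likelihoods around the latent true value; the 20% relative error and the names gaussian_model and idata_gaussian are arbitrary choices for illustration:

with pm.Model() as gaussian_model:
    # Latent true emissions, reusing the normalized observations and starting point from above.
    epsi = pm.TruncatedNormal(
        "epsi", mu=start_pt_normed, sigma=1.5 * start_pt_normed, lower=0.0
    )
    # Gaussian observation models for the country-level and source-level reports,
    # with an assumed 20% relative measurement error.
    pm.Normal("cp", mu=epsi, sigma=0.2 * epsi, observed=em_ce_normed)
    pm.Normal("ip", mu=epsi, sigma=0.2 * epsi, observed=em_s_normed)
    idata_gaussian = pm.sample(chains=num_chains)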