Climate change: study of temperature evolutions¶
By Emile Hazard
Data source: https://www.kaggle.com/datasets/tarunrm09/climate-change-indicators?resource=download
# Database
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns
import geopandas as gpd
import geodatasets
# Polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Clustering
from sklearn.cluster import KMeans
# For reproducibility
RANDOM_STATE = 42
data_path = "data/climate_change_indicators.csv"
db = pd.read_csv(data_path)
Getting familiar with the dataset¶
First let us see what it looks like.
db.head(2)
ObjectId | Country | ISO2 | ISO3 | Indicator | Unit | Source | CTS_Code | CTS_Name | CTS_Full_Descriptor | ... | F2013 | F2014 | F2015 | F2016 | F2017 | F2018 | F2019 | F2020 | F2021 | F2022 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Afghanistan, Islamic Rep. of | AF | AFG | Temperature change with respect to a baseline ... | Degree Celsius | Food and Agriculture Organization of the Unite... | ECCS | Surface Temperature Change | Environment, Climate Change, Climate Indicator... | ... | 1.281 | 0.456 | 1.093 | 1.555 | 1.540 | 1.544 | 0.910 | 0.498 | 1.327 | 2.012 |
1 | 2 | Albania | AL | ALB | Temperature change with respect to a baseline ... | Degree Celsius | Food and Agriculture Organization of the Unite... | ECCS | Surface Temperature Change | Environment, Climate Change, Climate Indicator... | ... | 1.333 | 1.198 | 1.569 | 1.464 | 1.121 | 2.028 | 1.675 | 1.498 | 1.536 | 1.518 |
2 rows × 72 columns
See the list of columns below.
Columns F1961
to F2022
correspond to the temperature difference in the country compared to the average for the period 1951-1980.
db.columns
Index(['ObjectId', 'Country', 'ISO2', 'ISO3', 'Indicator', 'Unit', 'Source', 'CTS_Code', 'CTS_Name', 'CTS_Full_Descriptor', 'F1961', 'F1962', 'F1963', 'F1964', 'F1965', 'F1966', 'F1967', 'F1968', 'F1969', 'F1970', 'F1971', 'F1972', 'F1973', 'F1974', 'F1975', 'F1976', 'F1977', 'F1978', 'F1979', 'F1980', 'F1981', 'F1982', 'F1983', 'F1984', 'F1985', 'F1986', 'F1987', 'F1988', 'F1989', 'F1990', 'F1991', 'F1992', 'F1993', 'F1994', 'F1995', 'F1996', 'F1997', 'F1998', 'F1999', 'F2000', 'F2001', 'F2002', 'F2003', 'F2004', 'F2005', 'F2006', 'F2007', 'F2008', 'F2009', 'F2010', 'F2011', 'F2012', 'F2013', 'F2014', 'F2015', 'F2016', 'F2017', 'F2018', 'F2019', 'F2020', 'F2021', 'F2022'], dtype='object')
First plot¶
The next thing we can do is plot the evolution for a given country. We use the following function to isolate the temperatures for a country in a 1D array.
def get_temps_country(country, db = db):
return db.loc[db['Country'] == country, 'F1961':'F2022'].to_numpy().reshape(-1,1)
country = 'United Kingdom'
plt.plot(range(1961, 2023), get_temps_country(country))
plt.show()
Plotting all countries might let us see if a general trend can be observed. I add the mean of all temperatures each year for reference. Note that this mean does not represent any climatic notion, as it is not weighted by the size of each country. It is only here to ease the plot reading.
for country in db['Country']:
temps = get_temps_country(country)
plt.scatter(range(1961, 2023), temps, alpha = 0.1, c = temps, cmap = 'rainbow')
plt.plot(range(1961, 2023), np.mean(db.loc[:,'F1961':'F2022'], axis = 0), c ='black')
plt.xlabel("Year")
plt.ylabel("Relative temperature (°C)")
plt.title("Average temperature in world countries per year, relative to the mean for 1951-1980")
plt.show()
We can already observe the "redshift" in this plot, indicating a global change towards higher temperatures.
It also seems like the extreme readings are more frequent towards the right of the plot. We can test this by plotting the variance for each year. More precisely, we will plot the variance for each 5-years interval, to increase the size of the set and to ensure that a country is also compared with itself.
l = []
for i in range(1961, 2019):
temps_5_years = db.loc[:,'F' + str(i) : 'F' + str(i+4)].to_numpy()
l.append(np.nanvar(temps_5_years.flatten()))
plt.plot(range(1963, 2021), l)
plt.xlabel("Year")
plt.ylabel("Variance in temperature")
plt.title("Variance in temperature in 5-year intervals over all couries")
plt.show()
We do observe a general rise of that variance, which illustrates the idea of climate deregulation: this is not simply a matter of global warming, but also a question of balance that might be compromised.
Machine learning¶
Polynomial regression¶
To start learning, we should clean the database from NaN
values.
Let us first isolate properly the parameters.
X_train = np.tile(np.arange(1961, 2023), db.shape[0])
y_train = db.loc[:,'F1961':'F2022'].to_numpy().flatten()
# Getting the indices of NaN values
nanIndex = np.argwhere(np.isnan(y_train))
#Removing the corresponding entries
X_train = np.delete(X_train, nanIndex)
y_train = np.delete(y_train, nanIndex)
Coming back to the first plot, we can try to fit a polynomial to it.
poly = PolynomialFeatures(degree = 2)
X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
poly_reg_model = LinearRegression()
poly_reg_model.fit(X_train_poly, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
Here is the result above the first plot. Change the value of last_year_predict
to see prevision for future years with this model.
last_year_predict = 2050
X_predict = np.arange(1961, last_year_predict + 1)
X_predict_poly = poly.fit_transform(X_predict.reshape(-1,1))
y_predict = poly_reg_model.predict(X_predict_poly)
plt.plot(X_predict, y_predict, c ='black')
plt.axhline(2.5, c = 'red', label = "2.5°C")
plt.axhline(2, c = 'orange', label = "2°C")
plt.axhline(1.5, c = 'green', label = "1.5°C")
for country in db['Country']:
temps = get_temps_country(country)
plt.scatter(range(1961, 2023), temps, alpha = 0.1, c = temps, cmap = 'rainbow')
plt.xlabel("Year")
plt.ylabel("Relative temperature (°C)")
plt.title("Average temperature in world countries per year, relative to the mean for 1951-1980")
plt.legend(loc = 'upper left')
plt.show()
According to this model, it seems that the Paris Agreement objective of keeping the global warming below 1.5°C above pre-industrial levels has basically already failed (green line).
Note that we are not comparing to pre-industrial levels here, but to the 1951-1980 period, which is a higher base level. However, this result should be taken carefully, as the model was trained on unweighted data: small and large countries contributed the same amount to the training. The next step to make this more robust would be to use a database from a regular grid of measurements. More complex models place the current temperature around 1°C above the pre-industrial era (https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature, see also https://climateprimer.mit.edu/ for a good overall presentation of climate science and modeling).
One last remark: when increasing the degree of the polynomial, we observe a decline after 2020, but since the concavity only starts after the present day, this should be considered as an artefact of the model rather than a likely prediction.
Clustering: evolution in 80s and last decade¶
This database can also help us compare, for each country, the difference between the eighties and the last decade. This is inspired from Lucas_dataartist.
Warning : Changing the value of RANDOM_STATE
(42) would change the resulting clusters and their colours, and therefore make the comments outdated.
d = {'Country': db['Country'],
'ISO3': db['ISO3'],
"First decade": db.loc[:, 'F1980':'F1989'].mean(axis = 1),
"Last decade": db.loc[:, 'F2013':'F2022'].mean(axis = 1)}
db_tempdiff = pd.DataFrame(d)
db_tempdiff.head(2)
Country | ISO3 | First decade | Last decade | |
---|---|---|---|---|
0 | Afghanistan, Islamic Rep. of | AFG | 0.2293 | 1.2216 |
1 | Albania | ALB | -0.0326 | 1.4940 |
Removing any NaN
value:
db_tempdiff = db_tempdiff.dropna()
db_tempdiff.plot.scatter("First decade", "Last decade")
plt.show()
We can now try to cluster these.
Let us first try with 4 clusters to split the plot into upper, lower, rightmost and leftmost parts.
Note that this is not the recommendation from the elbow method, but it makes sense to try these four for a more detailed classification. To try the elbow method nonetheless, change the value of do_elbow_method
below.
do_elbow_method = False # Change to True to check results
if do_elbow_method:
list_loss = []
for k in range(1,6):
kmeans = KMeans(n_clusters = k, n_init = 10, random_state = RANDOM_STATE)
kmeans.fit(db_tempdiff2.loc[:, ["First decade", "Last decade"]])
list_loss.append(kmeans.inertia_)
plt.plot(range(1,6), list_loss, marker = 'o', linestyle = '-')
plt.xlabel('Number of clusters')
plt.ylabel('Loss')
plt.title('Elbow method')
plt.grid(True)
plt.show()
kmeans = KMeans(n_clusters = 4, n_init = 10, random_state = RANDOM_STATE)
db_tempdiff['Cluster'] = kmeans.fit_predict(db_tempdiff.loc[:, ["First decade", "Last decade"]])
db_tempdiff.head(2)
Country | ISO3 | First decade | Last decade | Cluster | |
---|---|---|---|---|---|
0 | Afghanistan, Islamic Rep. of | AFG | 0.2293 | 1.2216 | 2 |
1 | Albania | ALB | -0.0326 | 1.4940 | 3 |
sns.scatterplot(db_tempdiff, x = "First decade", y = "Last decade", hue = 'Cluster', palette = ['red','blue','orange','green'], alpha = 0.3, s = 20)
sns.scatterplot(x = kmeans.cluster_centers_[:, 0], y = kmeans.cluster_centers_[:, 1], c = ['red','blue','orange','green'], s = 50, marker = 'x', linewidth=3)
plt.legend().remove()
plt.show()
What does the cluster mean for a country temperature:
- Blue cluster: low rise
- Green cluster: medium rise, recently
- Orange cluster: medium rise, started earlier
- Red cluster: high rise
This can be summed up in a map.
(Source for mapping: https://www.naturalearthdata.com/downloads/110m-cultural-vectors/, gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
is deprecated).
worldmap_path = "../tools/ne_110m_admin_0_countries.zip"
world = gpd.read_file(worldmap_path)
# Other (deprecated) method:
#world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world.head(2)
featurecla | scalerank | LABELRANK | SOVEREIGNT | SOV_A3 | ADM0_DIF | LEVEL | TYPE | TLC | ADMIN | ... | FCLASS_TR | FCLASS_ID | FCLASS_PL | FCLASS_GR | FCLASS_IT | FCLASS_NL | FCLASS_SE | FCLASS_BD | FCLASS_UA | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Admin-0 country | 1 | 6 | Fiji | FJI | 0 | 2 | Sovereign country | 1 | Fiji | ... | None | None | None | None | None | None | None | None | None | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
1 | Admin-0 country | 1 | 3 | United Republic of Tanzania | TZA | 0 | 2 | Sovereign country | 1 | United Republic of Tanzania | ... | None | None | None | None | None | None | None | None | None | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
2 rows × 169 columns
# Replace 'ADM0_A3' with 'iso_a3' if using the deprecated map: gpd.read_file(gpd.datasets...)
world_clusters = pd.merge(world.rename(columns = {'ADM0_A3' : 'ISO3'}), db_tempdiff.loc[:, ['ISO3', 'Cluster']], how = 'left')
# If we want int for clusters
world_clusters['Cluster'] = world_clusters['Cluster'].fillna(4).astype('int')
colors = ['red','blue','orange','green', 'gray']
names = ['High','Low','Medium early','Medium late', 'Missing data']
world_clusters['Cluster_name'] = world_clusters['Cluster'].map(lambda x : names[x], na_action = 'ignore')
world_clusters['Color'] = world_clusters['Cluster'].map(lambda x : colors[x], na_action = 'ignore')
world_clusters.tail(2)
featurecla | scalerank | LABELRANK | SOVEREIGNT | SOV_A3 | ADM0_DIF | LEVEL | TYPE | TLC | ADMIN | ... | FCLASS_GR | FCLASS_IT | FCLASS_NL | FCLASS_SE | FCLASS_BD | FCLASS_UA | geometry | Cluster | Cluster_name | Color | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
175 | Admin-0 country | 1 | 5 | Trinidad and Tobago | TTO | 0 | 2 | Sovereign country | 1 | Trinidad and Tobago | ... | None | None | None | None | None | None | POLYGON ((-61.68000 10.76000, -61.10500 10.890... | 1 | Low | blue |
176 | Admin-0 country | 1 | 3 | South Sudan | SDS | 0 | 2 | Sovereign country | 1 | South Sudan | ... | None | None | None | None | None | None | POLYGON ((30.83385 3.50917, 29.95350 4.17370, ... | 4 | Missing data | gray |
2 rows × 172 columns
fig, ax = plt.subplots(figsize = (12,4))
world_clusters.plot(color = world_clusters['Color'], ax = ax)
colors = ['red','orange','green','blue', 'gray']
names = ['High','Medium early','Medium late','Low', 'Missing data']
custom_points = [Line2D([0], [0], marker="o", linestyle="none", markersize=6, color=color) for color in colors]
leg_points = ax.legend(custom_points, names, loc = 'lower left')
ax.add_artist(leg_points)
ax.set_axis_off()
plt.show()
There is a significant amount of missing data, but northern countries seem generally more affected than southern ones. The following section investigates this on other data points.
Temperature evolution over the last few decades¶
For each country, we compare the average temperatures in 1975 (start
) and 2018 (end
) by averaging over the nine surrounding years.
start = 1975 # min 1965
end = 2018 # max 2018
d = {'ISO3': db['ISO3'],
'Evolution': np.mean(db.loc[:, 'F'+str(end-4):'F'+str(end+4)], axis = 1) - np.mean(db.loc[:, 'F'+str(start-4):'F'+str(start + 4)], axis = 1)}
# As before, replace 'ADM0_A3' with 'iso_a3' if using the deprecated map
world_evol = pd.merge(world.rename(columns = {'ADM0_A3' : 'ISO3'}), pd.DataFrame(d), how = 'left')
fig, ax = plt.subplots(figsize = (12,4))
world_evol.plot('Evolution', cmap = 'coolwarm', legend = True, ax = ax, missing_kwds={'color': 'grey'})
ax.set_axis_off()
ax.set_title(f"Temperature gain since {start} (averaged over a decade)")
plt.show()
It seems that the highest rise in temperature is observed in the Northern Hemisphere.
Note that this does not mean that these are the countries that suffer the most from climate change: this is only about temperature changes, and does not include any other consequence of climate change. During COP27, the countries most vulnerable to climate change were listed and the top 10 are actually all in Africa, Middle-East or South Asia (https://www.iberdrola.com/sustainability/top-countries-most-affected-by-climate-change).