Predicting the Life Expectancy of a Country or Population
How can we predict the life expectancy of a country? What combination of factors determines how long a group of people will live? Many different factors contribute to the life expectancy of a population, including lifestyle, economic stability, access to medical care, access to a healthy diet, work-life balance, social influences, and many other nuances of life.
This project explores a dataset provided on Kaggle to answer the question: what is the impact of immunization-related, mortality, economic, and social factors on life expectancy?
Background Information: Columns/Features Names and Meaning
- Country: the country the data was collected from
- Year: the year the data was collected
- Status: whether the country is “Developed” or “Developing”
- Life Expectancy: life expectancy in years
- Infant Deaths: number of infant deaths per 1000 population
- Alcohol: recorded per capita alcohol consumption, in liters of pure alcohol
- Adult Mortality: adult mortality rate for both sexes (probability of dying between 15 and 60 years per 1000 population)
- Percentage Expenditure: expenditure on health as a percentage of Gross Domestic Product per capita (%)
- Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
- Measles: number of reported measles cases per 1000 population
- BMI: average Body Mass Index of the entire population
- Under-Five Deaths: number of under-five deaths per 1000 population
- Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
- Total Expenditure: general government expenditure on health as a percentage of total government expenditure (%)
- Diphtheria: diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
- HIV/AIDS: deaths per 1000 live births from HIV/AIDS (ages 0-4)
- GDP: Gross Domestic Product per capita (in USD)
- Population: population of the country
- Thinness 10-19 years: prevalence of thinness among children and adolescents aged 10 to 19 (%)
- Thinness 5-9 years: prevalence of thinness among children aged 5 to 9 (%)
- Income Composition of Resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
- Schooling: number of years of schooling
Step 1: Exploratory Data Analysis
A pandas profile report of the dataset shows the features with missing values, high-cardinality features, and the features with a lot of zeros. Below is a sketch of how such a report can be generated, followed by an example of the information learned from the profile report.
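As a minimal sketch, a profile report can be generated like this, assuming the ydata-profiling package (formerly pandas-profiling) and a local copy of the Kaggle CSV; the file path is an assumption.

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw Kaggle dataset (path is an assumption).
df = pd.read_csv('Life Expectancy Data.csv')

# Build an interactive HTML report summarizing missing values,
# cardinality, zeros, and distributions for every column.
profile = ProfileReport(df, title='Life Expectancy Profile Report')
profile.to_file('life_expectancy_profile.html')
```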

Step 2: Choose evaluation metrics and target for dataset.
The next step is to choose what the target will be and the evaluation metrics to use for measuring how well the models are predicting the target.
The purpose of this project is to use economic, social, and health factors to predict life expectancy, so the target for this project is “Life Expectancy”.
This is a regression problem because the target is a continuous variable rather than a set of discrete classes. For a regression problem, an appropriate metric has to be chosen.
The metrics chosen for this project are Mean absolute error and R2 Score.
Mean absolute error measures the average absolute difference between the predicted values and the actual values. A lower mean absolute error is better.
The R2 score tells how much of the variance in the dependent variable (the target) is explained by the independent variables (the features) through the model; in other words, how well the features account for the target. It typically falls between 0 and 1, and a higher score means the model explains more of the variance in the target. Higher scores are better.
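As a small illustration of the two metrics, here is a sketch using scikit-learn; the actual and predicted values below are made up for demonstration only.

```python
from sklearn.metrics import mean_absolute_error, r2_score

y_true = [71.0, 65.3, 80.1, 59.8]   # hypothetical actual life expectancies
y_pred = [69.5, 66.0, 78.9, 62.0]   # hypothetical model predictions

# Average absolute error, in years of life expectancy.
print('MAE:', mean_absolute_error(y_true, y_pred))

# Share of the target's variance explained by the predictions.
print('R2 :', r2_score(y_true, y_pred))
```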
Step 3: Establishing a Baseline
A baseline could be considered a guess. Basically, if we didn’t have the model and just had the dataset, what would we guess is the predicted life expectancy and what would be the mean absolute error of that guess? A baseline needs to be established so that we will know what to compare our models to.
One of the purposes of building a model is to create an algorithm that can make better predictions than guessing. If the model doesn’t outperform the baseline, then there is no point in using the model. The goal is to beat the baseline, and we can’t know what to beat unless we establish a baseline to begin with.
For a regression problem, the mean of the target can be used as the baseline. A sketch of computing this baseline is shown below, followed by the baselines for this project.
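A minimal sketch of a mean baseline: predict the training-set mean for every row and measure the error of that guess. The `train` dataframe and the snake_case column name are assumptions based on the text.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Guess the mean of the target for every row.
baseline_guess = train['life_expectancy'].mean()
baseline_preds = np.full(len(train), baseline_guess)

print('Baseline guess (mean life expectancy):', baseline_guess)
print('Baseline MAE:', mean_absolute_error(train['life_expectancy'], baseline_preds))
```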

Step 4: Split dataset into train, validation, and test sets
The dataset was split into three sets: a training set for training the models, a validation set for checking how well the models predict unseen data, and a test set held out until the very end. The best model is used on the test set only once, at the end. This three-way split is done to avoid evaluating the model on data it has already seen.
The objective is not to see how well the model performs on the data it was trained on. The objective is to build a model that doesn’t overfit the training data, doesn’t underfit the dataset, and can generalize by making predictions on data it hasn’t previously seen. Overfitting is when the model is too tuned to the training data and can’t generalize well to data it hasn’t seen before. Underfitting is when the model isn’t able to establish a relationship between the features and the target.
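A minimal sketch of such a three-way split with scikit-learn, using two calls to train_test_split. The exact proportions are an assumption, since the original split sizes are not stated.

```python
from sklearn.model_selection import train_test_split

# First carve off a held-out test set, then split the rest into train/validation.
train_val, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

print(train.shape, val.shape, test.shape)
```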
Step 5: Data Wrangling
Data wrangling is a fancy term for cleaning up the data before putting it into the model. Garbage in, garbage out: if the data is not properly cleaned and prepared for the model, the model will not perform well. A few data cleaning procedures were taken to prepare this dataset for the models. Some procedures used to clean the data were:
- Removing data leakage features
- Dropping some rows with missing values
- Filling in 0’s for some missing values
- Renaming the column titles
Removing data leakage features.
Data leakage is any feature in the dataset that helps the model cheat: a feature that already contains the answer to the question, or that carries information from the future. Data leakage is like giving the model the answer, directly or indirectly, before asking it to predict. It results in overly optimistic models and inaccurate predictions. These are the data leakage features in this dataset and why they are considered data leakage features.
Income Composition of Resources: this feature is an index built from multiple other features, and it already contains information about life expectancy, which is what is being predicted. Wikipedia describes the underlying index as follows: “The Human Development Index (HDI) is a statistic composite index of life expectancy, education, and per capita income indicators, which are used to rank countries into four tiers of human development. A country scores a higher HDI when the lifespan is higher, the education level is higher, and the gross national income GNI (PPP) per capita is higher.”
Percentage Expenditure: this feature is composed of GDP data and total expenditure data. GDP stands for Gross Domestic Product, total expenditure is general government expenditure on health as a percentage of total government expenditure (%), and percentage expenditure is expenditure on health as a percentage of Gross Domestic Product per capita (%).
Infant Mortality could arguably be represented by the feature “under five deaths”. So, it could be considered duplicated information.
Under Five Deaths and Adult Mortality: adult mortality is “the number of people that die between ages 15 and 60 years old per 1000 people.” If this feature and under_five_deaths are in the predictive model, then the insight would be “in countries where more people die young, people don’t live as long on average.” That leaks information about when people die into the model. The model shouldn’t already know at what age people are dying; it is supposed to use factors other than when people are dying to predict life expectancy.
These features, income composition of resources, percentage expenditure, infant mortality, under five deaths, and adult mortality, were dropped from the dataset.
Dealing with missing data and zeros.
A lot of GDP and population data was missing for some countries. Due to the nature of the dataset, imputation with a mean or median strategy cannot be used: it would take population and GDP information derived from other countries and fill it in for a different country, making those feature values inaccurate.
There are two possible options to deal with this missing data:
- Meticulously look up the information for the countries with missing data, but that would consume a lot of time, energy, and resources.
- Drop rows with missing GDP and population information. That would leave the dataset with 145 countries instead of 193, and 75% of the total data. By doing this, 25% of the data will be lost, but the dataset is big enough that losing 25% of the data won’t make much of a difference.
For this dataset, option 2 was chosen. Rows with missing GDP and Population data were dropped.
For other features with missing values, like BMI, thinness, and hepatitis B, zeros were imputed. If values are missing for those features, 0 is a reasonably safe assumption and won’t skew the data or disrupt the interpretation of the data too much.
XGBoost can handle NaN values, but other models might struggle with them. Since imputation by mean or median is not a viable option due to the nature of the dataset, zeros were used in place of NaNs.
The country and year columns were dropped because the project is not trying to track how the life expectancy of a country changes from year to year; the country and year the data originated from should not be predictors of life expectancy. Country is also a high-cardinality feature and won’t have any useful impact on life expectancy prediction.
The objective is to use health, economic, and social factors to predict life expectancy. Having data from different countries over the years allows for diversity of information, but the specific countries and years the data was collected from don’t have any meaning for this particular project.
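Putting the wrangling steps together, here is a sketch of a cleaning function covering the procedures described above. The snake_case column names, the renaming rule, and the exact drop list are assumptions; the original code is not shown in the post.

```python
import pandas as pd

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw life expectancy dataframe (sketch, not the original code)."""
    df = df.copy()

    # Rename column titles to snake_case (assumed convention).
    df.columns = (df.columns.str.strip().str.lower()
                  .str.replace(' ', '_').str.replace('/', '_'))

    # Drop data-leakage / duplicated features described above.
    leakage = ['income_composition_of_resources', 'percentage_expenditure',
               'infant_deaths', 'under-five_deaths', 'adult_mortality']
    df = df.drop(columns=leakage, errors='ignore')

    # Drop rows missing GDP or population (imputation would mix countries).
    df = df.dropna(subset=['gdp', 'population'])

    # Fill remaining missing values (BMI, thinness, hepatitis B, ...) with zeros.
    df = df.fillna(0)

    # Country and year are not used as predictors.
    df = df.drop(columns=['country', 'year'], errors='ignore')

    return df
```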
Step 6: Build and experiment with different models and choose the best model for this dataset.
The training, validation, and test datasets were each split into features (X) and target (y), with life expectancy as the target (y) and the features (X) comprising every column except the dropped columns and the target.
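A short sketch of that split, assuming the cleaned `train`, `val`, and `test` dataframes and the snake_case target name from the wrangling sketch above.

```python
target = 'life_expectancy'

# Separate features (X) from the target (y) in each split.
X_train, y_train = train.drop(columns=[target]), train[target]
X_val,   y_val   = val.drop(columns=[target]),   val[target]
X_test,  y_test  = test.drop(columns=[target]),  test[target]
```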
For this regression problem, different regression models were tested to determine which model gives the best results. These are the models tested and their results.
| Model Tested | Parameters | Encoding | Results Using Validation Data |
| --- | --- | --- | --- |
| Linear Regression | Standard/default | One Hot Encoding | Mean absolute error: 3.35, R2 score: 0.79 |
| Ridge Regression | alpha=10, default | One Hot Encoding | Mean absolute error: 3.35, R2 score: 0.79 |
| Decision Tree Regressor | random_state=42, default | Ordinal Encoding | Mean absolute error: 2.14, R2 score: 0.86 |
| Random Forest Regressor | random_state=42, n_estimators=100 | Ordinal Encoding | Mean absolute error: 1.5, R2 score: 0.95 |
| Gradient Boosting Regressor | random_state=42, n_estimators=100 | Ordinal Encoding | Mean absolute error: 2.09, R2 score: 0.91 |
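For reference, a minimal sketch of how such a comparison might be run, assuming the X/y splits above and the category_encoders package for encoding the categorical 'status' column. The original training code isn't shown in the post, so the details here are assumptions; the scores will vary.

```python
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

models = {
    'Linear Regression': make_pipeline(ce.OneHotEncoder(use_cat_names=True), LinearRegression()),
    'Ridge Regression': make_pipeline(ce.OneHotEncoder(use_cat_names=True), Ridge(alpha=10)),
    'Decision Tree': make_pipeline(ce.OrdinalEncoder(), DecisionTreeRegressor(random_state=42)),
    'Random Forest': make_pipeline(ce.OrdinalEncoder(),
                                   RandomForestRegressor(n_estimators=100, random_state=42)),
    'Gradient Boosting': make_pipeline(ce.OrdinalEncoder(),
                                       GradientBoostingRegressor(n_estimators=100, random_state=42)),
}

# Fit each pipeline on the training set and score it on the validation set.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    print(f'{name}: MAE={mean_absolute_error(y_val, preds):.2f}, '
          f'R2={r2_score(y_val, preds):.2f}')
```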
All the models tested beat the baseline scores, but the best performing model in this experiment is the Random Forest Regressor. It best captured the relationship between the selected features and the target, giving the highest R2 score, and it produced the lowest average error between the actual and predicted values, giving the lowest mean absolute error.
Step 7: Explore various model coefficients and feature importances.
The raw scores presented above don’t tell the whole story. What if we could get insight into how each feature is used in building the model and constructing the prediction? That is where feature importances and coefficients come in.
The graphs below show the coefficients that have a positive or negative impact on the predicted value of life expectancy, and the feature importance graphs show which features have the most impact on the predicted value. Linear and ridge regression have coefficients, but tree-based models don’t; instead, they have something called feature importances.
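As a sketch of how these plots can be produced, assuming the fitted pipelines from the comparison sketch above (the plotting code in the original post is not shown, so names here are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Linear regression coefficients, matched to the one-hot encoded column names.
linreg_pipe = models['Linear Regression']
encoded = linreg_pipe.named_steps['onehotencoder'].transform(X_train)
coefficients = pd.Series(linreg_pipe.named_steps['linearregression'].coef_, encoded.columns)
coefficients.sort_values().plot.barh(figsize=(8, 8), title='Linear Regression Coefficients')
plt.show()

# Random forest feature importances (ordinal encoding keeps the original column names).
rf_pipe = models['Random Forest']
importances = pd.Series(rf_pipe.named_steps['randomforestregressor'].feature_importances_,
                        X_train.columns)
importances.sort_values().plot.barh(figsize=(8, 8), title='Random Forest Feature Importances')
plt.show()
```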
Linear Regression Coefficients plotted on a bar graph to visualize how each feature positively or negatively impacts the predicted value of life expectancy. Some features contribute positively and others negatively.

Random forest feature importances plotted on a bar graph to visualize the impact of each feature on the predicted value. It shows us which features the model is using to make its decisions.

Ridge Regression coefficients plotted on a bar graph to visualize how each feature positively or negatively impacts the predicted value of life expectancy, similar to the Linear Regression coefficients.

Decision tree feature importances plotted on a bar graph to visualize the impact of each feature on the predicted value. It shows us which features the model is using to make its decisions.

Gradient boosting feature importances plotted on a bar graph to visualize the impact of each feature on the predicted value. It shows us which features the model is using to make its decisions.

Step 8: Use permutation importance to help explain the results of the model
From this point forward, the focus will be on the best performing model, the Random Forest model. Feature importances are helpful for quickly visualizing the most impactful features, but they can be deceptive: they don’t properly represent the scale and magnitude of each feature’s contribution. That is where permutation importances come in.
Permutation importances show how much the R2 score decreases when the feature of interest is taken out of play (its values randomly shuffled). The magnitude of the decrease gives insight into how much each feature impacts the model, i.e. how much weight each feature carries. This is slightly different from feature importances: feature importances tell us which features are significant, but not how significant, while permutation importances tell us the significance level of each feature.
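Here is a sketch of computing permutation importances with the eli5 library (whose output shows weights with a ± uncertainty, matching the description below), assuming the fitted random forest pipeline from the comparison sketch; the original code is not shown, so this setup is an assumption.

```python
import eli5
from eli5.sklearn import PermutationImportance

# Pull the fitted encoder and regressor out of the pipeline.
encoder = models['Random Forest'].named_steps['ordinalencoder']
rf = models['Random Forest'].named_steps['randomforestregressor']

# Compute permutation importances on the encoded validation set,
# using R2 as the scoring metric.
X_val_encoded = encoder.transform(X_val)
perm = PermutationImportance(rf, scoring='r2', random_state=42).fit(X_val_encoded, y_val)

# Display the weights with feature names (renders in a notebook).
eli5.show_weights(perm, feature_names=X_val_encoded.columns.tolist())
```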
These are the permutation importance weights from the random forest model.

This is how to read the permutation importances:
- The absence of Death by HIV AIDs data causes a 56% drop in R2 score,
- The absence of Number of years in school data causes a 23% drop in R2 score,
- The absence of Body Mass Index (BMI) data causes a 6% drop in R2 score, etc.
The ± number shows the standard deviation of each feature’s importance, i.e. the uncertainty in each weight estimate.
This technique shows that deaths from HIV/AIDS is the most important feature, followed by number of years in school, then body mass index, and so forth. These permutation importance results are consistent with the random forest feature importance visualization above, with deaths from HIV/AIDS as the main predictor, followed by schooling and BMI.
Step 9: Use single-feature and two-feature Partial Dependence Plots (PDP) to help explain the random forest model.
Partial dependence plots help us explain how the predicted value partially depends on different features. They tell us in which direction each feature influences the predicted values, and they can also reveal non-monotonic relationships between a feature and the predicted value.
These are the single-feature partial dependence plots from this dataset. Single-feature dependence plots explore the relationship between one feature and the predicted value.
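The original post may have used a dedicated PDP library; as a sketch, equivalent single- and two-feature plots can be produced with scikit-learn’s PartialDependenceDisplay. The fitted pipeline and the snake_case column names below are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Single-feature partial dependence for the three most important features.
PartialDependenceDisplay.from_estimator(
    models['Random Forest'], X_val, features=['hiv_aids', 'schooling', 'bmi'])
plt.show()

# Two-feature (interaction) partial dependence for HIV/AIDS deaths and schooling.
PartialDependenceDisplay.from_estimator(
    models['Random Forest'], X_val, features=[('hiv_aids', 'schooling')])
plt.show()
```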
This single-feature partial dependence plot from the random forest model shows how life expectancy is partially dependent on the number of deaths from HIV/AIDS per 1000 live births. There appears to be a negative, monotonic relationship: the more people die from HIV/AIDS between the ages of 0 and 4 per 1000 live births, the lower the life expectancy.

Below is a scatter plot exploring the relationship between deaths from HIV/AIDS and life expectancy. This 2D graph of the raw data, without the random forest model, shows the same trend as the pattern the model discovered. Correlation is not causation: deaths from HIV/AIDS are not necessarily the cause of lower life expectancy, but rather a contributing factor.
Hover over the graph to explore it more.
This single-feature partial dependence plot from the random forest model shows how life expectancy is partially dependent on the number of years spent in school. There appears to be a positive relationship between life expectancy and years of schooling that eventually levels off: higher education levels could mean higher life expectancy up to a certain point, after which more education has no additional effect on life expectancy.

Below is a scatter plot exploring the relationship between education level and life expectancy. This 2D graph of the raw data, without the random forest model, shows the same trend as the pattern the model discovered. Correlation is not causation: level of education is not necessarily the cause of higher or lower life expectancy, but rather a contributing factor. Hover over the graph to explore it more.
This single feature partial dependence plot from random forest model is showing how life expectancy is partially dependent on the Body Mass Index(BMI). There appears to be a non-monotonic relationship between life expectancy and Body Mass Index. Initially, BMI has no impact on life expectancy, then rising BMI could indicate a growing young adult and this seems to positively impact life expectancy. But, a BMI higher than 68 begins to have a negative impact on life expectancy.

Below is a scatter plot exploring the relationship between Body Mass Index and life expectancy. This 2D graph of the raw data, without the random forest model, shows the same trend as the pattern the model discovered. Correlation is not causation: higher or lower BMI is not necessarily the cause of lower or higher life expectancy, but rather a contributing factor.
Hover over the graph to explore it more.
Two-feature dependence plots explore the relationship between two features and the predicted value.
This 2 feature dependence plot is exploring the relationship between HIV AIDS, BMI and life expectancy. The way to interpret this plot is:
If hiv aids = 0.1%, BMI = 74.6, then the predicted life expectancy would be 73.05 yrs.

This 2 feature dependence plot is exploring the relationship between schooling, BMI and life expectancy. The way to interpret this plot is:
If schooling = 7.6 yrs, BMI = 74.6, then the predicted life expectancy would be 66.56 yrs.

This 2 feature dependence plot is exploring the relationship between HIV AIDS, schooling and life expectancy. The way to interpret this plot is:
If hiv aids = 0.1%, schooling = 18.7 yrs, then the predicted life expectancy would be 75.81 yrs.

So far, this data analysis has been explored in 2D, let’s now explore it in 3D to experience how different features interact with each other in the 3rd dimension.
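As a sketch of how an interactive 3D scatter like the ones below might be produced with plotly express on the raw (cleaned) data; the dataframe and column names are assumptions.

```python
import plotly.express as px

# 3D scatter of HIV/AIDS deaths vs. schooling vs. life expectancy.
fig = px.scatter_3d(train, x='hiv_aids', y='schooling', z='life_expectancy',
                    color='life_expectancy', opacity=0.6)
fig.show()
```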
This 3d graph explores the relationship between death by HIV AIDS, number of years in school, and life expectancy. This graph was done using raw data without the random forest model. Hover over the graph and play with it.
This 3d graph explores the relationship between death by HIV AIDS, number of years in school, and life expectancy. This is the 3d graph from the random forest model. Hover over the graph and play with it.
This 3d graph explores the relationship between death by HIV AIDS, Body Mass Index, and life expectancy. This graph was done using raw data without the random forest model. Hover over the graph and play with it.
This 3d graph explores the relationship between death by HIV AIDS, Body Mass Index, and life expectancy. This is the 3d graph from the random forest model. Hover over the graph and play with it.
This 3d graph explores the relationship between number of years in school, Body Mass Index, and life expectancy. This graph was done using raw data without the random forest model. Hover over the graph and play with it.
This 3d graph explores the relationship between number of years in school, Body Mass Index, and life expectancy. This is the 3d graph from the random forest model. Hover over the graph and play with it.
Step 10: Use Shapley Plots to help explain the models.
SHapley Additive exPlanations (SHAP) values break down a prediction to show the impact of each feature. The previous techniques used to explain the model have focused on general insights; SHAP values break down how the model works for each individual prediction.
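A minimal sketch of producing a SHAP force plot for one row with the shap library, assuming the fitted random forest pipeline from earlier; the row index and variable names are assumptions.

```python
import shap

# Pull the fitted encoder and regressor out of the pipeline.
encoder = models['Random Forest'].named_steps['ordinalencoder']
rf = models['Random Forest'].named_steps['randomforestregressor']

# Explain a single encoded row from the test set.
row = encoder.transform(X_test).iloc[[0]]
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(row)

# Render the force plot (base value plus per-feature contributions).
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values, row)
```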
These are some SHAP plots for high and low life expectancy.


The image above shows a visual of how the SHAP values generated from the random forest model combine to create a prediction. It shows how the values of different features contribute to the final prediction. The image on the left shows the raw SHAP values, while the image above is a visual representation of those raw values. Some features push toward higher life expectancy while others push toward lower life expectancy.
Read the text below to learn how to interpret these SHAP values…


For this particular row in the dataset, this is how to interpret the prediction using the SHAP values. Remember our baseline value was 68.7yrs which is also the mean of the target column (life expectancy).
**status** is a categorical feature: 1 = developed country, 2 = developing country.
Read the text below to learn how to interpret these SHAP values…


Let’s use this one to interpret the SHAP values. Beginning with the baseline value of 68.7 yrs:
- Status = 1 adds 0.44 yrs to the 68.7 yr baseline
- alcohol = 7.81 liters adds 0.17 yrs
- hiv aids = 0.1% adds 4.69 yrs
- school = 18.7 yrs adds 4.07 yrs
- measles = 0.0 subtracts 0.15 yrs, etc.


Techniques like SHAP values, single- and two-feature partial dependence plots, and feature importances help us demystify black-box algorithms and explain our model’s decisions. For linear models, much of this information is already provided by the coefficients, but sometimes linear models are not the best tool for the job and are not accurate enough. With these model interpretation techniques, we can have the accuracy and power of complex models and still gain the interpretability of simpler models.
Conclusions:
There is no single predictor of life expectancy, and there are things that contribute to life expectancy that are not in this dataset. But based on the data used in this project, multiple factors greatly affect life expectancy. The factors that had the most impact were the number of deaths from HIV/AIDS, the number of years in school, and the body mass index of the population.
Using these health, social, and economic variables, one can predict the life expectancy of a population. It would be a stretch to use this information to predict the life expectancy of an individual, because there are many more life variables involved than the ones presented in this project. This project and data analysis are most useful for predicting life expectancy at the population level.