title: "Simulation" date: 2021-04-17T00:29:16Z draft: true author : Angela C year: "2021" month: "2021/04" categories:
tags:
This notebook contains the body of the work for my submission for the Programming for Data Analysis Project 2019 as part of the Higher Diploma in Data Analytics at GMIT.
The problem statement from the Programming for Data Analysis Project 2019 instructions [1] is as follows:
For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:
The pdf document is included in this repository for reference.
This project was mainly developed using Python [2] and the following packages:
NumPy, a package for working with random numbers and one of the most important packages for numerical and scientific computing in Python.

The end goal of this project is to simulate a real-world phenomenon across at least one hundred data points and at least four different variables. A dataset must be simulated or synthesised. The instructions note that it is ok to base the synthesised dataset on an actual real-world dataset but the main task is to create a synthesised dataset.
The real-world phenomenon I have chosen is the World Happiness Score, in particular the main determinants of happiness at country levels across the world as reported in the World Happiness Report [6].
The variables on which the national and international happiness scores are calculated are very real and quantifiable. These include socio-economic indicators such as Gross Domestic Product (GDP), life expectancy as well as some life evaluation questions regarding freedom, perception of corruption, family or social support. Differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score according to the World Happiness Reports.
The aim of the World Happiness report is to see what countries or regions rank the highest in overall happiness and each of the six factors contributing to happiness. Over the years the reports looked at how country ranks and scores changed and whether any country experienced a significant increase or decrease in happiness.
The researchers studied how 6 different factors contribute to the happiness scores and the extent of each effect. These are economic production, social support, life expectancy, freedom, absence of corruption, and generosity. They looked at how these factors contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world's lowest national averages for each of the six factors. While these factors have no impact on the total score reported for each country, they were analysed to explain why some countries rank higher than others and to describe the extent to which each factor contributes to the evaluation of happiness in each country.
First I will investigate the variables in the datasets used by the World Happiness Reports. I will study their distributions by looking at descriptive statistics and plots such as histograms and boxplots. I will explore relationships that may exist between the variables using visualisations such as scatterplots and pairplots, and statistics such as correlation and covariance.
With this information in mind, I will try to create a simulated dataset that is as close to the real world phenomenon as possible.
Having studied the distributions of the real dataset by looking at statistics and plots, I will use Python to simulate the data, focusing on using the numpy.random
package as much as possible but using other Python libraries as may be required. I will look at how simulation is performed and what must be considered when simulating a dataset such as this one. I will look at how each of the variables are distributed and how they could be simulated. I will also consider the relationships between the variables. While it might be relatively straightforward to simulate a single variable, modelling the real-world correlations between the variables will be challenging.
As there is much inequality in the world, this will be reflected in the distribution of the variables that model factors such as income and life expectancy. Therefore I will need to look at regional variations and how this would affect the simulation of data. The distributions are likely to vary between regions, particularly between the lesser developed countries and the countries of the more developed world.
All the analysis will be documented in this notebook which means that the document is quite lengthy and might take some time to load. The first section of code involves reading in the real world dataset and getting it into a state where it is ready to be analysed. The data is available in excel and csv files which are left unchanged. As the files containing the 2018 data did not include the geographic regions of the countries studied, I had to add these to the data by merging with an earlier dataset. Some other manipulation such as renaming columns, dropping unnecessary columns, adding region codes etc. is documented below. The end result of this is written to csv files in the data folder in this repository.
The goal of this project is to simulate a dataset. Simulating data is used for a number of reasons. Monte Carlo simulations are used to simulate real world problems using repeated random sampling, while simulated data is very useful for learning and demonstration purposes. Data can be simulated before the real world data is collected to help identify the type of tests and programs that need to be run. Collecting data requires resources of time and money whereas data can be simulated easily using computer programs.
Statistical analysis can be performed on the simulated data in advance of collecting the real data, and this process can be repeated as many times as needed. By studying simulated data you can become more familiar with the different kinds of data distributions and be in a better position to make decisions about the data and what to do with it, such as how to measure it and how much is required. Simulations produce multiple sample outcomes. Experiments can be run by modifying inputs and seeing how this changes the output. The process of generating a random sample can be repeated many times, which allows you to see how often you would expect to get the outcomes you get. Repeating the process gives multiple outcomes which can then be averaged across all simulations.
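As a minimal illustration of repeated random sampling (a sketch with made-up values, not the happiness data), the probability that a standard normal variable exceeds 1 can be estimated by repeating a sampling experiment many times and averaging the outcomes:

# estimate P(X > 1) for a standard normal variable by repeated random sampling
import numpy as np
np.random.seed(0)
# repeat the experiment 500 times: draw 1000 values and record the proportion above 1
estimates = [np.mean(np.random.normal(size=1000) > 1) for _ in range(500)]
print("Monte Carlo estimate of P(X > 1):", round(np.mean(estimates), 3))  # roughly 0.159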
When data is collected, it is often only a small sample of data from the overall population of interest. The researchers of the World Happiness Reports did not collect all the data about the variables of interest. The typical sample size used per country was 1000 people, while some countries had more than one survey per year and others had fewer. A sample is a subset of numbers from a distribution and the bigger the sample size the more it resembles the distribution from which it is drawn. Depending on the distribution the data is drawn from, some numbers will occur more often than others. Sample statistics are descriptions of the data that can be calculated from the sample dataset and then be used to make inferences about the population. The population parameters are of most interest. These are the characteristics of the actual population from which a sample dataset is taken. Samples are used to estimate the parameters of the population. The sample mean is the mean of the numbers in the sample while the population mean is the mean of the entire population, but it is not always possible to study the entire population directly. The law of large numbers refers to how, as a sample size increases, the sample mean gets closer to the true population mean. Under the law of large numbers, the more data that is collected, the closer the sample statistics will get to the actual true population parameters.
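The law of large numbers can be seen in a small sketch (using a made-up normal population with a known mean of 5): as the sample size grows, the sample mean gets closer to the population mean.

# the sample mean approaches the known population mean (5) as the sample size grows
import numpy as np
np.random.seed(1)
for n in [10, 100, 10000, 1000000]:
    sample = np.random.normal(loc=5, scale=2, size=n)
    print("sample size", n, "-> sample mean", round(sample.mean(), 3))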
The sampling distribution of the sample means is when you collect many samples from the population and calculate the sample means on each sample. If you know the type of distribution you could sample some data from this distribution, calculate the means or any other sample statistic of the samples and plot them using a histogram to show the distribution of the sample statistic. The sampling distributions can tell you what to expect from your data.
Simulation can be used to find out what a sample looks like if it comes from a particular distribution. This information can be used to make inferences about whether the sample came from that distribution or not. The sampling distribution of a statistic varies as a function of sample size. Small samples taken from the distribution will probably have sample statistics, such as sample means, that vary quite a bit from sample to sample and therefore the sampling distribution will be quite wide. Larger samples are more likely to have similar statistics and a narrower sampling distribution.
As the size of the samples increases, the mean of the sampling distribution approaches the mean of the population. The sampling distribution is itself a distribution and has some variance. The standard deviation of the sampling distribution is known as the standard error. As the sample size increases, the standard error of the sample mean decreases. According to the central limit theorem, as the sample size increases the sampling distribution of the mean begins to look more like a normal distribution, no matter what the shape of the population distribution is.
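A small sketch of this (again with made-up data): drawing many samples of each size from a skewed exponential population and looking at the spread of the sample means shows the standard error shrinking as the sample size grows.

# compare the spread of sample means (the standard error) for different sample sizes
import numpy as np
np.random.seed(2)
for n in [5, 50, 500]:
    # 10,000 samples of size n from a skewed (exponential) population
    sample_means = np.random.exponential(scale=1.0, size=(10000, n)).mean(axis=1)
    print("sample size", n, "-> standard error of the mean", round(sample_means.std(), 3))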
Large experiments are considered more reliable than smaller ones. If you take a big enough sample, the sample mean gives a very good estimate of the population mean.
When simulating a random variable, you first need to define the possible outcomes of the random variable. To do this you can use the sample statistics from the sample dataset. Using simulated data therefore allows you to identify coding errors as you know what the outcomes should be.
Resampling methods are another way of simulating data and involve resampling with replacement. Bootstrap resampling is the most common method.
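A minimal sketch of bootstrap resampling with numpy.random.choice, using a made-up observed sample rather than the real survey data:

# bootstrap: resample an observed sample with replacement and examine the resampled means
import numpy as np
np.random.seed(3)
observed = np.random.normal(loc=5.4, scale=1.1, size=150)  # made-up stand-in for an observed sample
boot_means = [np.random.choice(observed, size=len(observed), replace=True).mean()
              for _ in range(1000)]
print("bootstrap standard error of the mean:", round(np.std(boot_means), 3))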
For this section I referred to an online book called Answering Questions with Data (Textbook): Introductory Statistics for Psychology Students by Matthew J C Crump [6].
Having identified a real-world phenomenon to simulate, the next step is to investigate the types of variables involved, their likely distributions, and their relationships with each other. To do so I need data.
The World Happiness Report is produced by the United Nations Sustainable Development Solutions Network in partnership with the Ernesto Illy Foundation. The first World Happiness Report was published in 2012, the latest in 2019. The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. Each year the report has focused on a different aspect, such as how the new science of happiness explains personal and national variations in happiness and how well-being is a critical component of how the world measures its economic and social development. Over the years it looked at changes in happiness levels in the countries studied and the underlying reasons, and at the measurement and consequences of inequality in the distribution of well-being among countries and regions. The 2017 report emphasized the importance of the social foundations of happiness while the 2018 report focused on migration. The latest World Happiness Report (2019) focused on happiness and the community and on how happiness has evolved over the past dozen years, looking at the technologies, social norms, conflicts and government policies that have driven those changes.
Increasingly, happiness is considered to be the proper measure of social progress and the goal of public policy. Happiness indicators are being used by governments, organisations and civil society to help with decision making. Experts believe that measurements of well-being can be used to assess the progress of nations. The World Happiness reports review the state of happiness in the world and show how the new science of happiness explains personal and national variations in happiness.
The underlying source of the happiness scores in the World Happiness Report is the Gallup World Poll - a set of nationally representative surveys undertaken in many countries across the world. The main life evaluation question asked in the poll is based on the Cantril ladder. Respondents are asked to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples for the years 2016-2018. The overall happiness scores and ranks were calculated after a study of the underlying variables.
Happiness and life satisfaction are considered as central research areas in social sciences.
The variables on which the national and international happiness scores are calculated are very real and quantifiable. These include socio-economic indicators such as GDP and life expectancy, as well as other life evaluation questions regarding freedom, perception of corruption, and family or social support. Differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score according to the World Happiness Reports.
The variables used reflect what has been broadly found in the research literature to be important in explaining national-level differences in life evaluations. Some important variables, such as unemployment or inequality, do not appear because comparable international data are not yet available for the full sample of countries. The variables are intended to illustrate important lines of correlation rather than to reflect clean causal estimates, since some of the data are drawn from the same survey sources, some are correlated with each other (or with other important factors for which we do not have measures), and in several instances there are likely to be two-way relations between life evaluations and the chosen variables (for example, healthy people are overall happier, but as Chapter 4 in the World Happiness Report 2013 demonstrated, happier people are overall healthier).
The World Happiness Reports and data are available from the Worldhappiness website. The latest report is The World Happiness Report 2019[7]. The World Happiness Report is available for each year from 2012 to 2019 containing data for the prior year. For each year there is an excel file with several sheets including one sheet with annual data for different variables over a number of years and other sheets containing the data for the calculation of the World Happiness score for that year. Some of the data such as Log GDP per capita are forecast from the previous years where the data was not yet available at the time of the report. Kaggle also hosts part of the World Happiness datasets for the reports from 2015 to 2019.
The full datasets are available online in excel format by following a link under the Downloads section on the World Happiness Report[8] website to https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls.
Kaggle Datasets[9] also has some datasets in csv format.
I have downloaded both the latest csv and excel files to the /data
folder in this repository. The data from the excel file is the most comprehensive as it includes data for several years from 2008 to 2019. There are two main excel sheets in the 2019 data. The sheet named Figure2.6
contains the main variables used as predictors of the happiness score in the 2019 World Happiness Report. The data is contained in columns A to K while the remaining columns contain actual tables and figures used in the World Happiness annual report. A second sheet called Table2.1
contains all the available data from 2008 up to 2019 and provides more detail.
In this project I will focus on the main determinants of the Happiness scores as reported in the World Happiness reports. These are income, life expectancy, social support, freedom, generosity and corruption. The happiness scores and the world happiness ranks are recorded in Figure 2.6 of the 2019 report with a breakdown of how much each individual factor impacts or explains the happiness of each country studied rather than actual measurements of the variables. The columns in df6
dataframe correspond to the columns in the Figure 2.6 data from the World Happiness Report of 2019. The values in the columns describe the extent to which the 6 factors contribute to the evaluation of happiness in each country.
The actual values of these variables are in Table 2.1 data of the World Happiness Report. There are some other variables included in the report which have a smaller effect on the happiness scores.
Life Ladder: the variable named Life Ladder is a happiness score or subjective well-being measure from the Gallup World Poll. It is the national average response to the question on life evaluations. The English wording of the question is “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?” This measure is also referred to as the Cantril life ladder [10]. The values in the dataset are real numbers representing national averages. They could vary between 0 and 10 but in reality the range is much smaller.
Log GDP per capita:
GDP per capita is a measure of a country's economic output that accounts for its number of people. It divides the country's gross domestic product by its total population. That makes it a good measurement of a country's standard of living. It tells you how prosperous a country feels to each of its citizens [11].
Gross Domestic Product per capita is an approximation of the value of goods produced per person in the country, equal to the country's GDP divided by the total number of people in the country. It is usually expressed in local currency or a standard unit of currency in international markets such as the US dollar. GDP per capita is an important indicator of economic performance and can be used for cross-country comparisons of average living standards. To compare GDP per capita between countries, purchasing power parity (PPP) is used to create parity between different economies by comparing the cost of a basket of similar goods.
GDP per capita can be used to compare the prosperity of countries with different population sizes.
There are very large differences in income per capita across the world. As average income increases over time, the distribution of GDP per capita gets wider. Therefore the log of income per capita is taken, since the growth is approximately proportional: when $x(t)$ grows at a proportional rate, $\log x(t)$ grows linearly [12] (Introduction to Economic Growth lecture, MIT).

Per capita GDP is a unimodal but skewed distribution. The log of GDP per capita = log(total GDP / population) is a more symmetrical distribution [13] (Sustainable Development Econometrics lecture).
The natural log is often used in economics as it can make it easier to see the trends in the data and the log of the values can fit a normal distribution.
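As a small illustration with made-up, lognormally distributed values (not the real GDP figures), the log transform removes most of the right skew:

# a log transform makes a right-skewed, income-like variable much more symmetric
import numpy as np
import pandas as pd
np.random.seed(4)
income = np.random.lognormal(mean=9, sigma=1, size=1000)  # made-up income-like values
print("skew of raw values:", round(pd.Series(income).skew(), 2))           # strongly right skewed
print("skew of log values:", round(pd.Series(np.log(income)).skew(), 2))   # close to zero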
Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository.
Healthy life expectancy (HALE) is a form of health expectancy that applies disability weights to health states to compute the equivalent number of years of good health that a newborn can expect. The indicator Healthy Life Years (HLY) at birth measures the number of years that a person at birth is still expected to live in a healthy condition. HLY is a health expectancy indicator which combines information on mortality and morbidity. [14]Eurostat.
Overall, global HALE at birth in 2015 for males and females combined was 63.1 years, 8.3 years lower than total life expectancy at birth. In other words, poor health resulted in a loss of nearly 8 years of healthy life, on average globally. Global HALE at birth for females was only 3 years greater than that for males. In comparison, female life expectancy at birth was almost 5 years higher than that for males. HALE at birth ranged from a low of 51.1 years for African males to 70.5 years for females in the WHO European Region. The equivalent “lost” healthy years (LHE = total life expectancy minus HALE) ranged from 13% of total life expectancy at birth in the WHO African Region to 10% in the WHO Western Pacific Region. [15] www.who.int.
Here I load the python libraries and set up some formatting for the notebook.
# import python libraries using common alias names
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# display matplotlib plots inline
%matplotlib inline
# check what version of packages are installed.
print("NumPy version",np.__version__, "pandas version ",pd.__version__, "seaborn version",sns.__version__ ) # '1.16.2'
# set print options with floating point precision of 4, summarise long arrays using threshold of 5, suppress small results
np.set_printoptions(precision=4, threshold=5, suppress=True) # set floating point precision to 4
# set options to display max number of rows
pd.options.display.max_rows=8
# I did change the max number of rows to display when developing the project to see more rows as necessary.
import warnings
warnings.filterwarnings('ignore')
#!ls data # see what files are in the data folder of this repository
Here I load the real-world dataset for the World Happiness Report into Python. The data is available in the data
folder of this repository which is what I use for this project. The dataset can also be read in directly from the URL.
Data for other years is also available by following the links from the World Happiness Report website.
# read the data directly from the url or alternatively from the data folder in this repository
url="https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls"
# The entire data from Table2.1 sheet
WH = pd.read_excel(url, sheet_name='Table2.1')
# The data from the sheet Figure2.6, columns A to K
whr18 = pd.read_excel(url,sheet_name='Figure2.6', usecols="A:K")
# import the data from the data folder in this repository
Table2_1 = pd.read_excel('data/Chapter2OnlineData2019.xls', sheet_name='Table2.1', usecols="A:I")
# The data from the sheet Figure2.6, selected columns between A and K
Fig2_6 = pd.read_excel('data/Chapter2OnlineData2019.xls',sheet_name='Figure2.6', usecols="A:B, F:K")
# read in only selected columns from Figure2.6
Fig2_6_x = pd.read_excel('data/Chapter2OnlineData2019.xls',sheet_name='Figure2.6', usecols="A:B")
# the 2019 data, same values as Kaggle data except not including the rank but including the whiskers or intervals
print("The shape of the data from Table2.1 is: \n",Table2_1.shape)
print("The shape of the data from Figure 2.6 is: \n",Fig2_6.shape)
I have read in the data from the excel sheet by selecting certain columns. There are 1704 rows in the excel sheet Table 2.1
and 156 rows in sheet Figure2.6
.
Table 2.1 contains the data on which the happiness scores in Figure 2.6 of the World Happiness Report for 2019 are calculated. When the Table2.1 data is filtered for 2018 data there are only 136 rows. The report does note that where values were not yet available at the time, values were interpolated based on previous years.
Here I am going to merge the two spreadsheets and filter for just the columns I need. I will then clean up the dataframe names used throughout this project.
#Table2_1.columns
# look at columnes for Figure 2.6 data
#Fig2_6.columns
To make the dataframes easier to work with and to merge them, I will rename the columns.
# only contains the country and happiness score for 2018
Fig2_6_x.columns
# rename the columns
Table2_1.rename(columns={'Country name':'Country', 'Life Ladder':'Life_Satisfaction','Log GDP per capita':'Log_GDP_per_cap',
'Social support':'Social_Support', 'Healthy life expectancy at birth':'Life_Expectancy','Freedom to make life choices':'Freedom',
'Perceptions of corruption':'Corruption_perceptions'}, inplace=True)
# look at the columns
#Table2_1.columns
# merge the two dataframes that came from the two spreadsheets on the Country variable.
merged= pd.merge(Fig2_6_x,Table2_1, on="Country",how="right")
# as the Figure 2.6 data is for 2018 only and I am only interested in the Happiness Score, I will change the column name and drop all the other columns from Figure 2.6
merged =merged.rename(columns={'Happiness score':'Happiness_Score_2018'})
The World Happiness Report notes that for the world as a whole the distribution of happiness is normally distributed, but when the global population is split into ten geographic regions, the resulting distributions vary greatly in both shape and average values.
Therefore I think it is important to look at the geographic regions when studying the properties of this dataset. The region data is not included in the data from the World Happiness Report but it was included in the csv files for 2015 and 2016 on Kaggle in addition to the country. (For some reason the datasets for later years on Kaggle had a single Country or region
variable that only contained the country).
The following segment of code is used to add the geographic regions to the data files I am using for this project. First I had to rename the columns containing the country and then use pandas merge
to join the dataframes based on the country names being equal to get the geographic regions into my dataset. This dataframe was then written to csv file.
For merging dataframes I referred to the pandas documentation and a blogpost by Chris Albon, pandas merge with a right join [16].
A Region
variable has been added to the data in order to look at the distributions across regions.
# read in the kaggle dataset for 2015 as this contains the Regions as well as the country names.
k15 = pd.read_csv("data/2015.csv")
# extract the country and regions from the 2015 file:
Regions = k15.loc[:,['Country','Region']]
#Regions.shape # 158 rows
# see how many unique regions and countries
Regions.describe(include="object")
#merge the dataframes on country variable
merged=pd.merge(Regions, merged,on='Country',how='right')
#merged.describe()
# There should now be two non-numeric variables in the dataset. Country name and Region
#merged.describe(include="object")
# a peek at the dataset
print("\t DataFrame merged")
merged.head(3)
I now have a dataframe merged
containing the data from Table 2.1
in the World Happiness Report of 2019 with the geographic regions added as well as the 2018 Happiness Score from the Figure 2.6
data.
However some countries were missing a region value as these countries were not included in the 2015 dataset. In this case I looked up the geographic region and added them to the dataframe to replace the NaN values.
# write to csv
merged.to_csv("data/merged.csv")
# create a dataframe called df from the merged dataframe
df=merged
# find how many rows have missing values for Region
df['Region'].isna().sum()
# print the rows with missing Region values
df.loc[df.loc[:,'Region'].isna()]
# update the value of Region for the following countries
df.loc[df['Country']=="Taiwan Province of China",['Region']]="Southeastern Asia"
df.loc[df['Country']=="Namibia",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Somalia",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Hong Kong S.A.R. of China",['Region']]="Southeastern Asia"
df.loc[df['Country']=="South Sudan",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Gambia",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Belize",['Region']]="Latin America and Caribbean"
df.loc[df['Country']=="Cuba",['Region']]="Latin America and Caribbean"
df.loc[df['Country']=="Guyana",['Region']]="Latin America and Caribbean"
# checking to make sure all regions have values now
df['Region'].isna().sum()
I added in a region code but this is no longer of use. I will leave it in for now but will remove it later.
To create the region code I looked at the proportion of the overall number of countries that are in each region and then created a new column with the region number. The numbers assigned are not ordered as such - I just started with 1 for the region with the greatest number of countries.
To do this I will add a new column RegionCode that starts out holding the Region name and then replace each region name in this new column with an integer from 1 to 10. (There is probably a better way of doing this - where
did not work for me when there were 10 options to choose from; a more compact alternative using a dictionary is sketched after the code below.)
First I look at how the countries are split between geographic regions and then assign a number from 1 to 10 for the region with the highest number of countries although there is no order as such.
# just looking again at the proportion of countries over the 10 regions
df.Region.value_counts()/len(df.Region)
# add a new column called RegionCode with the Region
df['RegionCode']=df['Region']
df.head()
# replace the new regionCode with a number for each region as follows:
df['RegionCode']=df["RegionCode"].replace("Sub-Saharan Africa",1)
df['RegionCode']=df["RegionCode"].replace("Central and Eastern Europe",2)
df['RegionCode']=df["RegionCode"].replace("Western Europe",3)
df['RegionCode']=df["RegionCode"].replace("Latin America and Caribbean",4)
df['RegionCode']=df["RegionCode"].replace("Middle East and Northern Africa",5)
df['RegionCode']=df["RegionCode"].replace("Southeastern Asia",6)
df['RegionCode']=df["RegionCode"].replace("Southern Asia",7)
df['RegionCode']=df["RegionCode"].replace("Eastern Asia",8)
df['RegionCode']=df["RegionCode"].replace("Australia and New Zealand",9)
df['RegionCode']=df["RegionCode"].replace("North America",10)
# convert to integer - might use this later for correlations.
#df["RegionCode"] = pd.to_numeric(dfh["RegionCode"])
# write the dataframes to a csv files:
df.to_csv("data/Table2_1.csv")
Note that Table 2.1
data includes some rows where there are some missing values as there were some countries added to the World Happiness Report in recent years for which the data was not available.
Also some of the data in Table 2.1
was not available for 2018 at the time of the 2019 report being published. Some imputation or interpolation from previous years' values was used.
Statistical Appendix 1 for Chapter 2[10] of the World Happiness Report for 2019 outlines how imputation is used for missing values when trying to decompose a country's average ladder score into components explained by the 6 hypothesized underlying determinants (GDP per person, healthy life expectancy, social support, perceived freedom to make life choice, generosity and perception of corruption).
All the data I am using to figure out the distribution of variables has now been prepared.
# read in the Table 2.1 data back in, set the index_col to be the first column
df = pd.read_csv("data/Table2_1.csv", index_col=0)
# look at top and bottom rows to see it all looks ok
df.tail(2)
df.head(2)
# Create a dataframe with 2018 data from the Table 2.1 of the World Happiness Report 2019:
df18=df.loc[df.loc[:,'Year']==2018]
print("\n \t The dataframe df18 containing the Table 2.1 data from the 2019 World Happiness Report.\n")
df18.head(2)
# see the dimensions of the dataframes
print(df.shape,df18.shape) #
Here I first create a smaller dataframe from df18 called dfh containing the variables of interest. This is just to make the dataframes easier to work with and print. The initial work on this project was done looking only at 2018 data. However there are a few missing values for data that was not available at the time. Therefore I create another dataframe containing only the relevant variables for each year from 2012 to 2018.
Below I subset the larger file (from Table 2.1 data in World Happiness Report 2019 which contains data for several years prior to 2018) to include only the main variables of interest for this project.
# drop year column as this only contains 2018 anyway.
## create a data frame dfh that contains only a subset of the columns from df
df18 = df18.loc[:,['Country','Region','Life_Satisfaction','Log_GDP_per_cap','Social_Support','Life_Expectancy', 'RegionCode']]
df18.head()
# using .loc to create a dataframe containing only the required columns for this project
dfx= df.loc[:,['Year','Country','Region','Life_Satisfaction','Log_GDP_per_cap','Social_Support','Life_Expectancy', 'RegionCode']]
# using .loc to subset only the rows contains years from 2011 to 2018.
df_years = dfx.loc[dfx.loc[:,'Year'].isin([2018,2017,2016,2015,2014,2013,2012,2011])]
# look at the columns
df_years.columns
print("\t DataFrame called df_years \n")
df_years.head()
The dataframe called df_years
contains all the data from Table 2.1 of the 2019 World Happiness Report data for the years between 2011 and 2018. There are some missing values for some columns. I will write this to csv. To see only the rows containing missing data I use isnull().any(axis=1)
as per stackoverflow[17]. There are not many missing values overall.
# to see how many observations for each year
#df_years.groupby('Year').count()
# write the csv file to csv
df_years.to_csv("data/WHR_2012_2018.csv")
# read back in the csv file.
df_years= pd.read_csv("data/WHR_2012_2018.csv", index_col=0)
df_years.head()
# look at only rows with missing values: https://stackoverflow.com/a/30447205
# any(axis=1) to check for any missing values per row, then filter with boolean mask
nulldata=df_years[df_years.isnull().any(axis=1)]
# see how many rows have missing values
print(f"There are {nulldata.shape[0]} rows with missing values out of a total of {df_years.shape[0]} rows for data from 2011 to 2018.")
In order to be able to simulate data I need to know more about the data and what it looks like. I will go through each of the variables in a sub-section of their own, but I will first look at the summary statistics and some visualisations of the distribution of the most important variables in the datasets. I will look at the similarities and the differences in the data.
The distributions can be plotted to summarise the data visually using histograms and kernel density estimate (KDE) plots. The histogram plots the distribution of the frequency counts across the bins. Distributions can have very different shapes which describe the data. Histograms can show how some of the numbers group together, and the location and shape of the data. The height of the bars on a histogram indicates how much data there is, and the minimum and maximum values show the range of the data. The width of the bars can be controlled by changing the bin sizes. Kernel density estimation [18] is a non-parametric way to estimate the probability density function of a random variable. Inferences about the population are made based on a finite data sample.
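As a quick illustration using the 2018 dataframe prepared above, a histogram and kernel density estimate can be drawn together for the life satisfaction scores:

# histogram of the 2018 life satisfaction scores with a kernel density estimate overlaid
sns.histplot(df18['Life_Satisfaction'].dropna(), bins=10, kde=True)
plt.title("Histogram and KDE of Life Satisfaction 2018");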
The central tendency measures shows how common certain numbers are, how similar data points are to each other and where the most data tends to be located while the variance shows the spread of the data and how different the data points are.
Correlation is not the same as causation while lack of an obvious correlation does not mean there is no causation. Correlation between two variables could be due to a confounding or third variable that is not directly measured. Correlations can also be caused by random chance - these are called spurious correlations. These are all things to consider when looking at data and when attempting to simulate data.
A seaborn pairplot
can show if there are any obvious relationships between variables. The univariate distribution of each variable is shown on the diagonal. The pairplot below is for the data for 2018. The scatterplots in the pairplot show a positive linear relationship between each of the variables Income (Log GDP per capita), Healthy Life Expectancy, Social support and satisfaction with life (Life Ladder). While the Life Ladder distribution looks more normally distributed than the other variables, it has 2 distinct peaks. The other distributions are left skewed. The 2016 World Happiness Report [10] notes how for the world as a whole, the distribution of world happiness is very normally distributed about the median answer of 5, with the population-weighted mean being 5.4, but when the global population is split into ten geographic regions, the resulting distributions vary greatly in both shape and average values. Only two regions - the Middle East and North Africa, and Latin America and the Caribbean - have more unequally distributed happiness than does the world as a whole.
Therefore taking this into account it is important to look at the data on a regional basis. The second pairplot uses the hue
semantic to colour the points by region and shows the distributions for each variables for each region on the diagonal. The distribution plots show very different shapes for each variable when broken down by regions. Sub-Saharan Africa stands out as a region that has far different levels of life satisfaction, income and life expectancy than most other regions.
Regions such as Western Europe, North America and Australia and New Zealand have distributions that are centred around much higher values than regions such as Sub-Saharan Africa and South Asia. The distributions for North America, Australia and New Zealand are very narrow as these regions contain a very small number of countries compared to the other regions.
On a regional level the distributions look more normally distributed for all the four variables but with different locations and spreads. This means that any simulation of the variables must take the separate distributions into account.
# pairplot of the variables, drop missing values
print("\nDistribution of Life satisfaction, Income, Social Support and Healthy Life Expectancy globally and then by region\n")
sns.pairplot(df18.iloc[:,:6].dropna());
# could also plot the pairplot for all years from 2011 to 2018.
#sns.pairplot(df_years.iloc[:,1:7].dropna());
# pairplots showing the Region as the colour of the dots.
sns.pairplot(df18.iloc[:,:6].dropna(), hue="Region", palette="bright");
The pairplots show a distinction between the different geographic regions for most variables. Sub-Saharan Africa stands out as a region that has far different levels of life satisfaction, income and life expectancy than most other regions. The histograms for each variable are plotted again individually to see the distributions more clearly for each variable.
A closer look at each of the variables in the histograms below.
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.histplot(df18['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Life Ladder globally 2018");
sns.histplot(df18['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Income globally 2018");
sns.histplot(df18['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Social support globally 2018");
sns.histplot(df18['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Life expectancy globally 2018");
plt.tight_layout();
The distribution of the Life Ladder variable looks to be normally distributed while there is some left skew in the other variables. Normally distributed data is considered the easiest to work with as normal distributions can be compared by looking at their means and standard deviations. Many statistical methods assume variables are normally distributed and others work better with normality (Sustainable Development lectures [13]).
Here are the summary statistics for the variables for 2018 and also over the period from 2012 to 2018 inclusive. The 2018 statistics are similar enough to the amalgamated 2012 to 2018 statistics.
# summary statistics of the dataset (2018)- just showing the main variables of interest
print("The data available for 2018 only: \n")
df18.iloc[:,0:6].describe()
# summary statistics of data from 2012 to 2018
print("Summary statistics for all years from 2012 to 2018 based on Table 2.1 data. \n")
df_years.iloc[:,3:7].describe()
Boxplots can show the central tendency, symmetry and skew of the data as well as any outliers. The rectangular boxes are bounded by the hinges representing the lower (1st) quartile and upper (3rd quartile). The median of 50th percentile is shown by the line through the box. The whiskers show the minimum and maximum values of the data excluding any outliers. A distribution is symmetric if the median is in the centre of the box and the whiskers are the same length.
While none of the variables look symmetric, the distribution of Life ladder is less asymmetric than the other variables. A skewed distribution has the median closer to the shorter whisker as is the case for Healthy Life Expectancy, Social support and Log GDP per capita. A positive/right skewed distribution has a longer top whisker than bottom while a negatively skewed / left skewed distribution has a longer lower whisker as is the case for Log GDP per capita, Social support and Healthy life expectancy.
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(1,4, figsize=(12,3))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.boxplot(x=df18['Life_Satisfaction'], ax=axes[0], color="r");
sns.boxplot(x=df18['Log_GDP_per_cap'], ax=axes[1], color="g");
sns.boxplot(x=df18['Social_Support'], ax=axes[2],color="b");
sns.boxplot(x=df18['Life_Expectancy'], ax=axes[3],color="y");
plt.tight_layout();
As the pairplot showed earlier, there are wide variations between regions and for this reason I will look at the distributions of each variable on a geographic regional basis. Below are the mean and standard deviation statistics showing how the central tendency and spread of the distributions vary across regions. The mean and standard deviations at a global level are then shown. There is quite a difference in the statistics on a regional basis which is lost in the overall statistics.
# look at 2018 statistics by Region, round for printing
# just look at specific columns
print("The mean and standard deviation for the variables on a regional basis. \n")
df18.iloc[:,:6].groupby(['Region']).agg([np.mean, np.std]).round(4)
# mean and standard deviation at the global level in the dataset for 2018
print("The mean and standard deviation at a global level for 2018: \n")
df18.iloc[:,2:6].agg([np.mean, np.std])
# mean and standard deviation at the global level in the dataset for 2012 to 2018
print("The mean and standard deviation at a global level for years from 2012 to 2018 \n")
df_years.iloc[:,3:7].agg([np.mean,np.std])
The statistics for 2018 and for all years from 2012 to 2018 are very similar. I am focusing on 2018 but if I need extra data I can use the dataset with all years from 2012 to 2018.
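Using region-level sample statistics like those above, a first rough sketch of a per-region simulation could look like the following. This assumes normality within each region, which is only a working assumption at this point.

# rough sketch: simulate life satisfaction region by region from normal distributions
# parametrised by each region's sample mean and standard deviation in the 2018 data
np.random.seed(42)
region_stats = df18.groupby('Region')['Life_Satisfaction'].agg(['mean', 'std', 'count'])
sim_parts = [np.random.normal(row['mean'], row['std'], int(row['count']))
             for _, row in region_stats.iterrows()]
sim_life = pd.Series(np.concatenate(sim_parts))
print(sim_life.describe().round(3))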
df18.groupby(['Region']).describe().round(4)
print("Central Tendency, Skew and outliers of Life Satisfaction, Income, Social Support and Life Expectancy by region \n ")
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(12,8))
sns.boxplot(y="Region", x="Life_Satisfaction", data=df18, ax=axes[0,0]);
sns.boxplot(y="Region", x="Log_GDP_per_cap", data=df18, ax=axes[0,1]);
sns.boxplot(y="Region", x="Social_Support", data=df18, ax=axes[1,0]);
sns.boxplot(y="Region", x="Life_Expectancy", data=df18, ax=axes[1,1]);
plt.tight_layout();
When the distributions of each of the variables representing life satisfaction, gdp per capita, social support and life expectancy at birth are broken down by regional level, they tell a very different story to the distributions of the variables at a global level. There are some regions that overlap with each other and there are regions at complete opposite ends of the spectrum. I think this further clarifies the need to look at the variables on a regional level when simulating the data.
For example the boxplots show that the median values of Log GDP per capita fall roughly into 3 groups (similar to the boxplots for social support) by region, with Western Europe, North America and Australia and New Zealand having the highest scores, while Southern Asia and Sub-Saharan Africa have the lowest median scores and are the most variable along with the Middle East and Northern Africa region. There is no overlap at all between the measurements for the richer regions (such as Western Europe, Australia and North America) and the poorer regions (Sub-Saharan Africa and Southern Asia). The boxplots by geographic region show that there is great variation between regions in the distribution of GDP per capita. The boxplots also show some outliers, where a few countries from each geographic region are more similar to countries in another geographic region than to those in their own region.
Colouring the points by region in the plots above showed that some regions have very similar distributions to certain other regions and very dissimilar distributions to the rest. For instance Southern Asia and Sub-Saharan Africa are always close to each other and their distributions barely overlap, if at all, with regions such as Western Europe, North America and Australia and New Zealand. Therefore I am having a quick look at using KMeans clustering to see how it would cluster the points rather than simulating data for the 10 regions, as it is a small enough dataset.
Here I refer to a pythonprogramming blogpost on flat clustering[19]. (I just looked briefly at clustering to see if the countries could be separated into clear groups of regions and did not do a thorough clustering analysis. I chose 3 as the number of clusters to use based on the plots above but a full clustering analysis would indicate the best number of clusters).
##Adapted from code at: https://pythonprogramming.net/flat-clustering-machine-learning-python-scikit-learn/).
from sklearn.cluster import KMeans
# select the data to use. drop any missing values
x=df_years.loc[:,['Year','Life_Satisfaction','Log_GDP_per_cap','Social_Support','Life_Expectancy','RegionCode']]
x=x.dropna()
# initialise kmeans to be the KMeans algorithm (flat clustering) with the number of clusters.
kmeans=KMeans(n_clusters=3)
# fit the data
kmeans.fit(x)
# create a copy of the data
clusters=x.copy()
# add column with the predictions from the kmeans model
clusters['cluster_pred']=kmeans.fit_predict(x)
clusters[['RegionCode','cluster_pred']]
# set the figure size
plt.rcParams["figure.figsize"] = (12,3)
# use seaborn scatterplot, hue by cluster group
g= sns.scatterplot(x=clusters['Life_Satisfaction'],y=clusters['Social_Support'],hue=clusters['cluster_pred'],palette='rainbow')
# place legends outside the box - # https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Clusters: Social support and Life Ladder");
# scatterplot of life ladder and social support
g =sns.scatterplot(y = df_years['Social_Support'],x= df_years['Life_Satisfaction'],hue=df_years['Region'],palette="bright")
# https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Scatterplot of Social support vs Life Ladder");
clusters.columns
# plot the clusters gdp versus life ladder
g= sns.scatterplot(x=clusters['Life_Satisfaction'],y=clusters['Log_GDP_per_cap'],hue=clusters['cluster_pred'],palette='rainbow')
# legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Clustering of Life Ladder vs Log GDP");
# scatterplot of actual log gdp vs life ladder
g =sns.scatterplot(y = df_years['Log_GDP_per_cap'],x= df_years['Life_Satisfaction'],hue=df_years['Region'], palette='bright')
# https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Scatterplot of Life Ladder vs Log GDP per capita");
# plot the clusters Life ladder vs Life Expectancy
g=sns.scatterplot(x=clusters['Life_Satisfaction'],y=clusters['Life_Expectancy'],hue=clusters['cluster_pred'],palette='rainbow')
plt.title("Clustering of Life Ladder vs Life Expectancy");
# place legends outside box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# scatterplot of life expectancy vs life ladder
g =sns.scatterplot(y = df_years['Life_Expectancy'],x= df_years['Life_Satisfaction'],hue=df_years['Region'], palette="bright")
# https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
The covariance and correlation statistics can be used to determine if there is a linear relationship between variables and if one variable tends to occur with large or small values of another variable.
Covariance is a measure of the joint variability of two random variables while the Pearson correlation coefficient is the normalised version of the covariance, and its magnitude shows the strength of the linear relationship between two random variables. The covariance is a measure of how much two variables vary with each other and in what direction a variable changes when another variable changes. Pandas cov
function can be used to get the covariances between pairs of variables.
The covariances below are all positive which means that when one of the variables is above its mean, the other variable is also likely to be above its mean.
The correlation coefficients are easier to interpret than the covariances. If there is a strong positive relationship between two variables then the value of the correlation coefficient will be close to 1 while a strong negative relationship between two variables will have a value close to -1. If the correlation coefficient is zero this indicates that there is no linear relationship between the variables.
Pandas corr
function is used to get the correlations between pairs of variables. There seems to be a strong positive correlation between each of the pairings between the four variables with the strongest relationship between Healthy Life Expectancy and Log GDP per capita.
As noted, the World Happiness Report researchers found that differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score. Income in the form of GDP per capita for a country has a strong and positive correlation with the average country scores of the response to the Cantril Ladder question.
print("The covariance of the variables in the dataset: \n")
# pandas cov function on the 4 variables in the dataset
df18.iloc[:,:6].cov()
print("The correlation between variables in the dataset: \n")
# pandas corr function on the 4 variables
df18.iloc[:,:6].corr()
print("The correlation between variables in the dataset across all years from 2011 to 2018: \n")
df_years.iloc[:,3:7].corr()
Seaborn regplot
or lmplot
functions can be used to visualize a linear relationship between variables as determined through regression. Statistical models are used to estimate a simple relationship between sets of observations that can be quickly visualised. These can be more informative than looking at the numbers alone. The regression plots below show a positive linear relationship between the pairs of variables.
df18.columns
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=df18, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=df18, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=df18, ax=axes[2])
# set the axis limits for this subplot.
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(x ="Life_Expectancy",y="Log_GDP_per_cap", data=df18, ax=axes[0])
sns.regplot(x ="Life_Expectancy",y="Social_Support", data=df18, ax=axes[1])
sns.regplot(x ="Life_Expectancy",y="Life_Satisfaction", data=df18, ax=axes[2])
plt.tight_layout();
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(x ="Log_GDP_per_cap",y="Life_Expectancy", data=df18, ax=axes[0])
sns.regplot(x ="Log_GDP_per_cap",y="Social_Support", data=df18, ax=axes[1])
sns.regplot(x ="Log_GDP_per_cap",y="Life_Satisfaction", data=df18, ax=axes[2])
axes[2].set_ylim(0,12)
plt.tight_layout();
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(x ="Social_Support",y="Life_Expectancy", data=df18, ax=axes[0])
sns.regplot(x ="Social_Support",y="Log_GDP_per_cap", data=df18, ax=axes[1])
sns.regplot(x ="Social_Support",y="Life_Satisfaction", data=df18, ax=axes[2])
axes[2].set_ylim(0,12)
plt.tight_layout();
The covariance and correlation statistics suggest a strong linear relationship between the pairs of variables and the regression plots confirm this. Life satisfactions are positively correlated with income levels (Log GDP per capita), social support and healthy life expectancy. Income and life expectancy are very highly correlated.
(I may be over-interpreting the correlations that use the region codes. What I wanted to see was whether the region a country is in is associated with the variables, as appears to be the case: life expectancy and income are generally related to the region.)
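One possible way to carry these correlations into the simulated data, sketched here rather than settled on, is to draw samples from a multivariate normal distribution parametrised by the sample means and covariance matrix of the four main variables:

# sketch: draw correlated samples using the sample means and covariance matrix of the
# four main variables for 2018, then check the simulated correlations against the real ones
np.random.seed(42)
cols = ['Life_Satisfaction', 'Log_GDP_per_cap', 'Social_Support', 'Life_Expectancy']
real = df18[cols].dropna()
sim = pd.DataFrame(np.random.multivariate_normal(real.mean(), real.cov(), size=len(real)),
                   columns=cols)
print(sim.corr().round(2))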
Having collected some data for the real-world phenomenon of the World Happiness Scores, I looked at the type of variables involved, their distributions and the relationships between the variables. In this section I will look more closely at the distributions of each of the variables and then simulate some data that hopefully will have properties that match the actual variables in the real sample dataset.
The aim of the World Happiness Scores is to see what countries or regions rank the highest in overall happiness and how each of the six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contributes to happiness levels. The World Happiness Report researchers studied how the different factors contribute to the happiness scores and the extent of each effect. While these factors have no impact on the total score reported for each country, they were analysed to explain why some countries rank higher than others and to describe the extent to which each factor contributes to the evaluation of happiness in each country.
As the differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score, this is what I focus on in this project.
Therefore I will simulate the following variables:
Regions and Countries
Life Ladder / Life Satisfaction
Log GDP per capita
Social Support
Healthy Life Expectancy at birth
There are countries from 10 geographic regions in the sample dataset for 2018. While the region was not in the actual dataset for 2018, I added it in from a dataset for a previous year (2015). This allowed me to see how the values for each variable in the dataset differ by region.
The rough clustering earlier divided the countries into 3 clustered regions. Of the clustered countries, 10 were lost due to missing values. If I had more time the missing values could be imputed based on previous years, but for now there are at least 100 countries, which is sufficient for this project.
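A hedged sketch of how such imputation could work (not applied in this project): sort each country's rows by year and carry its most recent earlier values forward.

# sketch only (not applied): forward-fill each country's missing values from its most
# recent earlier year in the multi-year data
filled = (df_years.sort_values(['Country', 'Year'])
                   .groupby('Country')
                   .ffill())
print("missing values before:", df_years.isnull().sum().sum(),
      "after the forward fill:", filled.isnull().sum().sum())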
#clusters for 2018 data
c18=clusters.loc[clusters.loc[:,'Year']==2018]
len(c18) # only 126 countries due to missing values in some rows
Cy=clusters # all the data for all years
# the number of clusters for all years
len(Cy)
Below I look at the actual proportions of countries in each of the 10 geographic regions, which are then plotted. The countplot shows the distribution of countries in the real dataset over the 10 regions. To avoid the axis labels overlapping, I followed advice from a blogpost at drawingfromdata.com [20] to rotate the labels, and I order the plot in descending order of the number of countries in each region.
# see how many countries in each Region
df18.Region.value_counts()
# find how many values in total
print(f"There are {len(df18.Region)} countries in the dataset. ")
# calculate the proportion of countries in each region
prob=(df18.Region.value_counts()/len(df18.Region)).round(3)
#print(f"These countries are split over 10 geographic regions as follows: \n{df18.Region.value_counts()/len(df18.Region)}.3f %", sep="\n")
# make sure they sum to 1.
prob.sum()
# set up the matplotlib plot
plt.rcParams["figure.figsize"] = (10,4)
# set the tick size
plt.rcParams["xtick.labelsize"] = 10
sns.set_palette("pastel")
# https://stackoverflow.com/q/46623583 order countplot by count.
chart= sns.countplot(x=df18.Region, order= df18['Region'].value_counts().index)
# rotate axes labels to avoid overlap - see <https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib>
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.title("Distribution of Countries in Geographic Regions in in the dataset 2018");
Here I will use the dataframe c18 created above, which has the clustered regions added; all other data in it is the same as the original dataframes. The dataframe clusters contains all the data from df_years but with the addition of the predicted clusters.
# see the top of the dataframe
clusters.head()
# find how many values in total
print(f"There are {len(c18.cluster_pred)} countries in the dataset. ")
# calculate the proportion of countries in each cluster region
CRprob=(c18.cluster_pred.value_counts()/len(c18.cluster_pred)).round(3)
print(f"These countries are split over the 3 cluster regions as follows: \n{CRprob}")
# make sure they sum to 1.
CRprob.sum()
sns.set_palette("pastel")
# https://stackoverflow.com/q/46623583 order countplot by count.
chart= sns.countplot(x=c18.cluster_pred, order= c18['cluster_pred'].value_counts().index)
# rotate axes labels to avoid overlap - see <https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib>
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.title("Distribution of Countries in Geographic Regions in in the dataset 2018");
Here I use the numpy.random.choice function to generate a random sample of regions from a one-dimensional array of made-up regions, using the proportions from the real dataset.
# makeup region names:
Sim_regions = ['SimReg0','SimReg1','SimReg2']
# assign countries to region based on proportions in actual dataset.
p=[c18.cluster_pred.value_counts()/len(c18.cluster_pred)]
# simulate regions using the probabilities based on proportion in actual dataset
CR= np.random.choice(Sim_regions, 136, p=[p[0][0],p[0][1],p[0][2]])
CR
Here I am just going to create some country names by appending a digit to the string 'Sim_Country_'. I don't have the imagination to start creating over 100 new countries!
# create a list of countries by appending number to 'country'
countries =[]
# use for loop and concatenate number to string
for i in range(136):
    countryname = 'Sim_Country_'+str(i)
    # append countryname to list of countries
    countries.append(countryname)
I now have countries and regions for the simulated dataset which I will add to a dataframe.
# create a dataframe to hold the simulated variables.
sim_df = pd.DataFrame(data={'Sim_Country':countries, 'Sim_Region':CR})
sim_df
# set the figure size
plt.rcParams["figure.figsize"] = (10,4)
sns.set_palette("muted")
# countplot of simulated region, order by counts
sns.countplot(x=sim_df.Sim_Region, order= sim_df['Sim_Region'].value_counts().index)
# add title
plt.title("Distribution of Simulated Regions in the Simulated dataset");
The countplot above shows the simulated regions in the simulated dataset, ordered by the count of the countries in each region. Comparing it to the actual dataset shows that the split looks good.
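As a rough check on that, the simulated region proportions can be put beside the actual cluster proportions. This is only a quick sketch, assuming the sim_df dataframe and the CRprob series created above (and that SimReg0-SimReg2 correspond to clusters 0-2).
# proportion of simulated countries in each simulated region
sim_prop = sim_df['Sim_Region'].value_counts(normalize=True).round(3)
print(sim_prop)
# actual proportion of countries in each cluster region, for comparison
print(CRprob)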
The Life Ladder scores and rankings are based on answers to the Cantril Ladder question from the Gallup World Poll where the participants rate their own lives on a scale of 0 to 10 with the best possible life for them being a 10 and the worst a 0. The rankings are from nationally representative samples for the years between 2016 and 2018 and are based entirely on the survey scores using weights to make the estimates representative.
The World Happiness Report shows how levels of GDP, life expectancy, generosity, social support, freedom, and corruption - contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. The report itself looks at life evaluations for the 2016 to 2018 period and the rankings across all countries in the study. Here are the statistics for the Life Ladder variable for the 2018 dataset.
Life Ladder / Life Satisfaction is a non-negative real number with statistics as follows. The minimum possible value is 0 and the maximum is 10, although the range of values in the actual dataset is smaller. The values in the dataset are national averages of the answers to the Cantril Ladder question, and the range of values varies across regions.
df18['Life_Satisfaction'].describe()
The distribution of Life Ladder is plotted below. The plot on the left shows the distribution for the dataframe containing 2018 data only and the plot on the right shows the distribution for the dataframe containing all years.
f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of the Life Ladder for 2018
sns.distplot(df18['Life_Satisfaction'], label="Life Ladder 2018", ax=axes[0])
# distribution of life ladder from the dataframe with the clusters. some missing values
sns.distplot(c18['Life_Satisfaction'], label="Life Ladder - 2018", ax=axes[0])
axes[0].set_title("Distribution of Life Ladder for 2018 only")
#axes[0].legend()
# kdeplot of the life ladder for all years in the extended dataset (WHR Table2.1)
sns.distplot(df_years['Life_Satisfaction'], label="Life Ladder - all years", ax=axes[1])
sns.distplot(c18['Life_Satisfaction'], label="Life Ladder - 2018",ax=axes[1])
axes[1].set_title("Distribution of Life Ladder for 2012 to 2018");
#axes[1].legend();
The distribution plots for the Life Ladder variable do suggest that it is normally distributed, but there are some tests that can be used to clarify this. A blogpost on tests for normality [21] at machinelearningmastery.com outlines how it is important, when working with a sample of data, to know whether to use parametric or nonparametric statistical methods. If the methods used assume a Gaussian distribution when that is not the case, then findings can be incorrect or misleading. In some cases the data may be close enough to normal to use parametric methods, or it can be transformed so that it is.
Parametric statistical methods assume that the data has a known and specific distribution, often a Gaussian distribution. If a data sample is not Gaussian, then the assumptions of parametric statistical tests are violated and nonparametric statistical methods must be used.[21]
Normality tests can be used to check if your data sample is from a Gaussian distribution or not.
Tests for normality include the Shapiro-Wilk Normality Test, the D'Agostino and Pearson's Test, and the Anderson-Darling Test.
The histograms above show the characteristic bell-shaped curve of the normal distribution and indicate that the Life Ladder variable is Gaussian, or approximately normally distributed.
The Quantile-Quantile (QQ) plot is another plot that can be used to check the distribution of a data sample. It generates its own sample of the idealised distribution that you want to compare your data to. The idealised samples are divided into groups and each data point in the sample is paired with a similar member from the idealised distribution at the same cumulative distribution. The resulting points are plotted as a scatter plot with the idealised values on the x-axis and the data sample on the y-axis. If the result is a straight line of dots on the diagonal from the bottom left to the top right this indicates a perfect match for the distribution, whereas dots that deviate far from the diagonal line indicate that the sample comes from a different distribution.
Here a QQ plot is created for the Life Ladder sample compared to a Gaussian distribution (the default).
# import the qqplot function
from statsmodels.graphics.gofplots import qqplot
# Life Ladder observations from the dataset. can use c18 dataframe or dfh
data = c18['Life_Satisfaction']
# plot a q-q plot, draw the standardised line
qqplot(data, line='s')
plt.title("QQ plot to test for normality of Life Ladder");
The QQ plot does seem to indicate that the Life Ladder is normally distributed; the scatter of points mostly follows the diagonal pattern expected for a sample from a Gaussian distribution. I will use some of the normality tests outlined in the blogpost on tests for normality [21]. These tests assume the sample was drawn from a Gaussian distribution and test this as the null hypothesis. A test statistic is calculated along with a p-value to interpret it. A threshold level called alpha is used to interpret the test, typically 0.05. If the p-value is less than or equal to alpha, the null hypothesis is rejected; if the p-value is greater than alpha, we fail to reject the null hypothesis. In general a larger p-value indicates that the sample is consistent with having been drawn from a Gaussian distribution. A result above 5% does not prove that the null hypothesis is true; it only means the sample is consistent with a Gaussian distribution given the evidence available.
# adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# import the shapiro test from scipy stats
from scipy.stats import shapiro
# calculate the test statistic and the p-value to interpret the test on the Life Ladder sample
stat, p = shapiro(df18['Life_Satisfaction'])
# interpret the test
alpha = 0.05
print('stat=%.3f, p=%.3f' % (stat, p))
if p > alpha:
    print('The sample of Life Satisfaction/ Life Ladder looks Gaussian (fail to reject H0)')
else:
    print('The sample of Life Ladder does not look Gaussian (reject H0)')
The D'Agostino and Pearson's test calculates the kurtosis and skewness of the data to see whether the distribution departs from the normal distribution. Skew is a measure of asymmetry in the data, while kurtosis measures how much of the distribution is in the tails.
# D'Agostino and Pearson's Test adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# import from scipy.stats
from scipy.stats import normaltest
# normality test
stat, p = normaltest(df18['Life_Satisfaction'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('The sample of Life Ladder looks Gaussian (fail to reject H0)')
else:
    print('The sample of Life Ladder does not look Gaussian (reject H0)')
This is a statistical test that can be used to evaluate whether a data sample comes from one of among many known data samples and can be used to check whether a data sample is normal. The test is a modified version of the Kolmogorov-Smirnov test [22], which is a nonparametric goodness-of-fit statistical test. The Anderson-Darling test returns a list of critical values rather than a single p-value. The test will check against the Gaussian distribution (dist='norm') by default.
# Anderson-Darling Test - adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
from scipy.stats import anderson
# normality test
result = anderson(df18['Life_Satisfaction'])
print('Statistic: %.3f' % result.statistic)
# check the test statistic against the critical value at each significance level
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < result.critical_values[i]:
        # fail to reject the null that the data is normal if the test statistic is less than the critical value.
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        # reject the null that the data is normal if the test statistic is greater than the critical value.
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))
The sample of Life Ladder for 2018 does appear to be normally distributed based on the above tests. Therefore I can go ahead and simulate data for this variable from a normal distribution using the sample mean and standard deviation.
I will now simulate 'Life Ladder' using the numpy.random.normal function. This takes 3 arguments: loc for the mean or centre of the distribution, scale for the spread or width of the distribution (the standard deviation), and size for the number of samples to draw from the distribution.
Without setting a seed the actual values will be different each time but the shape of the distribution should be the same.
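If reproducible results were wanted, a seed could be set before drawing. The short sketch below uses illustrative mean and standard deviation values rather than the dataset statistics.
# setting a seed makes the draws repeatable from run to run (illustrative values only)
np.random.seed(42)
sample_a = np.random.normal(5.4, 1.1, 5)
np.random.seed(42)
sample_b = np.random.normal(5.4, 1.1, 5)
print(np.allclose(sample_a, sample_b))  # True - the same seed gives identical draws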
print(f" The mean of the sample dataset for 2018 is {c18['Life_Satisfaction'].mean():.3f} ")
print(f" The standard deviation of the sample dataset for 2018 is {c18['Life_Satisfaction'].std():.3f} ")
c18.shape
df_years.loc[df_years.loc[:,'Year']==2016]
Creating subsets of the data for each year in order to see how the distribution varies from year to year.
# subset of the data for each year
c17= Cy.loc[Cy.loc[:,'Year']==2017]
c16= Cy.loc[Cy.loc[:,'Year']==2016]
c15= Cy.loc[Cy.loc[:,'Year']==2015]
c14= Cy.loc[Cy.loc[:,'Year']==2014]
# plot the distribution of life satisfaction variable for each of the years 2014 to 2018
sns.kdeplot(c14['Life_Satisfaction'], label="2014")
sns.kdeplot(c15['Life_Satisfaction'], label="2015")
sns.kdeplot(c16['Life_Satisfaction'], label="2016")
sns.kdeplot(c17['Life_Satisfaction'], label="2017")
sns.kdeplot(c18['Life_Satisfaction'], label="2018")
plt.title("Distribution of Life Satisfaction for each year 2014 to 2017");
Here some data is simulated for the Life Satisfaction/Ladder variable using the mean and standard deviation of the 2018 data. As the variable appears to be normally distributed it is reasonable to do this. However, as many of the other variables are not normally distributed at the dataset level but do look Gaussian at the clustered-region level, I will be simulating those variables at the cluster level.
sns.kdeplot(c18['Life_Satisfaction'], label="Life Satisfaction 2018", shade="True")
# simulate data based on statistics from the 2018 sample dataset
# simulate life ladder using sample mean and standard deviation
for i in range(10):
    x= np.random.normal(c18['Life_Satisfaction'].mean(),c18['Life_Satisfaction'].std(),130)
    sns.kdeplot(x)
# plot the distribution of the actual dataset
# plot the simulated data
plt.title("Actual data versus the Simulated data for Life Ladder 2018")
plt.legend();
The above plots of the distribution of Life_Satisfaction illustrate how the distribution changes slightly for each simulation due to the randomness.
Next plot the distribution for the 3 clustered regions.
c18.groupby('cluster_pred').count()
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Life_Satisfaction'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Life_Satisfaction'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Life_Satisfaction'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Life Ladder for the 3 clusters for 2017");
plt.legend();
# separate out the clustered regions for all years from 2011 to 2018
cluster0 =clusters.loc[clusters.loc[:,'cluster_pred']==0]
cluster1 =clusters.loc[clusters.loc[:,'cluster_pred']==1]
cluster2 =clusters.loc[clusters.loc[:,'cluster_pred']==2]
## separate out the clustered regions for 2018 data
c18_0 =c18.loc[c18.loc[:,'cluster_pred']==0]
c18_1 =c18.loc[c18.loc[:,'cluster_pred']==1]
c18_2 =c18.loc[c18.loc[:,'cluster_pred']==2]
The distribution of the Life Satisfaction (Life Ladder) variable looks quite normal when taken over all the countries in the dataset. When shown for each of the 3 clustered regions, the distributions are all still normal looking but with very different shapes and locations.
# cluster 0
print(f" The mean of Life Satisfaction for cluster 0 (all years) is {cluster0['Life_Satisfaction'].mean():.3f} ")
print(f" The standard deviation of Life Satisfaction for cluster 0 (all years) is {cluster0['Life_Satisfaction'].std():.3f} ")
c18.shape
Simulate Life Satisfaction for each cluster based on its statistics:
LLc0= np.random.normal(c18_0['Life_Satisfaction'].mean(),c18_0['Life_Satisfaction'].std(),31)
LLc1= np.random.normal(c18_1['Life_Satisfaction'].mean(),c18_1['Life_Satisfaction'].std(),39)
LLc2= np.random.normal(c18_2['Life_Satisfaction'].mean(),c18_2['Life_Satisfaction'].std(),56)
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(LLc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(LLc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(LLc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Life Satisfaction for 3 region clusters");
f,axes=plt.subplots(1,3, figsize=(13,3))
for i in range(5):
    sns.kdeplot(np.random.normal(c18_0['Life_Satisfaction'].mean(),c18_0['Life_Satisfaction'].std(),31), shade=True, ax=axes[0])
for i in range(5):
    sns.kdeplot(np.random.normal(c18_1['Life_Satisfaction'].mean(),c18_1['Life_Satisfaction'].std(),39), shade=True, ax=axes[1])
for i in range(5):
    sns.kdeplot(np.random.normal(c18_2['Life_Satisfaction'].mean(),c18_2['Life_Satisfaction'].std(),56), shade=True, ax=axes[2])
plt.suptitle("Distributions of 5 simulations for Life Satisfaction for each of the 3 groups")
The variable 'Social support' is the national average of the binary responses for each country to the Gallup World Poll question "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?".
The distribution from the dataset shows that it is left skewed. The scatterplots above showed that it is positively correlated with life satisfaction, income levels and healthy life expectancy. The boxplots below show that the median values fall into roughly 3 groups by regions with Western Europe, North America and Australia and New Zealand having the highest scores, while Southern Asia and Sub Saharan Africa have the lowest median scores and a wider spread.
Like the Life Ladder variable, the values are national averages of answers to questions in the Gallup World Poll. In this case the question had a binary answer, 0 or 1, and therefore the actual values in the dataset range between 0 and 1. Social Support is a non-negative real value. The statistics vary from region to region.
df18['Social_Support'].describe()
The distribution of Social Support is plotted below. The plot on the left shows the distribution for the dataframe containing 2018 data only and the plot on the right shows the distribution for the dataframe containing all years.
f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of the Social support for 2018
sns.distplot(df18['Social_Support'], label="Social Support 2018", ax=axes[0])
# distribution of social support from the dataframe with the clusters. some missing values
sns.distplot(c18['Social_Support'], label="Social Support - 2018", ax=axes[0])
axes[0].set_title("Distribution of Social Support for 2018 only")
#axes[0].legend()
# kdeplot of the Social Support for all years in the extended dataset (WHR Table2.1)
sns.distplot(df_years['Social_Support'].dropna(), label="Social_Support - all years", ax=axes[1])
sns.distplot(c18['Social_Support'], label="Social_Support - 2018",ax=axes[1])
axes[1].set_title("Distribution of Social_Support for 2012 to 2018");
#axes[1].legend();
# plot the distribution of the social support variable for each of the years 2014 to 2018
sns.kdeplot(c14['Social_Support'], label="2014")
sns.kdeplot(c15['Social_Support'], label="2015")
sns.kdeplot(c16['Social_Support'], label="2016")
sns.kdeplot(c17['Social_Support'], label="2017")
sns.kdeplot(c18['Social_Support'], label="2018")
plt.title("Distribution of Social_Support' for each year 2014 to 2017");
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Social_Support'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Social_Support'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Social_Support'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Social Support variable for the 3 clusters for 2017");
plt.legend();
The distribution of Social Support for all the countries in 2018 is a left-skewed distribution with a long tail to the left. When this is broken down by cluster groups of regions, the distribution of each cluster looks more normal shaped, although the centres of the distributions and the spreads vary widely: each cluster region has social support values centred around a different value.
c18.groupby('cluster_pred').mean()
SSc0= np.random.normal(c18_0['Social_Support'].mean(),c18_0['Social_Support'].std(),31)
SSc1= np.random.normal(c18_1['Social_Support'].mean(),c18_1['Social_Support'].std(),39)
SSc2= np.random.normal(c18_2['Social_Support'].mean(),c18_2['Social_Support'].std(),56)
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# social support distribution for cluster region 0
sns.kdeplot(SSc0,color="y", shade=True)
# social support distribution for cluster region 1
sns.kdeplot(SSc1, color="skyblue", shade=True)
# social support distribution for cluster region 2
sns.kdeplot(SSc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Social Support for 3 region clusters");
When I first looked at simulating the social support variable I considered non-parametric ways of working with the data such as the bootstrap resampling method. Bootstrapping is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples. Samples are constructed by drawing observations from a large data sample one at a time and then returning the observation so that it could be drawn again. Therefore any given observation could be included in the sample more than once while some observations might never be drawn. I referred to a blogpost on Machinelearningmastery.com[23] which outlines the steps to implement it using the scikit-learn resample function[24] which takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling.
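As a minimal sketch of that resample function, using a few illustrative values rather than the project data:
# bootstrap resample with scikit-learn's resample function, as described in [23][24]
from sklearn.utils import resample
data = [0.55, 0.61, 0.72, 0.80, 0.91, 0.95]  # illustrative values only
boot = resample(data, replace=True, n_samples=len(data), random_state=1)
print(boot)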
Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution function of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed dataset (and of equal size to the observed dataset). The basic idea of bootstrapping is that inference about a population from sample data can be modelled by resampling the sample data and performing inference about a sample from resampled data. As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference of the 'true' sample from resampled data (resampled → sample) is measurable. (Wikipedia article on Bootstrapping [25])
The process for building one sample is: first choose the size of the bootstrap sample; then, while the sample is smaller than that size, randomly select an observation from the original dataset and add it to the bootstrap sample (the observation is returned so it can be drawn again).
Based on the information above I use sampling with replacement to simulate the Social support variable, using a sample size the same as the original dataset, which is 136 observations. The numpy.random.choice function can be used for this purpose. The mean of the Social support variable in the dataset can be considered a single estimate of the mean of the population of social support, while the standard deviation is an estimate of its variability. The simplest bootstrap method would involve taking the original data set of N (136) Social support values and sampling from it to form a new sample (the 'resample' or bootstrap sample) that is also of size 136. If the bootstrap sample is formed using sampling with replacement from a large enough original sample, then there should be very little chance that it will be exactly the same as the original sample. This process is repeated many times (thousands) and the mean computed for each bootstrap sample to get the bootstrap estimates, which can then be plotted on a histogram that is considered an estimate of the shape of the distribution.
Here I use a loop to draw multiple random samples with replacement from the dataset. Additionally the mean of each sample can be calculated and the means from the different samples plotted to show their distribution, but for this project I just need a simulated data set.
# using the Social support data from the dataset.
social=df18['Social_Support']
# create a list to store the means of the samples, set the number of simulations
mean_social_sim, sims = [], 20
# use a loop to create the samples (sims is set to 20 above - larger numbers take much longer to plot)
# plot the distribution of social support variable from the dataset, add title
sns.kdeplot(social, label="Actual Data", shade=True)
for _ in range(sims):
    # draw a random sample from social with replacement and store it in social_sample
    social_sample=np.random.choice(social, replace=True, size=len(social))
    # plot the distribution of each sample
    sns.kdeplot(social_sample)
    # add title
    plt.title("Simulating using random sample with replacement")
    social_mean = np.mean(social_sample)
    # append the mean of each sample to mean_social
    mean_social_sim.append(social_mean)
The main use of the bootstrap is making inferences about a population parameter from sample data. For the purposes of this project I just need to simulate a single dataset. However, calculating the bootstrap means and comparing them to the dataset I am attempting to replicate shows that it is a suitable method here when the data does not follow a particular distribution.
df18['Social_Support'].describe()
np.mean(mean_social_sim)
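As a rough check, a percentile interval can also be computed from the bootstrap means collected above. This short sketch assumes the mean_social_sim list from the loop; with only 20 simulations the interval is indicative at best.
# 95% percentile interval of the bootstrap means (more resamples would give a steadier estimate)
lower, upper = np.percentile(mean_social_sim, [2.5, 97.5])
print(f"Bootstrap mean {np.mean(mean_social_sim):.3f}, 95% interval {lower:.3f} to {upper:.3f}")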
The GDP per capita figures in the World Happiness Report dataset are in purchasing power parity at constant 2011 international dollar prices, taken mainly from the World Development Indicators in 2018. Purchasing power parity is necessary when comparing GDP per capita between countries, which is what the World Happiness Report seeks to do; nominal GDP would be fine when looking at a single country. The log of the GDP figures is taken.
Per capita GDP is the Total Gross Domestic Product for a country divided by its population and breaks down a country's GDP per person. As the World Happiness Report states this is considered a universal measure for gauging the prosperity of nations.
The earlier plots showed that the distribution of log GDP per capita in the dataset is not normally distributed. Per capita GDP generally has a unimodal but skewed distribution, and there are many regional variations in income across the world. The distribution of Log GDP per capita appears somewhat left skewed.
The scatterplots above showed that it is positively correlated with life satisfaction, social support and healthy life expectancy.
The log of GDP per capita is used to represent the Gross Domestic Product per capita. It is a non-negative real value. Because the log of the GDP per capita figures is used, the range is between 6 and 12, although this varies from region to region.
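A quick illustration of why the logged values fall roughly in that range, using a few made-up per capita GDP figures (in international dollars):
# natural log of some illustrative GDP per capita values
for gdp in [500, 5000, 50000, 120000]:
    print(f"GDP per capita {gdp:>7}: log = {np.log(gdp):.2f}")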
df['Log_GDP_per_cap'].describe()
f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of social support from the dataframe with the clusters. some missing values
sns.distplot(c18['Log_GDP_per_cap'].dropna(), label="Log_GDP_per_cap - 2018", ax=axes[0])
axes[0].set_title("Distribution of Log_GDP_per_cap for 2018 only")
#axes[0].legend()
# kdeplot of the Social Support for all years in the extended dataset (WHR Table2.1)
sns.distplot(clusters['Log_GDP_per_cap'].dropna(), label="Log_GDP_per_cap - all years", ax=axes[1])
axes[1].set_title("Distribution of Log_GDP_per_cap for 2012 to 2018");
#axes[1].legend();
# plot the distribution of Log GDP per capita variable for each of the years 2014 to 2018
sns.kdeplot(c14['Log_GDP_per_cap'], label="2014")
sns.kdeplot(c15['Log_GDP_per_cap'], label="2015")
sns.kdeplot(c16['Log_GDP_per_cap'], label="2016")
sns.kdeplot(c17['Log_GDP_per_cap'], label="2017")
sns.kdeplot(c18['Log_GDP_per_cap'], label="2018")
plt.title("Distribution of Log_GDP_per_cap' for each year 2014 to 2017");
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Log_GDP_per_cap'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Log_GDP_per_cap'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Log_GDP_per_cap'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Log_GDP_per_cap variable for the 3 clusters for 2017");
plt.legend();
When the distribution of Log GDP per capita is broken down by cluster groups of regions, the distribution of each cluster looks more normal shaped although the centres of the distribution and the spread vary widely as with the other variables.
c18.groupby('cluster_pred').mean()
GDPc0= np.random.normal(c18_0['Log_GDP_per_cap'].mean(),c18_0['Log_GDP_per_cap'].std(),31)
GDPc1= np.random.normal(c18_1['Log_GDP_per_cap'].mean(),c18_1['Log_GDP_per_cap'].std(),39)
GDPc2= np.random.normal(c18_2['Log_GDP_per_cap'].mean(),c18_2['Log_GDP_per_cap'].std(),56)
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# log GDP per capita distribution for cluster region 0
sns.kdeplot(GDPc0,color="y", shade=True)
# log GDP per capita distribution for cluster region 1
sns.kdeplot(GDPc1, color="skyblue", shade=True)
# log GDP per capita distribution for cluster region 2
sns.kdeplot(GDPc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Log GDP per capita for 3 region clusters");
According to the World Happiness Report, healthy life expectancies at birth are based on data extracted from the World Health Organization's (WHO) Global Health Observatory data repository. Some interpolation and extrapolation was used to get the data for the years covered in the 2018 report. As some countries were not covered in the WHO data, other sources were used by the researchers.
The values are non-negative real numbers between 48 and 77 years but this varies from year to year as the statistics showed.
df18['Life_Expectancy'].describe()
c18.columns
f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of Life_Expectancy from the dataframe with the clusters. some missing values
sns.distplot(c18['Life_Expectancy'].dropna(), label="Life_Expectancy - 2018", ax=axes[0])
axes[0].set_title("Distribution of Life_Expectancy for 2018 only")
#axes[0].legend()
# kdeplot of the Life_Expectancy from 2012 to 2018)
sns.distplot(clusters['Life_Expectancy'].dropna(), label="Life_Expectancy - all years", ax=axes[1])
axes[1].set_title("Distribution of Life_Expectancy for 2012 to 2018");
#axes[1].legend();
# plot the distribution of Life_Expectancy variable for each of the years 2014 to 2018
sns.kdeplot(c14['Life_Expectancy'], label="2014")
sns.kdeplot(c15['Life_Expectancy'], label="2015")
sns.kdeplot(c16['Life_Expectancy'], label="2016")
sns.kdeplot(c17['Life_Expectancy'], label="2017")
sns.kdeplot(c18['Life_Expectancy'], label="2018")
plt.title("Distribution of Life_Expectancy' for each year 2014 to 2017");
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Life_Expectancy'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Life_Expectancy'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Life_Expectancy'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Life_Expectancy variable for the 3 clusters for 2018");
plt.legend();
Looking at the Life Expectancy variable by cluster group, the distribution looks normal for cluster region 1 while the other two groups are less so. There seem to be two peaks in the distribution for cluster group 0, which probably needs to be broken out further.
c18.groupby('cluster_pred').mean()
LEc0= np.random.normal(c18_0['Life_Expectancy'].mean(),c18_0['Life_Expectancy'].std(),31)
LEc1= np.random.normal(c18_1['Life_Expectancy'].mean(),c18_1['Life_Expectancy'].std(),39)
LEc2= np.random.normal(c18_2['Life_Expectancy'].mean(),c18_2['Life_Expectancy'].std(),56)
# set up the matplotlib figure
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life expectancy distribution for cluster region 0
sns.kdeplot(LEc0,color="y", shade=True)
# life expectancy distribution for cluster region 1
sns.kdeplot(LEc1, color="skyblue", shade=True)
# life expectancy distribution for cluster region 2
sns.kdeplot(LEc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Life_Expectancy for 3 region clusters");
Now that the individual variables have been simulated, the next step is to assemble the variables into a dataframe and ensure that the overall distributions and relationships are similar to an actual dataset. Because of the variations between the three groups I will create a dataframe for each group and then concatenate the three dataframes together.
Here I create 31 countries and add the simulated data created earlier using numpy.random functions.
# create a list of countries by appending number to 'country'
G0_countries =[]
# use for loop and concatenate number to string
for i in range(31):
    countryname = 'Sim_Country_'+str(i)
    # append countryname to list of countries
    G0_countries.append(countryname)
# create a dataframe and add the columns
G0=pd.DataFrame()
G0['Country']=G0_countries
# add columns based on simulations
G0['Life_Satisfaction']=LLc0
G0['Log_GDP_per_cap']=GDPc0
G0['Social_Support']=SSc0
G0['Life_Expectancy']=LEc0
# add group column
G0['Group']='Group0'
G0
Here I create a dataframe for the 39 group 1 simulated data points.
# create a list of countries by appending number to 'country'
G1_countries =[]
# use for loop and concatenate number to string
for i in range(39):
    countryname = 'Sim_Country_'+str(i)
    # append countryname to list of countries
    G1_countries.append(countryname)
# create a dataframe and add the columns
G1=pd.DataFrame()
G1['Country']=G1_countries
# add columns based on simulations
G1['Life_Satisfaction']=LLc1
G1['Log_GDP_per_cap']=GDPc1
G1['Social_Support']=SSc1
G1['Life_Expectancy']=LEc1
# add group column
G1['Group']='Group1'
G1
Here I create a dataframe for the 56 group 2 simulated data points.
# create a list of countries by appending number to 'country'
G2_countries =[]
# use for loop and concatenate number to string
for i in range(56):
    countryname = 'Sim_Country_'+str(i)
    # append countryname to list of countries
    G2_countries.append(countryname)
# create a dataframe and add the columns
G2=pd.DataFrame()
G2['Country']=G2_countries
# add columns based on simulations
G2['Life_Satisfaction']=LLc2
G2['Log_GDP_per_cap']=GDPc2
G2['Social_Support']=SSc2
G2['Life_Expectancy']=LEc2
# add group column
G2['Group']='Group2'
G2
# using the pandas concat function to add the dataframes together.
result=pd.concat([G0,G1,G2])
result.head()
sns.pairplot(result, hue='Group', palette="colorblind");
data2018 =c18.drop(['Year', 'RegionCode'], axis=1)
data2018.groupby('cluster_pred').count()
result.groupby('Group').describe()
data2018
data2018['cluster_pred']=data2018["cluster_pred"].replace(0,'Group2')
data2018['cluster_pred']=data2018["cluster_pred"].replace(1,'Group0')
data2018['cluster_pred']=data2018["cluster_pred"].replace(2,'Group1')
sns.pairplot(data2018, hue='cluster_pred', palette="bright", hue_order=['Group0','Group1','Group2']);
data2018.groupby('cluster_pred').mean()
result.groupby('Group').mean()
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Life_Satisfaction'], ax=axes[0,0], shade=True, label="simulated")
sns.kdeplot(data2018['Life_Satisfaction'], ax=axes[0,0], shade=True,label="actual")
# set axes title
axes[0,0].set_title("Distribution of Life Ladder")
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Log_GDP_per_cap'], ax=axes[0,1], shade=True, label="simulated")
sns.kdeplot(data2018['Log_GDP_per_cap'].dropna(), ax=axes[0,1], shade=True, label="actual")
# axes title
axes[0,1].set_title("Distribution of Log GDP per capita")
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Social_Support'].dropna(), ax=axes[1,0], shade=True, label="simulated")
sns.kdeplot(data2018['Social_Support'].dropna(), ax=axes[1,0], shade=True, label="actual")
axes[1,0].set_title("Distribution of Social Support");
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Life_Expectancy'], ax=axes[1,1],shade=True, label="simulated");
sns.kdeplot(data2018['Life_Expectancy'], ax=axes[1,1],shade=True, label="actual");
axes[1,1].set_title("Distribution of Life expectancy");
plt.tight_layout();
They are not perfect matches but they will do for now! Every time a numpy.random function is run it will produce different numbers from the same distribution unless the seed is set. Now I will just check the correlations and any relationships in the data.
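A quick way to compare the relationships is to put the correlation matrices of the simulated and the actual data side by side. This is a short sketch assuming the result and data2018 dataframes above.
# correlations of the four simulated variables versus the actual 2018 data
num_cols = ['Life_Satisfaction', 'Log_GDP_per_cap', 'Social_Support', 'Life_Expectancy']
print(result[num_cols].corr().round(2))
print(data2018[num_cols].corr().round(2))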
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,4))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=result, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=result, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=result, ax=axes[2])
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,4))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=data2018, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=data2018, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=data2018, ax=axes[2])
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.distplot(result['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Simulated Life Ladder");
sns.distplot(result['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Simulated Income");
sns.distplot(result['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Simulated Social Support");
sns.distplot(result['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Simulated Life expectancy");
plt.tight_layout();
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.distplot(data2018['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Actual Life Ladder");
sns.distplot(data2018['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Actual Income");
sns.distplot(data2018['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Actual Social Support");
sns.distplot(data2018['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Actual Life expectancy");
plt.tight_layout();
data2018
The final simulated dataframe containing the 6 variables is called result.
result
An alternative way to simulate the dataset could be to use the multivariate normal distribution function. While the variables are not all normally distributed at the dataset level, when broken down by region they do appear to be Gaussian, with each group's distribution having a very different centre and spread from the others.
There is a numpy.random.multivariate_normal function that can generate correlated data.
I found a thread at Stack Overflow describing how to use the multivariate_normal function to [generate correlated data](https://stackoverflow.com/a/16026231) [26]. Here I filter the data for each group and calculate the covariance of the first four variables.
data2018.groupby('cluster_pred').describe()
# select clusters for each group
group0 =data2018.loc[data2018.cluster_pred=='Group0']
group1 =data2018.loc[data2018.cluster_pred=='Group1']
group2 =data2018.loc[data2018.cluster_pred=='Group2']
# get the covariances for each cluster group
group0cov = group0.iloc[:,:4].cov()
group1cov = group1.iloc[:,:4].cov()
group2cov = group2.iloc[:,:4].cov()
# get the mean of the variables for each group:
group0mean=group0.iloc[:,:4].mean()
group1mean=group1.iloc[:,:4].mean()
group2mean=group2.iloc[:,:4].mean()
# use number of samples based on the number of countries in the group
num_samples=31
# using group0 means of each variable
mu=group0mean
# using group0 covariances
r=group0cov
# generate the random samples using `multivariate` function
g0=np.random.multivariate_normal(mu,r,size=num_samples)
# use number of samples based on the number of countries in the group
num_samples=39
# using group1 means of each variable
mu=group1mean
# using group1 covariances
r=group1cov
# generate the random samples using `multivariate` function
g1=np.random.multivariate_normal(mu,r,size=num_samples)
g1.shape
# use number of samples based on the number of countries in the group
num_samples=56
# using group2 means of each variable
mu=group2mean
# using group2 covariances
r=group2cov
# generate the random samples using `multivariate` function
g2=np.random.multivariate_normal(mu,r,size=num_samples)
g2.shape
# create dataframe from the multivariate normal samples for group 0
MV0 =pd.DataFrame(g0)
# using the country names simulated earlier
MV0['Country']=G0_countries
# add a group column
MV0['Group']='group0'
# create dataframe
MV1 =pd.DataFrame(g1)
# using the country names simulated earlier
MV1['Country']=G1_countries
# add a group column
MV1['Group']='group1'
# create dataframe
MV2 =pd.DataFrame(g2)
# using the country names simulated earlier
MV2['Country']=G2_countries
# add a group column
MV2['Group']='group2'
# concat the three dataframes
MV = pd.concat([MV0,MV1,MV2])
# get the column names
MV.columns
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html#pandas-dataframe-rename
# rename the columns
MV=MV.rename(columns={0: "Life_Satisfaction", 1: "Log_GDP_per_cap",2:"Social_Support",3:"Life_Expectancy"})
MV
MV.corr()
data2018.iloc[:,:4].corr()
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=MV, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=MV, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=MV, ax=axes[2])
# change the axis limited for Life expectancy
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90);
The aim of the project was to create a data set by simulating a real-world phenomenon of our choosing, and then, rather than collect data related to the phenomenon, to model and synthesise such data using Python and the numpy.random package.
In this notebook I first analysed an actual dataset for the phenomenon of World Happiness in order to understand the type of variables and the type of distributions they were likely to have come from, and also to see how they were related to each other.
I then simulated the dataset, mainly using the numpy.random package but also exploring some functionality from the scikit-learn package.
Finally I compared the results of the simulation to a real world dataset.
I believe the synthesised dataset closely matches the actual dataset. The results will change each time the code is run as the random seed is not set. In reality data samples would also vary from sample to sample due to randomness.
The dataset was a little more complicated than it first appeared, given the regional variation and also the variation within regions. Since the number of data points for a dataset such as this is limited, the number of data points that could be simulated is also limited, but I am satisfied with the results. Overall this project was a very good learning experience. It was also very eye-opening to see the levels of inequality across the globe.
The last section below lists the references used for this project.
pd.options.display.max_rows=8
# change the max numbers of rows to display the simulated dataset.
pd.options.display.max_rows=130
result
https://www.python.org/
https://seaborn.pydata.org/tutorial.html
https://pandas.pydata.org
https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html
[7] John F. Helliwell, Haifang Huang and Shun Wang. The Distribution of World Happiness
[10] John F. Helliwell, Haifang Huang and Shun Wang Statistical Appendix 1 for Chapter 2 of World Happiness Report 2019
[11] Website: The Balance
<https://economics.mit.edu>
Prof. Sharyn O’Halloran. Lecture 1: Data and Models. http://www.columbia.edu
https://ec.europa.eu/eurostat
www.who.int
https://chrisalbon.com
https://en.wikipedia.org
[28] Crump, Matthew J. C. (2017). Programming for Psychologists: Data Creation and Analysis (Version 1.1). https://crumplab.github.io/programmingforpsych/
[29] Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning. Online pdf corrected 7th edition