Having collected some data for the real-world phenomenon of the World Happiness Scores, I looked at the type of variables involved, their distributions and the relationships between the variables. In this section I will look more closely at the distributions of each of the variables and then simulate some data that hopefully will have properties that match the actual variables in the real sample dataset.

The aim of the World Happiness Scores is to see what countries or regions rank the highest in overall happiness and how each of the six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contribute to happiness levels. The World Happiness Report researchers studied how the different factors contribute to the happiness scores and the extent of each effect. While these factors have no impact on the total score reported for each country, they were analysed to explain why some countries rank higher than others and they describe the extent to which these factors contribute in evaluating the happiness in each country.

As the differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score, this is what I focus on in this project.

Therefore I will simulate the following variables:

Regions and Countries
Life Ladder / Life Satisfaction
Log GDP per capita
Social Support
Healthy Life Expectancy at birth

Simulate Regions and countries.

There are countries from 10 geographic regions in the sample dataset for 2018. While the region was not in the actual dataset for 2018, I added it in from a dataset for a previous year (2015). This allowed me to see the how the values for each variable in the dataset differ by region.

Look at the clustered groups for 2018:

The rough clustering earlier divided the countries into 3 clustered regions. Of the clustered countries, 10 values were lost due to missing values. If I had more time then the missing values could be imputed based on previous years but for now there is at least 100 which is sufficient for this project.

#clusters for 2018 data
c18=clusters.loc[clusters.loc[:,'Year']==2018]
len(c18) # only 126 countries due to missing values in some rows
Cy=clusters # all the data for all years 
# the number of clusters for all years
len(Cy)
> 1094

Look at the distribution of countries across the geographic regions.

Below I look at the actual proportions of countries in each of the 10 geographic regions which are then plotted. The countplot shows the distribution of countries in the real dataset over the 10 regions. To avoid the axes labels overlapping, I have followed advice on a blogpost drawing from data[20] to rotate the axes labels. Also to show the plot by descending order of the region with the most countries.

# see how many countries in each Region
df18.Region.value_counts()
# find how many values in total
print(f"There are {len(df18.Region)} countries in the dataset. ")
# calculate the proportion of countries in each region
prob=(df18.Region.value_counts()/len(df18.Region)).round(3)
#print(f"These countries are split over 10 geographic regions as follows: \n{df18.Region.value_counts()/len(df18.Region)}.3f %", sep="\n")
# make sure they sum to 1. 
prob.sum()

>1.0
>There are 136 countries in the dataset.

# set up the matplotlib plot
plt.rcParams["figure.figsize"] = (10,4)
# set the tick size
plt.rcParams["xtick.labelsize"] = 10
sns.set_palette("pastel")
# https://stackoverflow.com/q/46623583 order countplot by count.
chart= sns.countplot(x=df18.Region, order= df18['Region'].value_counts().index)
# rotate axes labels to avoid overlap - see <https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib>
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.title("Distribution of Countries in Geographic Regions in in the dataset 2018");

png

Distribution of countries by clustered region.

Here I will use the dataframe c18 created above which has the clustered regions added. All other data in it is the same as the original dataframes. The dataframe clusters contains all the data from df_yearss but with the addition of the predicted clusters.

# see the top of the dataframe
clusters.head()

	Year	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	RegionCode	cluster_pred
3	2011	3.831719	7.415019	0.521104	51.919998	7	1
4	2012	3.782938	7.517126	0.520637	52.240002	7	1
5	2013	3.572100	7.522238	0.483552	52.560001	7	1
6	2014	3.130896	7.516955	0.525568	52.880001	7	1
7	2015	3.982855	7.500539	0.528597	53.200001	7	1

# find how many values in total
print(f"There are {len(c18.cluster_pred)} countries in the dataset. ")
# calculate the proportion of countries in each region
CRprob=(c18.cluster_pred.value_counts()/len(c18.cluster_pred)).round(3)
print(f"These countries are split over 10 geographic regions as follows: \n{c18.cluster_pred.value_counts()/len(c18.cluster_pred)}.3f %", sep="\n")
# mnake sure they sum to 1. 
CRprob.sum()

  There are 126 countries in the dataset. 
  These countries are split over 10 geographic regions as follows: 
  0    0.444444
  2    0.309524
  1    0.246032
  Name: cluster_pred, dtype: float64.3f %

  1.0

Plot the distribution of countries in the clustered groups:

sns.set_palette("pastel")
# https://stackoverflow.com/q/46623583 order countplot by count.
chart= sns.countplot(x=c18.cluster_pred, order= c18['cluster_pred'].value_counts().index)
# rotate axes labels to avoid overlap - see <https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib>
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.title("Distribution of Countries in Geographic Regions in in the dataset 2018");

png

Simulate some geographic regions.

Here I use the numpy.random.choice function to generates a random sample of regions from a one dimensional array of made-up regions using the proportions from the real dataset.

# make up some region names:
Sim_regions = ['SimReg0','SimReg1','SimReg2']
# assign countries to region based on proportions in actual dataset.
p=[c18.cluster_pred.value_counts()/len(c18.cluster_pred)]
# simulate regions using the probabilities based on proportion in actual dataset
CR= np.random.choice(Sim_regions, 136, p=[p[0][0],p[0][1],p[0][2]])
CR

array(['SimReg2', 'SimReg1', 'SimReg2', ..., 'SimReg1', 'SimReg0',
           'SimReg0'], dtype='<U7')

Make up some country names:

Here I am just going to create some country names by appending a digit to the string ‘Country_’. I don’t have the imagination to start creating over 100 new countries!

# create a list of countries by appending number to 'country'
countries =[]
# use for loop and concatenate number to string
for i in range(136):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    countries.append(countryname)

Create a dataframe to hold the simulated data:

I now have countries and regions for the simulated dataset which I will add to a dataframe.

# create a dataframe to hold the simulated variables.
sim_df = pd.DataFrame(data={'Sim_Country':countries, 'Sim_Region':CR})
sim_df

	Sim_Country	Sim_Region
0	Sim_Country_0	SimReg2
1	Sim_Country_1	SimReg1
2	Sim_Country_2	SimReg2
3	Sim_Country_3	SimReg0
...	...	...
132	Sim_Country_132	SimReg1
133	Sim_Country_133	SimReg1
134	Sim_Country_134	SimReg0
135	Sim_Country_135	SimReg0

136 rows × 2 columns

Plot of the distribution of simulated regions:

# set the figure size
plt.rcParams["figure.figsize"] = (10,4)
sns.set_palette("muted")
# countplot of simulated region, order by counts
sns.countplot(x=sim_df.Sim_Region, order= sim_df['Sim_Region'].value_counts().index)
# add title
plt.title("Distribution of Simulated Regions in the Simulated dataset");

png

The countplot above shows the simulated regions in the simulated dataset, ordered by the count of the countries in each region. Comparing it to the actual dataset shows that the split looks good.

Simulate Life Satisfaction / Life Ladder variable:

The Life Ladder scores and rankings are based on answers to the Cantril Ladder question from the Gallup World Poll where the participants rate their own lives on a scale of 0 to 10 with the best possible life for them being a 10 and the worst a 0. The rankings are from nationally representative samples for the years between 2016 and 2018 and are based entirely on the survey scores using weights to make the estimates representative.

The World Happiness Report shows how levels of GDP, life expectancy, generosity, social support, freedom, and corruption - contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. The report itself looks at life evaluations for the 2016 to 2018 period and the rankings across all countries in the study. Here are the statistics for the Life Ladder variable for the 2018 dataset.

Life Ladder / Life Satisfaction is a non-negative real number with statistics as follows. The minimum possible value is 0 and the maximum is 10 although the range of values in the actual dataset is smaller. The values in the dataset are national averages of the answers to Cantril Ladder question. The range of values vary across regions.

Descriptive statistics of the Life Ladder variable in the dataset for 2018:

df18['Life_Satisfaction'].describe()

count    136.000000
mean       5.502134
std        1.103461
min        2.694303
25%        4.721326
50%        5.468088
75%        6.277691
max        7.858107
Name: Life_Satisfaction, dtype: float64

Plot of the distribution of Life Ladder for 2018:

The distribution of Life Ladder is plotted below. The plot on the left shows the distribution for the dataframe containing 2018 data only and the plot on the right shows the distribution for the dataframe containing all years.

f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of the Life Ladder for 2018
sns.distplot(df18['Life_Satisfaction'], label="Life Ladder 2018", ax=axes[0])
# distribution of life ladder from the dataframe with the clusters. some missing values
sns.distplot(c18['Life_Satisfaction'], label="Life Ladder - 2018", ax=axes[0])
axes[0].set_title("Distribution of Life Ladder for 2018 only")
#axes[0].legend()
# kdeplot of the life ladder for all years in the extended dataset (WHR Table2.1)
sns.distplot(df_years['Life_Satisfaction'], label="Life Ladder - all years", ax=axes[1])
sns.distplot(c18['Life_Satisfaction'], label="Life Ladder - 2018",ax=axes[1])
axes[1].set_title("Distribution of Life Ladder for 2012 to 2018");
#axes[1].legend();

png

The distribution plots for the Life Ladder variable does suggest that it is normally distributed but there are some tests that can be used to clarify this. A blogpost on tests for normality[21] at machinelearningmastery.com outlines how it is important when working with a sample of data to know whether to use parametric or nonparametric statistical methods. If methods used assume a Gaussian distribution when it is not the case then findings can be incorrect or misleading. In some cases it is enough to assume the data is normal enough to use parametric methods or to transform the data to be normal enough.

Parametric statistical methods assume that the data has a known and specific distribution, often a Gaussian distribution. If a data sample is not Gaussian, then the assumptions of parametric statistical tests are violated and nonparametric statistical methods must be used.[21]

Normality tests can be used to check if your data sample is from a Gaussian distribution or not.

Statistical tests calculate statistics on the data to quantify how likely it is that the data was drawn from a normal distribution.
Graphical methods plot the data to qualitatively evaluate if the data looks Gaussian.

Tests for normality include the Shapiro_Wilk Normality Test, the D’Agostino and Pearson’s Test, the Anderson-Darling Test.

The histograms above show the characteristic bell-shape curve of the normal distribution and indicates that the Life Ladder variable is gaussian or approximately normally distributed.

The Quantile-Quantile Plot (QQ plot) is another plot that can be used to check if the distribution of a data sample. It generates its own sample of the idealised distribution that you want to compare your sample of data to. The idealised samples are divided into groups and each data point in the sample is paired with a similar member from the idealised distribution at the same cumulative distribution.

A scatterplot is drawn with the idealised values on the x-axis and the data sample on the y-axis. The resulting points are plotted as a scatter plot with the idealized value on the x-axis and the data sample on the y-axis. If the result is a straight line of dots on the diagonal from the bottom left to the top right this indicates a perfect match for the distribution whereas if the dots deviate far from the diagonal line.

QQ plot of Life Ladder:

Here a QQ plot is created for the Life Ladder sample compared to a Gaussian distribution (the default).

# import the qqplot function 
from statsmodels.graphics.gofplots import qqplot
# Life Ladder observations from the dataset. can use c18 dataframe or dfh
data = c18['Life_Satisfaction']
# plot a q-q plot, draw the standardised line
qqplot(data, line='s') 
plt.title("QQ plot to test for normality of Life Ladder");

png

Tests for normality:

The QQ plot does seem to indicate to me that the Life Ladder is normally distributed. The scatter plots of points do mostly follow the diagonal pattern for a sample from a Gaussian distribution. I will use some of the normality tests as outlined on the blogpost on tests for normality[21]. These tests assume the sample was drawn from a Gaussian distribution and test this as the null hypothesis. A test-statistic is calculated and a p-value to interpret the test statistic. A threshold level called alpha is used to interpret the test. This is typically 0.05. If the p-value is less than or equal to alpha, then reject the null hypothesis. If the p-value is greater than alpha then fail to reject the null hypothesis. In general a larger p-value indicates that the sample was likely to have been drawn from a Gaussian distribution. A result above 5% doesn’t mean the null hypothesis is true but that it is very likely true given the evidence available.

Shapiro-Wilk Normality Test

# adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# import the shapiro test from scipy stats
from scipy.stats import shapiro
# calculate the test statistic and the p-value to interpret the test on the Life Ladder sample
stat, p = shapiro(df18['Life_Satisfaction'])
# interpret the test
alpha= 0.05
print('stat=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('The sample of Life Satisfaction/ Life Ladder looks Gaussian (fail to reject H0)')
else:
    print('The sample of Life Ladder does not look Gaussian (reject H0)')

  stat=0.991, p=0.520
  The sample of Life Satisfaction/ Life Ladder looks Gaussian (fail to reject H0)

D’Agostino’s K^2 Test

This test calculates the kurtosis and skewness of the data to see if the data distribution is not normal like. The skew is a measure of asymmetry in the data while hurtosis how much of the distribution is in the tails.

# D'Agostino and Pearson's Test adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# import from scipy.stats
from scipy.stats import normaltest
# normality test
stat, p = normaltest(df18['Life_Satisfaction'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('The sample of Life Ladder looks Gaussian (fail to reject H0)')
else:
    print('The sample of Life Ladder does not look Gaussian (reject H0)')

Statistics=2.435, p=0.296
The sample of Life Ladder looks Gaussian (fail to reject H0)

Anderson-Darling Test:

This is a statistical test that can be used to evaluate whether a data sample comes from one of among many known data samples adn can be used to check whether a data sample is normal. The test is a modified version of the Kolmogorov-Smirnov test[22] which is a nonparametric goodness-of-fit statistical test. The Anderson-Darling test returns a list of critical values rather than a single p-value. The test will check against the Gaussian distribution (dist=’norm’) by default.

# Anderson-Darling Test - adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/

from scipy.stats import anderson
# normality test
result = anderson(df18['Life_Satisfaction'])
print('Statistic: %.3f' % result.statistic)
p = 0
# checking for a range of critical values
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < result.critical_values[i]:
        # fail to reject the null that the data is normal if test statistic is less than the critical value.
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        # reject the null that the data is normal if test statistic is greater than the critical value.
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))

Statistic: 0.235
15.000: 0.560, data looks normal (fail to reject H0)
10.000: 0.638, data looks normal (fail to reject H0)
5.000: 0.766, data looks normal (fail to reject H0)
2.500: 0.893, data looks normal (fail to reject H0)
1.000: 1.062, data looks normal (fail to reject H0)

The sample of Life Ladder for 2018 does appear to be normally distributed based on the above tests. Therefore I can go ahead and simulate data for this variable using the normal distribution using the sample mean and standard deviation statistics.

Simulate Life ladder variable:

I will now simulate ‘Life Ladder’ using the numpy.random.normal function. This takes 3 arguments - loc for the mean or centre of the distributon, scale for the spread or width of the distribution - the standard deviation, and size for the number of samples to draw from the distribution. Without setting a seed the actual values will be different each time but the shape of the distribution should be the same.

print(f" The mean of the sample dataset for 2018 is {c18['Life_Satisfaction'].mean():.3f} ")
print(f" The standard deviation of the sample dataset for 2018 is {c18['Life_Satisfaction'].std():.3f} ")
c18.shape

  The mean of the sample dataset for 2018 is 5.500 
  The standard deviation of the sample dataset for 2018 is 1.098 
  
  (126, 7)

df_years.loc[df_years.loc[:,'Year']==2016]

	Year	Country	Region	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	RegionCode
8	2016	Afghanistan	Southern Asia	4.220169	7.497038	0.559072	53.000000	7
19	2016	Albania	Central and Eastern Europe	4.511101	9.337532	0.638411	68.099998	2
26	2016	Algeria	Middle East and Northern Africa	5.340854	9.541166	0.748588	65.500000	5
43	2016	Argentina	Latin America and Caribbean	6.427221	9.830088	0.882819	68.400002	4
...	...	...	...	...	...	...	...	...
1665	2016	Vietnam	Southeastern Asia	5.062267	8.672080	0.876324	67.500000	6
1676	2016	Yemen	Middle East and Northern Africa	3.825631	7.299221	0.775407	55.099998	5
1688	2016	Zambia	Sub-Saharan Africa	4.347544	8.203072	0.767047	54.299999	1
1701	2016	Zimbabwe	Sub-Saharan Africa	3.735400	7.538829	0.768425	54.400002	1

142 rows × 8 columns

Creating subsets of the data for each in order to be able to see how the distribution varies from year to year.

# subset of the data for each year
c17= Cy.loc[Cy.loc[:,'Year']==2017]
c16= Cy.loc[Cy.loc[:,'Year']==2016]
c15= Cy.loc[Cy.loc[:,'Year']==2015]
c14= Cy.loc[Cy.loc[:,'Year']==2014]

# plot the distribution of life satisfaction variable for each of the years 2014 to 2018
sns.kdeplot(c14['Life_Satisfaction'], label="2014")
sns.kdeplot(c15['Life_Satisfaction'], label="2015")
sns.kdeplot(c16['Life_Satisfaction'], label="2016")
sns.kdeplot(c17['Life_Satisfaction'], label="2017")
sns.kdeplot(c18['Life_Satisfaction'], label="2018")
plt.title("Distribution of Life Satisfaction for each year 2014 to 2017");

png

Plot some simulated data.

Here some data is simulated for the Life Satisfaction/Ladder variable is simulated using the mean and standard deviation for the 2018 data. As the variable appears to be normally distributed it would be ok to do this. However as mnay of the other varaibles are not normally distributed at the dataset level but do look gaussian at the clustered region level I will be simulating the other variables this way.

sns.kdeplot(c18['Life_Satisfaction'], label="Life Satisfaction 2018", shade="True")
# simulate data based on statistics from the 2018 sample dataset
# simulate life ladder using sample mean and standard deviation
for i in range(10):
        x= np.random.normal(c18['Life_Satisfaction'].mean(),c18['Life_Satisfaction'].std(),130)
        sns.kdeplot(x)
# plot the distribution of the actual dataset

# plot the simulated data 
plt.title("Actual data versus the Simulated data for Life Ladder 2018")
plt.legend();

png

The above plots of the distribution of Life_Satisfaction illustrate how the distribution changes slightly for each simulation due to the randomness.

Next plot the distribution for the 4 clustered regions.

c18.groupby('cluster_pred').count()

	Year	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	RegionCode
cluster_pred
0	56	56	56	56	56	56
1	31	31	31	31	31	31
2	39	39	39	39	39	39

The distributions of Life Ladder for each of the 3 clustered regions.

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Life_Satisfaction'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Life_Satisfaction'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Life_Satisfaction'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Life Ladder for the 3 clusters for 2017");
plt.legend();

png

# separate out the clustered regions for all years from 2011 to 2018
cluster0 =clusters.loc[clusters.loc[:,'cluster_pred']==0]
cluster1 =clusters.loc[clusters.loc[:,'cluster_pred']==1]
cluster2 =clusters.loc[clusters.loc[:,'cluster_pred']==2]
## separate out the clustered regions for 2018 data
c18_0 =c18.loc[c18.loc[:,'cluster_pred']==0]
c18_1 =c18.loc[c18.loc[:,'cluster_pred']==1]
c18_2 =c18.loc[c18.loc[:,'cluster_pred']==2]

The distribution of the Life Satisfaction (Life Ladder) variable looks quite normal when taken over all the countries in the dataset. When shown for each of the 3 clustered regions, the distributions are all still normal looking but with very different shapes and locations.

Mean and Standard deviation for each region group:

# cluster 0
print(f" The mean of the sample dataset for 2018 is {cluster0['Life_Satisfaction'].mean():.3f} ")
print(f" The standard deviation of the sample dataset for 2018 is {cluster0['Life_Satisfaction'].std():.3f} ")
c18.shape

 The mean of the sample dataset for 2018 is 5.410 
 The standard deviation of the sample dataset for 2018 is 0.741 

 (126, 7)

Simulate for each cluster based on their statistics

LLc0= np.random.normal(c18_0['Life_Satisfaction'].mean(),c18_0['Life_Satisfaction'].std(),31)
LLc1= np.random.normal(c18_1['Life_Satisfaction'].mean(),c18_1['Life_Satisfaction'].std(),39)
LLc2= np.random.normal(c18_2['Life_Satisfaction'].mean(),c18_2['Life_Satisfaction'].std(),56)

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(LLc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(LLc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(LLc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Life Satisfaction for 3 region clusters");

png

Illustrating how the distributions changed for each simulation due to random variation.

f,axes=plt.subplots(1,3, figsize=(13,3))
for i in range(5):
        sns.kdeplot(np.random.normal(c18_0['Life_Satisfaction'].mean(),c18_0['Life_Satisfaction'].std(),31), shade=True, ax=axes[0])
    
for i in range(5):
        sns.kdeplot(np.random.normal(c18_1['Life_Satisfaction'].mean(),c18_1['Life_Satisfaction'].std(),39), shade=True, ax=axes[1])
for i in range(5):
        sns.kdeplot(np.random.normal(c18_2['Life_Satisfaction'].mean(),c18_2['Life_Satisfaction'].std(),59), shade=True, ax=axes[2])

plt.suptitle("Distributions of 5 simulations for Life Satisfaction for each of the 3 groups")

Text(0.5, 0.98, 'Distributions of 5 simulations for Life Satisfaction for each of the 3 groups')

png

The variable ‘Social support’ was the result of a question in the Gallop World Poll with the national average of the binary responses for each country to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”.

The distribution from the dataset shows that it is left skewed. The scatterplots above showed that it is positively correlated with life satisfaction, income levels and healthy life expectancy. The boxplots below show that the median values fall into roughly 3 groups by regions with Western Europe, North America and Australia and New Zealand having the highest scores, while Southern Asia and Sub Saharan Africa have the lowest median scores and a wider spread.

Like Life ladder variable, the values are the result of national averages to questions in the Gallup World Poll. In this case the question had a binary answer 0 or 1 and therefore the actual values in the dataset range between 0 and 1. Social Support is a non-negative real values. The statistics vary from region to region.

df18['Social_Support'].describe()

count    136.000000
mean       0.810544
std        0.116332
min        0.484715
25%        0.739719
50%        0.836641
75%        0.905608
max        0.984489
Name: Social_Support, dtype: float64

f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of the Social support for 2018
sns.distplot(df18['Social_Support'], label="Social Support 2018", ax=axes[0])
# distribution of social support from the dataframe with the clusters. some missing values
sns.distplot(c18['Social_Support'], label="Social Support - 2018", ax=axes[0])
axes[0].set_title("Distribution of Social Support for 2018 only")
#axes[0].legend()
# kdeplot of the Social Support for all years in the extended dataset (WHR Table2.1)
sns.distplot(df_years['Social_Support'].dropna(), label="Social_Support - all years", ax=axes[1])
sns.distplot(c18['Social_Support'], label="Social_Support - 2018",ax=axes[1])
axes[1].set_title("Distribution of Social_Support for 2012 to 2018");
#axes[1].legend();

png

# plot the distribution of life satisfaction variable for each of the years 2014 to 2018
sns.kdeplot(c14['Social_Support'], label="2014")
sns.kdeplot(c15['Social_Support'], label="2015")
sns.kdeplot(c16['Social_Support'], label="2016")
sns.kdeplot(c17['Social_Support'], label="2017")
sns.kdeplot(c18['Social_Support'], label="2018")
plt.title("Distribution of Social_Support' for each year 2014 to 2017");

png

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Social_Support'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Social_Support'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Social_Support'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Social Support variable for the 3 clusters for 2017");
plt.legend();

png

The distribution of Social Support for all the countries in 2018 is a left-skewed distribution with a long tail to the left. When this is broken down by cluster groups of regions, the distribution of each cluster looks more normal shaped although the centres of the distribution and the spread vary widely. Cluster region 0 (green) has social support values centred around different values.

c18.groupby('cluster_pred').mean()

	Year	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	RegionCode
cluster_pred
0	2018.0	5.322920	9.294176	0.820008	64.951786	3.946429
1	2018.0	4.415971	7.732756	0.669998	54.945161	1.290323
2	2018.0	6.614540	10.401640	0.900387	71.515385	4.102564

SSc0= np.random.normal(c18_0['Social_Support'].mean(),c18_0['Social_Support'].std(),31)
SSc1= np.random.normal(c18_1['Social_Support'].mean(),c18_1['Social_Support'].std(),39)
SSc2= np.random.normal(c18_2['Social_Support'].mean(),c18_2['Social_Support'].std(),56)

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(SSc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(SSc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(SSc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Social Support for 3 region clusters");

png

When I first looked at simulating the social support variable I considered non-parametric ways of working with the data such as the bootstrap resampling method. Bootstrapping is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples. Samples are constructed by drawing observations from a large data sample one at a time and then returning the observation so that it could be drawn again. Therefore any given observation could be included in the sample more than once while some observations might never be drawn. I referred to a blogpost on Machinelearningmastery.com[23] which outlines the steps to implement it using the scikit-learn resample function[24] which takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling.

Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution function of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed dataset (and of equal size to the observed dataset). The basic idea of bootstrapping is that inference about a population from sample data can be modelled by resampling the sample data and performing inference about a sample from resampled data. As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap-resamples, the ‘population’ is in fact the sample, and this is known; hence the quality of inference of the ‘true’ sample from resampled data (resampled → sample) is measurable. Wikipedia wiki on Bootstrapping[25]

The process for building one sample is:

choose the size of the sample
While the size of the sample is less than the chosen size
1. Randomly select an observation from the dataset
2. Add it to the sample The number of repetitions must be large enough that meaningful repetitions can be calculated on the sample.

Based on the information above I use sampling with replacement to simulate the Social support variable using the sample size the same as the original dataset which is 136 observations. numpy.random.choice function can be used for this purpose. The mean of the Social support variable in the dataset can be considered as a single estimate of the mean of the population of social support while the standard deviation is an estimate of the variability. The simplest bootstrap method would involve taking the original data set of N(136) Social support values and sampling from it to form a new sample - the ‘resample’ or bootstrap sample that is also of size 136. If the bootstrap sample is formed using sampling with replacement from the original data sample with a large enough size then there should be very little chance that the bootstrap sample will be the exact same as the original sample. This process is repeated many times (thousands) and the mean computed for each bootstrap sample to get the bootstrap estimates which can then be plotted on a histogram which is considered an estimate of the shape of the distribution.

Sampling with replacement using np.random.choice:

Here I use a loop to draw multiple random samples with replacements from the dataset. Additionally the means for each sample can be calculated and then all the means from the different samples are plotted to show their distribution but for this project I just need a simulated data set.

# using the Social support data from the dataset.
social=df18['Social_Support']
# create a list to store the means of the samples, set the number of simulations
mean_social_sim, sims = [], 20
# use loop to create 100 samples - takes very long to do 1000 

# plot the distribution of social support variable from the dataset, add title
sns.kdeplot(social, label="Actual Data", shade=True)
for _ in range(sims):
    # draw a random sample from social with replacement and store it in social_sample
    social_sample=np.random.choice(social, replace=True, size=len(social))

    # plot the distribution of each sample
    sns.kdeplot(social_sample)
    # add title
    plt.title("Simulating using random sample with replacement")
    social_mean = np.mean(social_sample)
    # append the mean of each sample to mean_social
    mean_social_sim.append(social_mean)

png

The main use of the bootstrap is making inferences about an estimate for a population parameter on sample data. For the purposes of this project I just need to simulate a single dataset. However by calculating the bootstrap means and comparing them to the dataset I am attempting to replicate shows that it is a suitable method here when the data does not follow a particular distribution.

df18['Social_Support'].describe()

count    136.000000
mean       0.810544
std        0.116332
min        0.484715
25%        0.739719
50%        0.836641
75%        0.905608
max        0.984489
Name: Social_Support, dtype: float64

np.mean(mean_social_sim)

0.8080325175953262

Simulating Log GDP per capita

The GDP per capita in the World Happiness Report dataset are in purchasing power parity at constant 2011 international dollar prices which are mainly from the World Development Indicators in 2018. Purchasing power parity is necessary when looking to compare GDP per capita between countries which is what the World Happiness Report seeks to do. Nominal GDP would be fine when just looking at a single country. The log of the GDP figures is taken.

Per capita GDP is the Total Gross Domestic Product for a country divided by its population and breaks down a country’s GDP per person. As the World Happiness Report states this is considered a universal measure for gauging the prosperity of nations.

The earlier plots showed that the distribution of log GPD per capita in the dataset is not normally distributed. Per capita GDP is generally a unimodal but skewed distribution. There are many regional variations in income across the world. The distribution of Log GPD per capita appears somewhat left skewed.

The scatterplots above showed that it is positively correlated with life satisfaction, social support and healthy life expectancy.

The log of GDP per capita is used to represent the Gross Domestic Product per capita. It is a non-negative real value. Because the log of the GDP per capita figures are used the range is between 6 and 12 although this varies from region to region. The

df['Log_GDP_per_cap'].describe()

count    1676.000000
mean        9.222456
std         1.185794
min         6.457201
25%         8.304428
50%         9.406206
75%        10.193060
max        11.770276
Name: Log_GDP_per_cap, dtype: float64

f,axes=plt.subplots(1,2, figsize=(12,3))

# distribution of social support from the dataframe with the clusters. some missing values
sns.distplot(c18['Log_GDP_per_cap'].dropna(), label="Log_GDP_per_cap - 2018", ax=axes[0])
axes[0].set_title("Distribution of Log_GDP_per_cap for 2018 only")
#axes[0].legend()
# kdeplot of the Social Support for all years in the extended dataset (WHR Table2.1)
sns.distplot(clusters['Log_GDP_per_cap'].dropna(), label="Log_GDP_per_cap - all years", ax=axes[1])
axes[1].set_title("Distribution of Log_GDP_per_cap for 2012 to 2018");
#axes[1].legend();

png

# plot the distribution of Log GDP per capita variable for each of the years 2014 to 2018
sns.kdeplot(c14['Log_GDP_per_cap'], label="2014")
sns.kdeplot(c15['Log_GDP_per_cap'], label="2015")
sns.kdeplot(c16['Log_GDP_per_cap'], label="2016")
sns.kdeplot(c17['Log_GDP_per_cap'], label="2017")
sns.kdeplot(c18['Log_GDP_per_cap'], label="2018")
plt.title("Distribution of Log_GDP_per_cap' for each year 2014 to 2017");

png

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Log_GDP_per_cap'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Log_GDP_per_cap'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Log_GDP_per_cap'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Log_GDP_per_cap variable for the 3 clusters for 2017");
plt.legend();

png

When the distribution of Log GDP per capita is broken down by cluster groups of regions, the distribution of each cluster looks more normal shaped although the centres of the distribution and the spread vary widely as with the other variables.

Mean and standard deviation for Log GDP per capita by clustered group:

c18.groupby('cluster_pred').mean()

	Year	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	RegionCode
cluster_pred
0	2018.0	5.322920	9.294176	0.820008	64.951786	3.946429
1	2018.0	4.415971	7.732756	0.669998	54.945161	1.290323
2	2018.0	6.614540	10.401640	0.900387	71.515385	4.102564

Simulate data for the Log GDP per capita variable for each of the cluster groups:

GDPc0= np.random.normal(c18_0['Log_GDP_per_cap'].mean(),c18_0['Log_GDP_per_cap'].std(),31)
GDPc1= np.random.normal(c18_1['Log_GDP_per_cap'].mean(),c18_1['Log_GDP_per_cap'].std(),39)
GDPc2= np.random.normal(c18_2['Log_GDP_per_cap'].mean(),c18_2['Log_GDP_per_cap'].std(),56)

Plot the distributions of the simulated Log GDP per capita variable for the 3 groups:

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(GDPc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(GDPc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(GDPc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Log GDP per capita for 3 region clusters");

png

Healthy Life Expectancy

According to the World Happiness Report, Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository. Some interpolation and exterpolation was used to get the data for the years covered in the 2018 report. As some countries were not covered in the WHO data, other sources were used by the researchers.

The values are non-negative real numbers between 48 and 77 years but this varies from year to year as the statistics showed.

df18['Life_Expectancy'].describe()

count    132.000000
mean      64.670832
std        6.728247
min       48.200001
25%       59.074999
50%       66.350002
75%       69.075001
max       76.800003
Name: Life_Expectancy, dtype: float64

c18.columns

Index(['Year', 'Life_Satisfaction', 'Log_GDP_per_cap', 'Social_Support',
       'Life_Expectancy', 'RegionCode', 'cluster_pred'],
      dtype='object')

f,axes=plt.subplots(1,2, figsize=(12,3))

# distribution of Life_Expectancy from the dataframe with the clusters. some missing values
sns.distplot(c18['Life_Expectancy'].dropna(), label="Life_Expectancy - 2018", ax=axes[0])
axes[0].set_title("Distribution of Life_Expectancy for 2018 only")
#axes[0].legend()
# kdeplot of the Life_Expectancy from 2012 to 2018)
sns.distplot(clusters['Life_Expectancy'].dropna(), label="Life_Expectancy - all years", ax=axes[1])
axes[1].set_title("Distribution of Life_Expectancy for 2012 to 2018");
#axes[1].legend();

png

# plot the distribution of Life_Expectancy variable for each of the years 2014 to 2018
sns.kdeplot(c14['Life_Expectancy'], label="2014")
sns.kdeplot(c15['Life_Expectancy'], label="2015")
sns.kdeplot(c16['Life_Expectancy'], label="2016")
sns.kdeplot(c17['Life_Expectancy'], label="2017")
sns.kdeplot(c18['Life_Expectancy'], label="2018")
plt.title("Distribution of Life_Expectancy' for each year 2014 to 2017");

png

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Life_Expectancy'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Life_Expectancy'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Life_Expectancy'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Life_Expectancy variable for the 3 clusters for 2018");
plt.legend();

png

Looking at the Life Expectancy variable by cluster group, the distribution looks normal for the cluster region 1 while the other two groups are less so. There seems to be two peaks in the distribution for cluster group 0 which probably needs to be broken out more.

Mean and standard deviation for Healthy Life Expectancy at birth by clustered group:

c18.groupby('cluster_pred').mean()

	Year	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	RegionCode
cluster_pred
0	2018.0	5.322920	9.294176	0.820008	64.951786	3.946429
1	2018.0	4.415971	7.732756	0.669998	54.945161	1.290323
2	2018.0	6.614540	10.401640	0.900387	71.515385	4.102564

Simulate data for the Life Expectancy variable for each of the cluster groups:

LEc0= np.random.normal(c18_0['Life_Expectancy'].mean(),c18_0['Life_Expectancy'].std(),31)
LEc1= np.random.normal(c18_1['Life_Expectancy'].mean(),c18_1['Life_Expectancy'].std(),39)
LEc2= np.random.normal(c18_2['Life_Expectancy'].mean(),c18_2['Life_Expectancy'].std(),56)

Plot the distributions of the simulated Healthy Life Expectancy at birth variable for the 3 groups:

# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(LEc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(LEc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(LEc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Life_Expectancy for 3 region clusters");

png

Create a dataset with the simulated variables.

Now that the individual variables have been simulated, the next step is to assemble the variables into a dataframe and ensure that the overall distributions and relationships are similar to an actual dataset. Because of the variations between the three groups I will create a dataframe for each group and then concatenate the three dataframes together.

Simulated Dataframe for Group0:

Here I create 31 countries and add the simulated data created earlier using numpy.random functions.

# create a list of countries by appending number to 'country'
G0_countries =[]
# use for loop and concatenate number to string
for i in range(31):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    G0_countries.append(countryname)

# create a dataframe and add the columns
G0=pd.DataFrame()
G0['Country']=G0_countries

# add columns based on simulations
G0['Life_Satisfaction']=LLc0
G0['Log_GDP_per_cap']=GDPc0
G0['Social_Support']=SSc0
G0['Life_Expectancy']=LEc0
# add group column
G0['Group']='Group0'

G0

	Country	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	Group
0	Sim_Country_0	5.306706	8.921793	0.787441	66.048883	Group0
1	Sim_Country_1	5.651307	8.654395	0.809210	66.436883	Group0
2	Sim_Country_2	5.661618	9.308995	0.693635	63.622131	Group0
3	Sim_Country_3	4.861526	10.605480	0.782290	64.910602	Group0
...	...	...	...	...	...	...
27	Sim_Country_27	5.131967	8.575882	0.721815	65.468924	Group0
28	Sim_Country_28	6.445590	8.746669	1.001303	61.517664	Group0
29	Sim_Country_29	5.824301	9.300203	0.848021	62.704594	Group0
30	Sim_Country_30	5.016133	9.535686	0.969046	64.939119	Group0

31 rows × 6 columns

Simulated dataframe for group 1

Here I create a dataframe for the 39 group 1 simulated data points.

# create a list of countries by appending number to 'country'
G1_countries =[]
# use for loop and concatenate number to string
for i in range(39):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    G1_countries.append(countryname)

# create a dataframe and add the columns
G1=pd.DataFrame()
G1['Country']=G1_countries
# add columns based on simulations
G1['Life_Satisfaction']=LLc1
G1['Log_GDP_per_cap']=GDPc1
G1['Social_Support']=SSc1
G1['Life_Expectancy']=LEc1
# add group column
G1['Group']='Group1'
G1

	Country	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	Group
0	Sim_Country_0	4.671344	6.724028	0.641005	50.339410	Group1
1	Sim_Country_1	4.761840	7.475304	0.676521	54.053141	Group1
2	Sim_Country_2	5.851653	8.594777	0.591575	52.256857	Group1
3	Sim_Country_3	4.299704	9.815454	0.723505	54.739286	Group1
...	...	...	...	...	...	...
35	Sim_Country_35	4.904859	8.355346	0.576573	58.568376	Group1
36	Sim_Country_36	5.592356	6.747997	0.658175	59.476083	Group1
37	Sim_Country_37	4.909638	7.444697	0.721786	51.673735	Group1
38	Sim_Country_38	4.673152	7.425028	0.586048	55.680735	Group1

39 rows × 6 columns

Simulated dataframe for group 2

Here I create a dataframe for the 56 group 1 simulated data points.

# create a list of countries by appending number to 'country'
G2_countries =[]
# use for loop and concatenate number to string
for i in range(56):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    G2_countries.append(countryname)

# create a dataframe and add the columns
G2=pd.DataFrame()
G2['Country']=G2_countries
# add columns based on simulations
G2['Life_Satisfaction']=LLc2
G2['Log_GDP_per_cap']=GDPc2
G2['Social_Support']=SSc2
G2['Life_Expectancy']=LEc2
# add group column
G2['Group']='Group2'
G2

	Country	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	Group
0	Sim_Country_0	6.650582	10.486408	0.957040	71.204546	Group2
1	Sim_Country_1	5.948507	10.530889	0.832581	70.661832	Group2
2	Sim_Country_2	7.115427	9.467759	0.945339	69.148849	Group2
3	Sim_Country_3	5.896939	11.135984	0.843904	73.312323	Group2
...	...	...	...	...	...	...
52	Sim_Country_52	6.862937	10.019423	0.928102	71.754594	Group2
53	Sim_Country_53	5.862264	10.268521	0.907081	71.617507	Group2
54	Sim_Country_54	6.288760	11.645804	0.942329	76.936191	Group2
55	Sim_Country_55	6.509125	10.598213	0.901340	70.859460	Group2

56 rows × 6 columns

Create one dataframe from the three group dataframes:

Now I add the three dataframes together using the pandas pd.concat function.

# using the pandas concat function to add the dataframes together.
result=pd.concat([G0,G1,G2])
result.head()

	Country	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	Group
0	Sim_Country_0	5.306706	8.921793	0.787441	66.048883	Group0
1	Sim_Country_1	5.651307	8.654395	0.809210	66.436883	Group0
2	Sim_Country_2	5.661618	9.308995	0.693635	63.622131	Group0
3	Sim_Country_3	4.861526	10.605480	0.782290	64.910602	Group0
4	Sim_Country_4	5.348437	9.782862	0.785385	66.617080	Group0

sns.pairplot(result, hue='Group', palette="colorblind");

png

data2018 =c18.drop(['Year', 'RegionCode'], axis=1)
data2018.groupby('cluster_pred').count()

	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy
cluster_pred
0	56	56	56	56
1	31	31	31	31
2	39	39	39	39

result.groupby('Group').describe()

	Life_Satisfaction								Log_GDP_per_cap		...	Social_Support		Life_Expectancy
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
Group
Group0	31.0	5.427230	0.576599	4.029621	5.049565	5.269363	5.799328	6.668034	31.0	9.285359	...	0.870293	1.053083	31.0	65.301443	2.677880	57.921714	63.916421	65.468924	66.804120	69.822780
Group1	39.0	4.623613	0.684148	3.280790	4.271048	4.548910	4.955297	6.220091	39.0	7.982179	...	0.698752	0.808642	39.0	55.698451	2.796176	48.970274	54.109545	55.680735	57.816293	61.551827
Group2	56.0	6.662692	0.790043	5.351351	5.998424	6.644296	7.096606	8.860939	56.0	10.469700	...	0.928630	1.023850	56.0	71.560481	2.236689	63.705882	70.587843	71.637011	72.599097	77.238667

3 rows × 32 columns

data2018
data2018['cluster_pred']=data2018["cluster_pred"].replace(0,'Group2')
data2018['cluster_pred']=data2018["cluster_pred"].replace(1,'Group0')
data2018['cluster_pred']=data2018["cluster_pred"].replace(2,'Group1')

sns.pairplot(data2018, hue='cluster_pred', palette="bright", hue_order=['Group0','Group1','Group2']);

png

Simulated Life Satisfaction / Life Ladder variable:

Compare statistics from actual data for 2018 to simulated data:

data2018.groupby('cluster_pred').mean()

	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy
cluster_pred
Group0	4.415971	7.732756	0.669998	54.945161
Group1	6.614540	10.401640	0.900387	71.515385
Group2	5.322920	9.294176	0.820008	64.951786

result.groupby('Group').mean()

	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy
Group
Group0	5.427230	9.285359	0.811026	65.301443
Group1	4.623613	7.982179	0.649658	55.698451
Group2	6.662692	10.469700	0.898550	71.560481

Plot actual vs simulated Life Satisfaction:

# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Life_Satisfaction'], ax=axes[0,0], shade=True, label="simulated")
sns.kdeplot(data2018['Life_Satisfaction'], ax=axes[0,0], shade=True,label="actual")
# set axes title
axes[0,0].set_title("Distribution of Life Ladder")
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Log_GDP_per_cap'], ax=axes[0,1], shade=True, label="simulated")
sns.kdeplot(data2018['Log_GDP_per_cap'].dropna(), ax=axes[0,1], shade=True, label="actual")
# axes title
axes[0,1].set_title("Distribution of Log GDP per capita")
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Social_Support'].dropna(), ax=axes[1,0], shade=True, label="simulated")
sns.kdeplot(data2018['Social_Support'].dropna(), ax=axes[1,0], shade=True, label="simulated")
axes[1,0].set_title("Distribution of Social Support");
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Life_Expectancy'], ax=axes[1,1],shade=True, label="Simulated");
sns.kdeplot(data2018['Life_Expectancy'], ax=axes[1,1],shade=True, label="Simulated");
axes[1,1].set_title("Distribution of Simulated Life expectancy");  
plt.tight_layout();

png

They are not perfect matches but it will do for now! Every time a numpy.random function is run it will produce different numbers from the same distribution unless the seed is set. Now I will just check the correlations and any relationships in the data.

Comparing Relationships between the variables.

Relationships between the simulated variables:

# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,4))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=result, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=result, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=result, ax=axes[2])
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();

png

Relationships between the actual variables:

# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,4))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=data2018, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=data2018, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=data2018, ax=axes[2])
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();

png

# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.distplot(result['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Simulated Life Ladder");
sns.distplot(result['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Simulated Income");
sns.distplot(result['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Simulated Social Support");
sns.distplot(result['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Simulated Life expectancy");  
plt.tight_layout();

png

# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.distplot(data2018['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Actual Life Ladder");
sns.distplot(data2018['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Actual Income");
sns.distplot(data2018['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Actual Social Support");
sns.distplot(data2018['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Actual Life expectancy");  
plt.tight_layout();

png

data2018

	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	cluster_pred
10	2.694303	7.494588	0.507516	52.599998	Group0
21	5.004403	9.412399	0.683592	68.699997	Group1
28	5.043086	9.557952	0.798651	65.900002	Group2
45	5.792797	9.809972	0.899912	68.800003	Group1
...	...	...	...	...	...
1654	5.005663	9.270281	0.886882	66.500000	Group2
1667	5.295547	8.783416	0.831945	67.900002	Group2
1690	4.041488	8.223958	0.717720	55.299999	Group0
1703	3.616480	7.553395	0.775388	55.599998	Group0

126 rows × 5 columns

The simulated data:

The final simulated dataframe containing the 6 variables is called result.

result

	Country	Life_Satisfaction	Log_GDP_per_cap	Social_Support	Life_Expectancy	Group
0	Sim_Country_0	5.306706	8.921793	0.787441	66.048883	Group0
1	Sim_Country_1	5.651307	8.654395	0.809210	66.436883	Group0
2	Sim_Country_2	5.661618	9.308995	0.693635	63.622131	Group0
3	Sim_Country_3	4.861526	10.605480	0.782290	64.910602	Group0
...	...	...	...	...	...	...
52	Sim_Country_52	6.862937	10.019423	0.928102	71.754594	Group2
53	Sim_Country_53	5.862264	10.268521	0.907081	71.617507	Group2
54	Sim_Country_54	6.288760	11.645804	0.942329	76.936191	Group2
55	Sim_Country_55	6.509125	10.598213	0.901340	70.859460	Group2

126 rows × 6 columns

Using the multivariate normal distribution function.

An alternative way to simulating the dataset could be to use the multivariate normal distribution function. While the variables are not all normally distributed at the dataset level, when broken down by the regions they do appear to be gaussian with each groups distributions having very different centres and spreads to each other.

There is a numpy.random.multivariate function that can generated correlated data.

I found a thread at stack overflow referencing how to use the multivariate function to [generate correlated data] (https://stackoverflow.com/a/16026231)[26] Here I filter the data for each group and calculate the covariance on the first three variables.

data2018.groupby('cluster_pred').describe()

	Life_Satisfaction								Log_GDP_per_cap		...	Social_Support		Life_Expectancy
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
cluster_pred
Group0	31.0	4.415971	0.730419	2.694303	3.997857	4.379262	4.924668	5.819827	31.0	7.732756	...	0.739598	0.864215	31.0	54.945161	2.912598	48.200001	53.450001	55.299999	57.100000	59.599998
Group1	39.0	6.614540	0.740771	5.004403	6.109656	6.665904	7.205219	7.858107	39.0	10.401640	...	0.930485	0.965962	39.0	71.515385	2.043286	68.300003	69.450001	72.300003	73.199997	75.000000
Group2	56.0	5.322920	0.704197	3.561047	4.840304	5.296465	5.902972	6.626592	56.0	9.294176	...	0.882849	0.984489	56.0	64.951786	2.692346	58.500000	63.574999	65.950001	67.025000	68.199997