Introduction to the simulation project for the Programming in Data Analysis Project 2019 as part of the Higher Diploma in Data Analytics at GMIT.
1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
The real-world phenomenon I have chosen is the World Happiness Score, in particular the main determinants of happiness at the country level across the world, as reported in the World Happiness Report [6].
The variables on which the national and international happiness scores are calculated are real and quantifiable. They include socio-economic indicators such as Gross Domestic Product (GDP) and life expectancy, as well as life evaluation questions about freedom, perception of corruption, and family or social support. According to the World Happiness Reports, differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score.
The aim of the World Happiness Report is to see which countries or regions rank highest in overall happiness and in each of the six factors contributing to happiness. Over the years the reports have looked at how country ranks and scores changed and whether any country experienced a significant increase or decrease in happiness.
The researchers studied how six different factors contribute to the happiness scores and the extent of each effect: economic production, social support, life expectancy, freedom, absence of corruption, and generosity. They looked at how these factors make life evaluations higher in each country than they are in Dystopia, a hypothetical country whose values equal the world's lowest national averages for each of the six factors. While these factors have no impact on the total score reported for each country, they were analysed to explain why some countries rank higher than others.
2. Investigate the types of variables involved, their likely distributions, and their relationships with each other.
First I will investigate the variables in the datasets used by the World Happiness Reports. I will study their distributions by looking at descriptive statistics and plots such as histograms and boxplots, and I will explore the relationships between the variables using visualisations such as scatterplots and pairplots, together with statistics such as correlation and covariance.
With this information in mind, I will try to create a simulated dataset that is as close to the real world phenomenon as possible.
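As a rough illustration of this kind of exploration, the sketch below shows how descriptive statistics, histograms, pairplots and correlations could be produced with pandas and seaborn. The file path and column names here are placeholders, not the actual names used in the World Happiness data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder file and column names for illustration only.
df = pd.read_csv("data/happiness2018.csv")
cols = ["Happiness_Score", "GDP", "Social_Support", "Life_Expectancy"]

# Descriptive statistics for the numeric variables.
print(df[cols].describe())

# Distribution of a single variable.
sns.histplot(df["Happiness_Score"], kde=True)
plt.show()

# Pairwise relationships and correlation between the variables.
sns.pairplot(df[cols])
plt.show()
print(df[cols].corr())
```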
3. Synthesise/simulate a data set as closely matching their properties as possible.
Having studied the distributions of the real dataset by looking at statistics and plots, I will use Python to simulate the data, focusing on the numpy.random package as much as possible but using other Python libraries as required. I will look at how simulation is performed and what must be considered when simulating a dataset such as this one. I will look at how each of the variables is distributed and how they could be simulated, and I will also consider the relationships between the variables. While it might be relatively straightforward to simulate a single variable, modelling the real-world correlations between the variables will be more challenging.
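One possible way to simulate several correlated variables at once is numpy's multivariate normal generator. In the sketch below the means and covariance matrix are placeholder values standing in for the statistics that would be estimated from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder means and covariance (standing in for values estimated from the
# real data) for three correlated variables: log GDP per capita, social
# support and healthy life expectancy.
means = [9.2, 0.8, 64.0]
cov = [[1.30, 0.10, 5.00],
       [0.10, 0.01, 0.40],
       [5.00, 0.40, 50.00]]

# Draw 156 simulated "countries" with the specified correlation structure.
simulated = rng.multivariate_normal(means, cov, size=156)
print(simulated[:5])
```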
As there is much inequality in the world, this will be reflected in the distribution of variables that model factors such as income and life expectancy. Therefore I will need to look at regional variations and how these would affect the simulation of the data. The distributions are likely to vary between regions, particularly between the less developed countries and the countries of the more developed world.
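A minimal sketch of how region-level differences could be built into a simulation, assuming hypothetical per-region means and standard deviations for a single variable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-region parameters (mean, standard deviation) for healthy
# life expectancy; the real values would come from regional summary statistics.
region_params = {
    "Western Europe":     (72.0, 2.0),
    "Sub-Saharan Africa": (55.0, 4.0),
    "Southeast Asia":     (64.0, 3.5),
}

rows = []
for region, (mu, sigma) in region_params.items():
    values = rng.normal(mu, sigma, size=30)   # 30 simulated countries per region
    rows.append(pd.DataFrame({"Region": region, "Life_Expectancy": values}))

sim_df = pd.concat(rows, ignore_index=True)
print(sim_df.groupby("Region")["Life_Expectancy"].describe())
```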
4. Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.
All the analysis will be documented in this notebook, which means that the document is quite lengthy and might take some time to load. The first section of code reads in the real-world dataset and gets it into a state where it is ready to be analysed. The data is available in excel and csv files, which are left unchanged. As the files containing the 2018 data did not include the geographic regions of the countries studied, I had to add these by merging with an earlier dataset. Some other manipulation, such as renaming columns, dropping unnecessary columns and adding region codes, is documented below. The end result is written to a csv file in the data folder of this repository.
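The preparation steps described above could look roughly like the following sketch; the file names and column names are placeholders rather than the actual names used in the notebook.

```python
import pandas as pd

# Placeholder file and column names for illustration only.
df_2018 = pd.read_csv("data/2018.csv")
regions = pd.read_csv("data/2016.csv")[["Country", "Region"]]

# The 2018 file has no region column, so merge it in from an earlier dataset.
df = df_2018.merge(regions, on="Country", how="left")

# Tidy up: rename awkward columns, drop ones that are not needed,
# and add a short categorical code for each region.
df = df.rename(columns={"Score": "Happiness_Score"})
df = df.drop(columns=["Standard Error"], errors="ignore")
df["Region_Code"] = df["Region"].astype("category").cat.codes

# Write the prepared dataset back to the data folder.
df.to_csv("data/prepared_2018.csv", index=False)
```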
About data simulation.
The goal of this project is to simulate a dataset. Data is simulated for a number of reasons. Monte Carlo simulations use repeated random sampling to model real-world problems, and simulated data is very useful for learning and demonstration purposes. Data can be simulated before the real-world data is collected to help identify the type of tests and programs that will need to be run. Collecting real data requires time and money, whereas data can be simulated easily using computer programs.
Statistical analysis can be performed on the simulated data in advance of collecting the real data, and this process can be repeated as many times as needed. By studying simulated data you can become more familiar with different kinds of data distributions and be in a better position to make decisions about the data, such as how to measure it and how much is required. Simulations produce multiple sample outcomes, and experiments can be run by modifying the inputs and seeing how this changes the outputs. The process of generating a random sample can be repeated many times, which shows how often you would expect to get particular outcomes, and the outcomes can then be averaged across all simulations.
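As a small illustration of repeated random sampling, the sketch below runs a toy Monte Carlo experiment with numpy and averages the outcome across all simulations; the distribution and parameters are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Monte Carlo experiment: estimate the mean of a skewed population by
# repeating a random-sampling experiment many times and averaging the results.
n_simulations = 10_000
sample_size = 100

sample_means = np.empty(n_simulations)
for i in range(n_simulations):
    sample = rng.exponential(scale=2.0, size=sample_size)  # one simulated sample
    sample_means[i] = sample.mean()

# The average over all simulations is close to the true mean (2.0), and the
# spread shows how much a single experiment could vary.
print(sample_means.mean(), sample_means.std())
```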
When data is collected, it is often only a small sample from the overall population of interest. The researchers of the World Happiness Reports did not collect data from everyone: the typical sample size was around 1000 people per country, with some countries surveyed more than once a year and others less often. A sample is a subset of numbers from a distribution, and the bigger the sample size, the more it resembles the distribution from which it is drawn. Depending on that distribution, some numbers will occur more often than others. Sample statistics are descriptions of the data that can be calculated from the sample and then used to make inferences about the population. The population parameters, the characteristics of the actual population from which the sample is taken, are of most interest, and samples are used to estimate them. The sample mean is the mean of the numbers in the sample, while the population mean is the mean of the entire population, but it is not always possible to study the entire population directly. The law of large numbers says that as the sample size increases, the sample mean gets closer to the true population mean; the more data that is collected, the closer the sample statistics get to the true population parameters.
The sampling distribution of the sample mean is obtained by collecting many samples from the population and calculating the sample mean of each one. If you know the type of distribution, you can sample data from it, calculate the mean (or any other sample statistic) of each sample, and plot the results as a histogram to show the distribution of that statistic. Sampling distributions tell you what to expect from your data.
Simulation can be used to find out what a sample looks like if it comes from a particular distribution. This information can be used to make inferences about whether a sample came from that distribution or not. The sampling distribution of a statistic varies as a function of sample size: small samples taken from the distribution will have sample statistics, such as sample means, that vary quite a bit from sample to sample, so the sampling distribution will be quite wide, while larger samples are more likely to have similar statistics and a narrower sampling distribution.
As the size of the samples increases, the mean of the sampling distribution approaches the mean of the population. The sampling distribution is itself a distribution and has some variance; its standard deviation is known as the standard error. As the sample size increases, the standard error of the sample mean decreases. According to the central limit theorem, as the sample size increases the sampling distribution of the mean begins to look more like a normal distribution, no matter what the shape of the population distribution is.
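The following sketch illustrates these ideas: it draws repeated samples of increasing size from a deliberately skewed population and shows the standard error of the sample mean shrinking as the sample size grows. The population and sample sizes are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)

# A skewed population: individual values are far from normally distributed.
population = rng.exponential(scale=2.0, size=1_000_000)

for n in (10, 100, 1000):
    # Build the sampling distribution of the mean for samples of size n.
    means = rng.choice(population, size=(5000, n)).mean(axis=1)
    # The spread (standard error) shrinks roughly like sigma/sqrt(n), and a
    # histogram of `means` looks increasingly normal (central limit theorem).
    print(n, means.mean(), means.std())
```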
Large experiments are considered more reliable than smaller ones. If you take a big enough sample, the sample mean gives a very good estimate of the population mean.
When simulating a random variable, you first need to define its possible outcomes; to do this you can use the sample statistics calculated from the sample dataset. Using simulated data also allows you to identify coding errors, as you know in advance roughly what the outcomes should be.
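For example, a variable could be simulated from sample statistics like this; the mean and standard deviation below are made-up values standing in for statistics taken from the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical sample statistics taken from the real data.
sample_mean, sample_std = 5.4, 1.1

# Simulate a variable with those properties; because we set the parameters
# ourselves, we know roughly what any downstream code should report back,
# which makes it easier to spot coding errors.
simulated_scores = rng.normal(sample_mean, sample_std, size=156)
print(pd.Series(simulated_scores).describe())
```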
Resampling methods are another way of simulating data and involve resampling with replacement. Bootstrap resampling is the most common method.
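A minimal bootstrap sketch, using a hypothetical observed sample and numpy's random choice with replacement:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical observed sample of happiness scores (for illustration only).
observed = rng.normal(5.4, 1.1, size=156)

# Bootstrap: resample the observed data with replacement many times and
# recompute the statistic of interest on each resample.
boot_means = np.array([
    rng.choice(observed, size=observed.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means estimates the uncertainty in the sample mean.
print(boot_means.mean(), np.percentile(boot_means, [2.5, 97.5]))
```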
For this section I referred to the online textbook Answering Questions with Data: Introductory Statistics for Psychology Students by Matthew J. C. Crump [6].