by Angela Carpenter.
This notebook contains my submission for the Machine Learning and Statistics Tasks 2020 project, completed as a requirement of the Machine Learning and Statistics module at GMIT, part of the Higher Diploma in Computing and Data Analytics programme.
Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places. Your code should not depend on any module from the standard library or otherwise. You should research the task first and include references and a description of your algorithm.
The square root of a number is a value that when multiplied by itself gives the number. In geometrical terms the square root function maps the area of a square to its side length.
Square roots of positive numbers are not in general rational numbers, and so cannot be written as a terminating or recurring decimal expression. Therefore in general any attempt to compute a square root expressed in decimal form can only yield an approximation, though a sequence of increasingly accurate approximations can be obtained. [1]
The square root of two, then, is the number that when multiplied by itself gives the number two. Geometrically, the square root of 2 is the length of the diagonal of a square whose sides are of length 1.
The square root of 2, or the one-half power of 2, written in mathematics as $\sqrt{2}$ or $2^{1/2}$, is the positive algebraic number that, when multiplied by itself, equals the number 2. Technically, it must be called the principal square root of 2, to distinguish it from the negative number with the same property. [2] The square root of two was probably the first number known to be irrational.
An irrational number is one which cannot be expressed as a ratio of two integers, i.e. as a fraction. Any number that can be expressed as a ratio of two integers, as a decimal that either terminates or repeats in a recurring manner, or as a percentage is said to be a rational number. All whole numbers are rational as they can be represented as fractions with a denominator of 1. Greek mathematicians such as Pythagoras believed that all numbers could be shown to be fractions, and that the number line was made up entirely of fractions, because for any two fractions we can always find another fraction in between them; because this process has no end, there are infinitely many such points. Hippasus, a Greek scholar, is believed to have first identified irrational numbers in the 5th century BCE as he worked on geometrical problems.[3] He applied Pythagoras's theorem, which states that the square of the hypotenuse of a right-angled triangle is equal to the sum of the squares of the other two sides, to a right-angled triangle with both shorter sides equal to one. Hippasus realised that the length of the hypotenuse of such a triangle was equal to the square root of two, but that the square root of two could not be expressed as a ratio of two whole numbers, i.e. as a fraction. There is no rational number which can be multiplied by itself to give precisely two, which means that the square root of two is an irrational number. The number two is termed non-square or square-free. The numbers 3, 5 and 7 and many others are also non-square and have irrational square roots. Numbers such as 4, 9 and 16 are square numbers and have square roots that are whole numbers and therefore rational. The value of the square root of two cannot be written down exactly because it cannot be represented as a fraction in the form a/b where a and b are whole numbers, and its decimal expansion has an infinite number of digits.[4]
The square root of a number $x$ is rational if and only if $x$ is a rational number that can be represented as a ratio of two perfect squares. The square roots of perfect squares such as 1, 4, 9 and 16 are integers. In all other cases, the square roots of positive integers are irrational numbers, and therefore have non-repeating digits in any standard positional notation system.
The square root of 2 is a real number that is not rational. Any nonzero real number gives a positive result when squared; the square root of two is the positive real number whose square is 2. An irrational number has an infinite number of decimals with no recurring pattern, and this is the case for the square root of two. The square root of 2 truncated to 50 decimal places is 1.41421356237309504880168872420969807856967187537694.[1]
In terms of computing, computer programs are made up of binary strings of 1s and 0s which can be converted into integers. Every computer program can therefore be mapped to one of the counting numbers in the set $\mathbb{N}$. The natural numbers $\mathbb{N}$, which exclude negative integers and zero, are often referred to as the counting numbers because they are used for counting and ordering. They are discrete numbers, in contrast to measuring numbers or real numbers, whose values are continuous and represent a distance along a line. The natural numbers are the basis from which other number sets are built by extension.
According to Turing, all computer programs are countable, which means they can be put into one-to-one correspondence with the natural numbers, but Turing also realised that not all numbers can be calculated or computed.[5] The set of natural numbers $\mathbb{N}$ is an infinite set, but it is the smallest infinite set there is and is known as countably infinite. The set of measuring numbers consists of all the numbers on the number line that are used for quantifying and measuring things, and there is an infinite amount of such numbers between any two discrete points on the number line.
A rational number can be expressed as the ratio of two whole numbers, so it can be stored as an array of two integers. A number is rational if and only if its decimal expansion terminates or eventually becomes periodic. Real numbers can be represented as an infinite decimal expansion, and real numbers such as $\sqrt{2}$ have an infinite number of places after the decimal point. Computers cannot really store such real numbers as there is only a finite number of bits available to store the number.
Floating point numbers are stored on a computer with a finite number of digits, so there is a finite limit on how many decimal places they can actually hold. A floating point number is actually stored using integers in separate parts: a sign, an exponent (sometimes called the characteristic) and a fractional significand (the mantissa).
The square root of two is an irrational number and therefore cannot be expressed as a ratio of two integers, with the implication that the exact value of the square root of two can never be stored in a computer. Instead an approximation of its value can be stored (in the same way an approximation of the irrational number $\pi$ is stored). Irrational numbers are often represented as procedures which are used to calculate, but not store, the number.
For this reason caution is always required when calculating statistics or performing analytics on irrational numbers. It is possible to work theoretically or to use approximations, but rounding errors could interfere with the calculations.
Chapter 5 of the Python Tutorial looks at the issues and limitations associated with floating point arithmetic. It illustrates how the decimal fraction 0.125 and the binary fraction 0.001 have identical values:
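As a quick check of that point (my own sketch rather than the tutorial's code), 0.125 can be compared with its binary expansion directly in Python:

# 0.125 is exactly the binary fraction 0.001, i.e. 0/2 + 0/4 + 1/8
print(0.125 == 0/2 + 0/4 + 1/8)    # True
print((0.125).as_integer_ratio())  # (1, 8): stored exactly as 1/8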
Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. A consequence is that, in general, the decimal floating-point numbers you enter are only approximated by the binary floating-point numbers actually stored in the machine. [6]
The Python tutorial demonstrates how the fraction $\frac{1}{3}$ can be approximated as a base 10 fraction as 0.3 or 0.33 or 0.333 but that no matter how many 3's you use the result will never actually be $\frac{1}{3}$. It will be increasingly a better approximation of $\frac{1}{3}$ though.
The same tutorial also demonstrates how the decimal value 0.1 cannot be represented exactly as a base 2 fraction because in base 2, $\frac{1}{10}$ is the infinitely repeating fraction
0.0001100110011001100110011001100110011001100110011... and if you stop at any finite number of bits you get an approximation.
On most machines today, floats are approximated using a binary fraction where the numerator uses the first 53 bits starting with the most significant bit and the denominator is a power of 2.
They illustrate how the binary fraction for the decimal 0.1 is 3602879701896397 / 2**55
which is close to the true value of $\frac{1}{10}$ but is not exactly equal to it. However, we generally don't notice that this is only an approximation because of the way values are displayed: Python and other programming languages only print a decimal approximation to the true decimal value of the binary approximation stored by the machine.
The true decimal value of the binary approximation stored for 0.1 is actually 0.1000000000000000055511151231257827021181583404541015625
but it is printed as 0.1. Python displays a rounded value instead of all the digits as in most situations this is enough.
The important point is that even though what we see printed (0.1) looks like the exact value of $\frac{1}{10}$, the actual value that is stored is the nearest representable binary fraction.
The article continues by illustrating how there are many different decimal numbers which actually share the same nearest approximate binary fraction. For example, the numbers 0.1, 0.10000000000000001 and 0.1000000000000000055511151231257827021181583404541015625 are all approximated by 3602879701896397 / 2**55; these decimal values all have the same approximation.
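A short sketch of my own, confirming those statements on any machine that uses IEEE-754 doubles (the values themselves are taken from the tutorial):

from decimal import Decimal

# all three decimal literals round to the same stored double, so Python treats them as equal
print(0.1 == 0.10000000000000001)
# the exact fraction actually stored for 0.1, and its exact decimal value
print((0.1).as_integer_ratio())              # (3602879701896397, 36028797018963968)
print((0.1).as_integer_ratio()[1] == 2**55)  # the denominator is 2**55
print(Decimal(0.1))                          # 0.1000000000000000055511151231257827021181583404541015625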
The errors in Python float operations are inherited from the floating-point hardware, and on most machines are on the order of no more than 1 part in 2**53 per operation. That’s more than adequate for most tasks, but you do need to keep in mind that it’s not decimal arithmetic and that every float operation can suffer a new rounding error. [6]
String formatting can be used to produce a limited number of significant digits, but this is simply rounding the display of the true machine value. For example, format(math.pi, '.12g') gives 12 significant digits while format(math.pi, '.2f') gives 2 digits after the point.
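For example (a small sketch of my own using the formats just mentioned):

import math

# rounding the display only: the stored binary value of pi is unchanged
print(format(math.pi, '.12g'))  # 3.14159265359 (12 significant digits)
print(format(math.pi, '.2f'))   # 3.14 (2 digits after the point)
print(repr(math.pi))            # 3.141592653589793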
Representation error refers to the fact that most decimal fractions cannot be represented exactly as binary (base 2) fractions in Python and other languages, and therefore won’t display the exact decimal number you expect. Most machines use IEEE-754 floating point arithmetic and almost all platforms map Python floats to IEEE-754 “double precision”, which contains 53 bits of precision. The computer strives to convert 0.1 to the closest fraction that it can of the form $J/2^{N}$ where J is an integer containing exactly 53 bits.
IEEE 754 has 3 basic components (a rough sketch of extracting them from a Python float follows below):
* The sign of the mantissa, where `0` represents a positive number and `1` represents a negative number.
* The biased exponent.
* The normalised mantissa.
[7],[8].
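A rough sketch of pulling those three components out of a Python float (the helper function is my own illustration; it assumes the standard 64-bit double layout of 1 sign bit, 11 exponent bits biased by 1023, and a 52-bit mantissa):

import struct

def ieee754_parts(x):
    # reinterpret the 64-bit double as an integer bit pattern (big-endian)
    bits = int.from_bytes(struct.pack('>d', x), 'big')
    sign = bits >> 63                  # 1 bit: 0 positive, 1 negative
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 stored bits of the normalised mantissa
    return sign, exponent, mantissa

print(ieee754_parts(0.1))  # for 0.1: sign 0, biased exponent 1019 (i.e. 2**-4), plus the 52-bit mantissa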
Python does have some solutions for dealing with these issues. Its decimal module implements decimal arithmetic suitable for accounting applications and other high-precision applications, while the fractions module implements arithmetic based on rational numbers (so numbers like $\frac{1}{3}$ can be represented exactly). The NumPy and SciPy packages are recommended for mathematical and statistical operations. Python also has a float.as_integer_ratio() method which expresses the value of a float as a fraction. This can help when you want to know the exact value of a float; as the ratio is exact, it can be used to losslessly recreate the original value.
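A brief sketch of those options (the modules are the standard-library decimal and fractions modules; the particular values are my own examples):

from fractions import Fraction
from decimal import Decimal, getcontext

# fractions stores 1/3 exactly as a ratio of two integers
third = Fraction(1, 3)
print(third, third * 3 == 1)     # 1/3 True

# decimal lets the working precision be chosen explicitly
getcontext().prec = 30
print(Decimal(1) / Decimal(3))   # 0.333... to 30 significant digits

# as_integer_ratio gives the exact fraction a float stores, so the float can be recreated losslessly
num, den = (0.1).as_integer_ratio()
print(num / den == 0.1)          # True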
Both Python's math module and the numpy library have a function called sqrt for calculating square roots. math.sqrt(x) calculates the square root of whatever is passed in as x. While these functions cannot be used for Task 1 itself, they can be used to check the results. If you square the result of either of these functions applied to the number 2, the result will be 2.0000000000000004, which demonstrates exactly the problem with performing calculations on the irrational number square root of two!
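A one-line check of that claim (my own, using math.sqrt only for verification as noted above):

import math

# squaring the float approximation does not give back exactly 2
print(math.sqrt(2) ** 2)  # 2.0000000000000004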
Finding the square root of a number is the inverse of finding the square of a number. The square of a number is the number times itself. The perfect squares are the squares of the whole numbers. The square root of a number is the number that when multiplied by itself gives the number.
The square roots of the perfect squares (e.g., 0, 1, 4, 9, 16) are integers. In all other cases, the square roots of positive integers are irrational numbers, and hence have non-repeating decimals in their decimal representations. [2]
While getting the square root of the perfect squares is straightforward enough, getting the square root of numbers that are not perfect squares is not. A manual calculation would start by looking at the square roots of the nearest perfect squares on either side of the number in question and involves some guesswork to begin. The most common iterative method of square root calculation by hand is known as the "Babylonian method". This is a guess-and-divide method which takes a starting estimate $x_0$ of the square root, divides the number $a$ whose square root you are looking for by this starting estimate, and then finds the average of the quotient $\frac{a}{x_0}$ and the divisor $x_0$ to get the next estimate $x_1$.
Newton's Method, which is also known as the Newton–Raphson method, can be used to get an estimate of the square root of a number. It is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. [9] Newton's Method is one of many methods of computing square roots and is outlined in detail in the Wikipedia article. It outlines how the problem of finding the square root of a number $a$ can be rephrased as finding the zero of $f(x)=x^2-a$, where $f$ is a single-variable function defined for a real variable $x$, with an initial guess of $x_0$ for a root of $f$. The derivative of the function is $f'(x)=2x$. Rearranging the formula yields the Babylonian method of finding square roots:
$$x_{n+1}=x_{n}-{\frac{f(x_{n})}{f'(x_{n})}}=x_{n}-{\frac{x_{n}^{2}-a}{2x_{n}}}={\frac {1}{2}}{\biggl(}2x_{n}-{\Bigl(}x_{n}-{\frac{a}{x_{n}}}{\Bigr)}{\biggr)}={\frac {1}{2}}{\Bigl(}x_{n}+{\frac {a}{x_{n}}}{\Bigr )}$$i.e. the arithmetic mean of the guess, $x_n$ and $\frac{a}{x_n}$
Using Newton's Method, regardless of the starting estimate, once the square of the estimate gets close enough to the number itself, that estimate is considered a good approximation for the square root. The Babylonian method uses the same iterative scheme as the Newton–Raphson method yields when applied to the function $y = f(x) = x^2 - a$, using the fact that its slope at any point is $dy/dx = f'(x) = 2x$. [1] You consider the nearest perfect squares on both sides of the number to make a starting guess of what the square root could be. The Babylonian method of guess and divide takes a starting estimate $x_0$ of the square root, divides the number $a$ whose square root you are looking for by this estimate, and then takes the average of the quotient $\frac{a}{x_0}$ and the divisor $x_0$ to get the next estimate $x_1$.
Rick Wicklin explains simply, in a blog post on Newton's method and an iterative square root algorithm,[10] how the algorithm taught in school for estimating a square root from an arbitrary guess is an iterative procedure. Given any positive number $S$ and any positive starting value $x_0$, apply the following formula: $$x_{n+1} = \frac{(x_n + S/x_n)}{2}$$ until the iteration converges to the square root of $S$. The same author has a post on the Babylonian method for finding square roots by hand.[11] Suppose you are given any positive number $S$. To find its square root do the following:
Wikipedia[1] similarly describes the process: The algorithm repeats a simple calculation that results in a number closer to the actual square root each time it is repeated with its result as the new input. The motivation is that if x is an overestimate to the square root of a nonnegative real number a then $\frac{a}{x}$ will be an underestimate and so the average of these two numbers is a better approximation than either of them. However, the inequality of arithmetic and geometric means shows this average is always an overestimate of the square root (as noted below), and so it can serve as a new overestimate with which to repeat the process, which converges as a consequence of the successive overestimates and underestimates being closer to each other after each iteration. To find x:
That is, if an arbitrary guess for $\sqrt{a}$ is $x_0$, and $x_{n+1} =\frac{x_n + \frac{a}{x_n}}{2}$, then each $x_n$ is an approximation of $\sqrt{a}$ which is better for large $n$ than for small $n$. If $a$ is positive, the convergence is quadratic, which means that in approaching the limit the number of correct digits roughly doubles in each iteration. If $a = 0$, the convergence is only linear. Using the identity $\sqrt{a}=2^{-n}{\sqrt{4^{n}a}}$, the computation of the square root of a positive number can be reduced to that of a number in the range $[1,4)$. This simplifies finding a start value for the iterative method that is close to the square root, for which a polynomial or piecewise-linear approximation can be used.
The time complexity for computing a square root with $n$ digits of precision is equivalent to that of multiplying two n-digit numbers.
The methods described above can be put into code using an iterative algorithm. First we make a guess of what the root of the number in question could be, based on the square roots of the nearest perfect squares on either side. Divide the number by this guess, add the quotient to the guess, and divide by two to get the average as the new starting guess. Keep repeating.
Find the two perfect squares that the number in question lies between and note their square roots.
Take one of those square roots as an estimate and divide the number in question by it.
Take the average of that estimate and the quotient from the previous step to get the next estimate.
x9 = (x8 + S/x8)/2 = (1.414213562373095 + 2/1.414213562373095)/2 = 1.414213562373095
math.sqrt(10) gives 3.1622776601683795.
As the illustration shows here, once you get to the 5th iteration, the result is rounded.
print(f"{(1.4166666666666665 + 2/1.4166666666666665)/2}")
print(f"{(1.4142156862745097 + 2/1.4142156862745097)/2}")
print(f"{(1.4142135623746899 + 2/1.4142135623746899)/2}")
print(f"{(1.414213562373095 + 2/1.414213562373095)/2}")
print(f"{(1.414213562373095 + 2/1.414213562373095)/2}")
# and on until you get the required accuracy
# trying to get the square root of a large multiple of two by using the power of 1/2
200000000000000000000000000000000000000**(1/2)
import numpy as np
print(np.sqrt(2))
import math
print(math.sqrt(2))
The most this method yielded was 15 digits. While the algorithm itself will calculate the square root of two approximately (accurate to at least 15 digits), it will not print out 100 decimal places so easily.
At some stage the result gets rounded, and this is the main issue, as outlined earlier, when dealing with real numbers. Wikipedia records the square root of 2 to 50 decimal places as 1.41421356237309504880168872420969807856967187537694.[1]
Numpy's sqrt function returns the same answer as my function above to 15 decimal places. The square root of 2 using np.sqrt formatted to 50 decimal places is 1.41421356237309514547462185873882845044136047363281, which differs from that recorded at Wikipedia.
I did try to increase the precision of the while loop to see if this would yield additional digits, but the maximum was still only 15 decimal places. This is as expected given the background of how computers deal with floating point numbers. Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. A consequence is that, in general, the decimal floating-point numbers you enter are only approximated by the binary floating-point numbers actually stored in the machine. [6] According to the Python Docs section on numeric types, integers have unlimited precision.
Therefore the next logical step to me was to apply the function to a very large integer multiple of 2, as integers do not have the same limitations as floats, and then divide the result by the same integer multiple. In the same way that the square root of 4 is 2, the square root of 400 is 20, and so on. The whole number before the decimal point cannot be rounded. In this way the digits of the square root of two could be produced, but with the decimal point in the incorrect place. Some string manipulation can then be used to return the correct result.
Applying either Newton's formula or the equivalent Babylonian guess-and-divide formula resulted in 1.4142135623730951 being returned as the approximate square root of two (using a precision of 0.0000000000001 to stop the while loop).
Applying the function to get the square root of $2\times10^{200}$ gave the result in scientific notation. Therefore the 100 digits cannot be extracted at this stage.
Nevertheless I compared the results to math.sqrt to see how they fare, and the results are the same.
Python only prints a decimal approximation to the true decimal value of the binary approximation stored by the machine. [6]. While the printed result might look like the exact value, the actual stored value is the nearest representable binary fraction. While string formatting can be used to produce a limited number of significant digits, you are really just rounding the display of the true machine value. Also as noted earlier every float operation can suffer a new rounding error.
Here I put the above code into a function. Without setting a stopping value the loop will take a long time as will setting the stopping value smaller than 0.000000000000001.
When applying the function to a large multiple of 2, $2\times10^{200}$, the function did not stop with the precision value of 15 decimal places, so I changed it to 0.
The intention here was to use string formatting to get the digits out, remove the scientific notation at the end and construct a string representation of the square root of 2 to 100 decimal places.
So the result of the square root of $2\times10^{200}$ is 1.414213562373095e+100, which is presented to us in scientific notation and therefore foils my attempt to get the 100 digits! Attempting to apply formatting to get 100 significant digits does not return the same digits as reported on Wikipedia. It did however match the result from applying the math.sqrt function to get the square root of $2\times10^{200}$ and then using string formatting to get the 100 significant digits.
# S is the number to get the square root of, S = 2
def sqrt2(S):
    # take the first estimate x to be the number itself
    x = S
    # repeat until the difference between the estimate squared and the number is small enough
    # without setting a stopping value here, the loop will continue for a long time
    while abs(x*x - S) > 0.000000000000001:
        # update the estimate using Newton's formula
        x -= ((x*x) - S) / (2 * x)
        # or equivalently the Babylonian guess-and-divide formula: x = (x + S/x)/2
    return x

print(sqrt2(2))
# string formatting to get 100 significant digits
print(format(sqrt2(2), '.100g'))
# define the function, using a multiple of 2 and changing the precision
def sqrt2(S):
    S = S * 10**200
    # first estimate
    x = S
    # repeat until squaring the estimate gives the number back exactly
    while abs(x*x - S) > 0:
        # update the estimate to be the mean of the current estimate and the number divided by the estimate
        x = (x + S/x) / 2
        # x -= ((x*x) - S) / (2 * x)  # this gives 1 extra digit
    return x

print(sqrt2(2))
# string formatting to get 100 significant digits
print(format(sqrt2(2), '.100g'))
Python's math module square root function math.sqrt.
import math
# calculating the square root of 2e200
math_root2 = math.sqrt(2*(10**200))
print(math_root2)
format(math_root2,'.100g')
So back to the drawing board. A further bit of research found an article on getting the square roots of integers on geeksforgeeks.com[12] using a binary search to find the largest integer i whose square is less than or equal to the original number. The value of $i \times i$ is monotonically increasing in $i$, so the problem can be solved using binary search. The code sample there, contributed by Nikita Tiwari, uses integer division and maintains the lower and upper bounds, the midpoint and the answer at each iteration. The loop runs until the lower bound exceeds the upper bound. At each step it checks whether the midpoint squared equals the number and otherwise narrows the search to one half of the search space.
If I apply this function below to get the square root of two, it returns 1, as it is using integer division. However, if I apply it to a very large multiple of two, $2\times10^{200}$, as outlined earlier, I can get the integer result of the square root of $2\times10^{200}$. While this is not the answer to the square root of two, which is what the task here calls for, it does contain the first 100 digits of the square root of 2! Therefore in the function below I adapt the code to include some string manipulation of the square root of $2\times10^{200}$ and return the first 100 digits of the square root of 2. First calculate the square root of $2\times10^{200}$, then convert the result (which is an integer) to a string, iterate through each digit of the string and concatenate the digits to a new string that begins with a '1' followed by a decimal point. In this way I have calculated and printed to the screen the square root of 2 to 100 decimal places.
# adapted and extended from code on geeksforgeeks contributed by Nikita Tiwari.
# Returns the first 100 digits of the square root of 2 as a string
def sqrt2():
    # set up an empty string to hold the result
    sqrt_string = ""
    S = 2 * 10**200
    # base case is only required for the general case
    if (S == 0 or S == 1):
        return S
    # do binary search to find the floor of the square root of S
    lower = 1
    upper = S
    while (lower <= upper):
        x = (lower + upper) // 2
        # if x squared equals S then S is a perfect square, so return x
        if (x*x == S):
            return x
        # as we need the floor, update the answer when x*x is smaller than S,
        # and move closer to the square root of S
        if (x*x < S):
            lower = x + 1
            ans = x
        else:
            # if x*x is greater than S
            upper = x - 1
    # return ans
    # create a string version of the result
    str_ans = str(ans)
    for s in str_ans:
        sqrt_string += s
    # concatenate the strings to produce the square root of 2
    sqrt_string = "1." + sqrt_string[1:]
    # print(f"The square root of 2 to 100 decimal places is \n{sqrt_string}")
    # returning ans here would give the integer square root of 2e200
    # return ans
    # instead return the string containing the first 100 digits of the square root of 2
    print(f"The first 100 digits of the square root of 2 are \n{sqrt_string}")
    return sqrt_string

# call the function to print the first 100 digits of the square root of 2
sqrt2()
Nerdparadise.com has the digits of the square root of 2 to 10,000 places.[13] Here are the first 100 digits as reported there: 1. 4142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727
These are the string results from the python function above:
14142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727
The sqrt2() function above does yield the correct digits for the square root of 2. It is a string, however, and further numerical calculations cannot be performed on the string. Converting from one type to another will not always return results as expected; the result agrees with math.sqrt to 15 decimal places. This task illustrates the problems with accuracy to high levels of precision and also how, when converting between types, you won't always get what you expect. The lecture on data types and numbers highlighted that there are some irrational numbers that simply cannot be stored on a computer. Procedures can be written to produce the value when needed though. The article from the Python tutorial [6] on the problems with floating point numbers showed an example of how there are many different decimal numbers which actually share the same nearest approximate binary fraction. It demonstrated how the numbers 0.1 and 0.10000000000000001 and 0.1000000000000000055511151231257827021181583404541015625 are all approximated by 3602879701896397 / 2**55.
The first function above, applied to get the square root of $2\times10^{200}$, returned scientific notation, as does the math.sqrt function when applied to such a large number. Using string formatting to display the actual digits did not print the correct digits to the screen. This shows that what we see printed out is not always the exact true value.
In reality, while we would be concerned with the accuracy of results, other considerations would be the computational cost and time. When I tried to run the earlier versions of the square root function, my computer struggled and I had to interrupt the process; the final function above uses bounds though. Newton's method uses approximations and, while it converges towards the actual result, it never reaches it in a finite number of steps. This is combined with the problems outlined earlier of computing and storing real numbers on a computer. I came across MIT course notes on Square Roots via Newton's Method[14] online. The author S. G. Johnson notes that we are concerned not only with the existence and correctness of the solutions, but with the time (and other computational resources, e.g. memory) required to compute the result. Some algorithms, like Newton's method, are intrinsically approximate and while they converge towards the desired result they never reach it in a finite number of steps. How fast they converge is a key question. He also points out that arithmetic with real numbers is approximate on a computer, because we approximate the set $\mathbb{R}$ of real numbers by the set $\mathbb{F}$ of floating-point numbers, and the result of every elementary operation (+, −, ×, /) is rounded to the nearest element of $\mathbb{F}$. Therefore it is important to understand floating point numbers and how the accumulation of these rounding errors affects different algorithms.
Wikipedia notes that the time complexity for computing a square root with $n$ digits of precision is equivalent to that of multiplying two n-digit numbers.
The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example [4], stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell.
| | A | B | C | D | Total |
|---|---|---|---|---|---|
| White Collar | 90 | 60 | 104 | 95 | 349 |
| Blue Collar | 30 | 50 | 51 | 20 | 151 |
| No Collar | 30 | 40 | 45 | 35 | 150 |
| Total | 150 | 150 | 200 | 150 | 650 |
Wikipedia contributors, “Chi-squared test — Wikipedia, the free encyclopedia,” 2020. [Online; accessed 1-November-2020]. Available: https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096
A chi-squared test, also written as $\chi^2$ test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. In the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a $\chi^2$ frequency distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true. Test statistics that follow a $\chi^2$ distribution occur when the observations are independent and normally distributed, which assumptions are often justified under the central limit theorem. There are also $\chi^2$ tests for testing the null hypothesis of independence of a pair of random variables based on observations of the pairs. Chi-squared tests often refers to tests for which the distribution of the test statistic approaches the $\chi^2$ distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase. [15]
In the Wikipedia article from which the above table was taken, the Chi-Squared test is testing the null hypothesis of independence of a pair of random variables based on observations of the pairs, i.e. to analyse whether two categorical variables are independent. Hypothesis testing is a decision making process for evaluating a hypothesis. A hypothesis is a claim about a population parameter. Hypothesis testing consists of the following steps:
In a statistical hypothesis problem there are always two and only two contradictory hypotheses. A Null Hypothesis $H_0$ is a claim or statement about population parameters that is assumed to be true until it is declared false. An Alternative Hypothesis $H_A$ or $H_1$ is the competing hypothesis, it is a claim about a population parameter that will be true if the null hypothesis is false. The null hypothesis is that each person's neighbourhood of residence is independent of the person's occupational classification. The decision whether or not to reject a null hypothesis $H_0$ is based on the evidence contained in the sample data.
The data here are tabulated on a two-way frequency table. Two-way frequency tables display frequencies or relative frequencies for two categorical variables. Categorical data is data where each observation belongs to one of a set of categories. It is often referred to as count data because the number of observations belonging to a particular category are counted. The data is bivariate as two variables are measured on each member of the sample. The data entries in the table refer to two categories. The two-way frequency table is used to examine the relationship between two categorical variables, in this case to see if the proportions of residents living in the neighbourhoods (belonging to the categories) A, B, C and D are the same for the occupational categories of White Collar, Blue Collar and No Collar.
The city has a population of 1,000,000 residents with four neighbourhoods, A, B, C and D, from which a random sample of 650 residents is taken and their occupation recorded as "white collar", "blue collar" or "no collar".
A $\chi^2$ test is used to determine if there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. If the null hypothesis that there are no differences between the classes in the population is true then the test statistic computed from the observations follows a $\chi^2$ frequency distribution. The test evaluates how likely the observed frequencies would be assuming the null hypothesis is true, the null hypothesis being that each person's neighborhood of residence is independent of the person's occupational classification.
| | A | B | C | D | Total |
|---|---|---|---|---|---|
| White Collar | 90 | 60 | 104 | 95 | 349 |
| Blue Collar | 30 | 50 | 51 | 20 | 151 |
| No Collar | 30 | 40 | 45 | 35 | 150 |
| Total | 150 | 150 | 200 | 150 | 650 |
The cell counts are the numbers in a particular sample that belong to a particular category. They are frequency counts and the table is a frequency table. Relative frequencies such as percentages or proportions could also be used to represent the same data, in which case it would be a relative frequency table with the table entries showing the conditional distribution. The column and row totals are known as the marginal totals or the marginal distribution, and each set of marginal totals adds to a single grand total. In this case the row totals are the sizes of the samples selected from the different occupational populations (i.e. white-collar, blue-collar, no-collar). The column totals are the sizes of the samples selected from the different neighbourhood populations A, B, C and D.
The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification and therefore the neighbourhood category proportions are the same for all occupational classifications. Knowing a person's occupational classification therefore would not tell you anything about where they live. This hypothesis considers whether the proportion of workers living in the neighbourhoods "A", "B","C" and "D" are the same for the person's classification as White-Collar, Blue-Collar or No-Collar.
150 of the sample live in neighbourhood A. This figure can be used to estimate what proportion of the 1 million residents live in neighbourhood A, and similar estimates can be calculated for neighbourhoods B, C and D. In the same way, 349 of the sample are white-collar workers, and this figure can be used to estimate what proportion of the 1 million residents are white-collar workers; again, similar estimates can be calculated for blue-collar and no-collar workers. The proportions in each neighbourhood A, B, C and D are calculated below based on the sample at 23.08%, 23.08%, 30.77% and 23.08% respectively. The proportions of each occupation type White Collar, Blue Collar and No Collar are 53.69%, 23.23% and 23.08% respectively.
We need to calculate the frequencies (cell counts) that would be expected if the null hypothesis were true. Assuming the null hypothesis is true, for a two-way table the expected cell count for each cell is $\frac{\text{row total} \times \text{column total}}{\text{grand total}}$.
For example, by the assumption of independence under the null hypothesis we should expect the number of White Collar workers in neighbourhood A to be: $$150 \times \frac{349}{650} \approx 80.54$$ This is the value for the expected frequency in the first cell of the first row. The expected frequencies for the remaining cells are calculated in a similar fashion to populate the expected frequency table.
A, B, C, D =150, 150, 200,150
print(f"The column totals are {A} in neighbourhood A, {B} in B, {C} in C and {D} in D ")
White, Blue, No = 349,151,150
print(f"The Row totals are {White} for White Collar, {Blue} for Blue Collar and {No} for No Collar. ")
Total = A+B+C+D
print(f"The Grand Total in the sample is {Total}.")
print("The proportions in each neighbourhood A, B C and D are {:.2%}, {:.2%}, {:.2%} and {:.2%} respectively.".format(A/Total,B/Total,C/Total,D/Total))
print("The proportions of each occupation type White Collar, Blue Collar and No Collar are {:.2%}, {:.2%} and {:.2%} respectively.".format(White/Total,Blue/Total,No/Total))
print("The expected number of White Collars in each of the neighbourhoods are: A: {:.2f}, B: {:.2f}, C: {:.2f} and D: {:.2f}.".format(A*White/Total,B*White/Total,C*White/Total,D*White/Total))
print("The expected number of Blue Collars in each of the neighbourhoods are: \tA: {:.2f}, B: {:.2f}, C: {:.2f} and D: {:.2f}.".format(A*Blue/Total,B*Blue/Total,C*Blue/Total,D*Blue/Total))
print("The expected number of No Collars in each of the neighbourhoods are: \tA: {:.2f}, B: {:.2f}, C: {:.2f} and D: {:.2f} .".format(A*No/Total,B*No/Total,C*No/Total,D*No/Total))
Once you have the expected frequencies or expected cell counts as well as the observed frequencies, the next step is to compare them. If the differences between them are so big that they could not be explained by sampling variation, then the null hypothesis is rejected in favour of the alternative hypothesis.
The comparison of observed and expected cell counts is made by calculating the following for each cell:
$$\frac{(\text{Observed Count} - \text{Expected Count})^2}{\text{Expected Count}}$$ For example the value for the first cell is $\frac{(90-80.54)^2}{80.54}\approx1.11$. Next the calculated values for all the cells are added together to get the chi-squared test statistic.
$$\chi^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}$$The sum of these quantities over all of the cells is the $\chi^2$ test statistic which is $\approx24.6$ as per the Wikipedia article. Assuming the null hypothesis to be true, $\chi^2$ is approximately chi-squared distributed with degrees of freedom being $(\text{number of rows}-1)({\text{number of columns}}-1)$ which is $(3-1)(4-1)=6$ in this case.
If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.
If calculated manually the test statistic would be compared to the critical value (from a table such as the New Cambridge Statistical Tables) at a particular alpha level such as 0.05 taking into account the degrees of freedom.
Here I will put the data from the contingency table into a Pandas dataframe and then create an expected frequency table based on the observed data. The SciPy stats package will actually calculate the value of the chi-squared test statistic. This is just for illustration and to verify the results.
I first created a dataframe containing the actual observed cell counts and from this created a dataframe with the expected cell counts that would be expected under the null hypothesis. The marginal row and column totals as well as the overall grand total are the same as totals for the observed frequencies tables. From this the formula above is applied to calculate the chi-squared test statistic. (note there may be some rounding here)
import numpy as np
import pandas as pd
# create a pandas dataframe and add a row index
df=pd.DataFrame({"A":[90,30,30], "B":[60,50,40],"C":[104,51,45],"D":[95,20,35]})
df.index = ['White_collar', 'Blue_collar', 'No-Collar']
# add a total column to sum across each row
df["Total"] = df.sum(axis=1)
# add a row to sum across column totals
#(https://stackoverflow.com/questions/20804673/appending-column-totals-to-a-pandas-dataframe)
df.loc['Total']= df.sum()
df.style.format("{:.2f}")
# get the marginal row totals
WhiteCollarTotal =df.iat[0, 4]
BlueCollarTotal=df.iat[1,4]
NoCollarTotal=df.iat[2,4]
# get the marginal column total
ATotal=df.iat[3,0]
BTotal=df.iat[3,1]
CTotal=df.iat[3,2]
DTotal=df.iat[3,3]
# get the grand total
GrandTotal = df.iat[3,4]
# create a pandas dataframe using dictionaries containing zero values to start
dfE =pd.DataFrame({"A":[0,0,0], "B":[0,0,0],"C":[0,0,0],"D":[0,0,0]},dtype='float32')
# add an index
dfE.index = ['White_collar', 'Blue_collar', 'No-Collar']
# Calculate the White collar expected frequencies
dfE.iat[0,0]=(WhiteCollarTotal*ATotal)/GrandTotal
dfE.iat[0,1]=(WhiteCollarTotal*BTotal)/GrandTotal
dfE.iat[0,2]=(WhiteCollarTotal*CTotal)/GrandTotal
dfE.iat[0,3]=(WhiteCollarTotal*DTotal)/GrandTotal
# Blue collar
dfE.iat[1,0]=(BlueCollarTotal*ATotal)/GrandTotal
dfE.iat[1,1]=(BlueCollarTotal*BTotal)/GrandTotal
dfE.iat[1,2]=(BlueCollarTotal*CTotal)/GrandTotal
dfE.iat[1,3]=(BlueCollarTotal*DTotal)/GrandTotal
# No Collar
dfE.iat[2,0]=(NoCollarTotal*ATotal)/GrandTotal
dfE.iat[2,1]=(NoCollarTotal*BTotal)/GrandTotal
dfE.iat[2,2]=(NoCollarTotal*CTotal)/GrandTotal
dfE.iat[2,3]=(NoCollarTotal*DTotal)/GrandTotal
# add a total column to sum across each row
dfE["Total"] = dfE.sum(axis=1)
# add a row to sum across column totals
dfE.loc['Total']= dfE.sum()
dfE.round(2)
#dfE.style.format("{:.2f}")
dfE.style.format("{:.2f}")
dfE
# calculate the chi-squared statistic using the formula
chi = ((df-dfE)**2)/dfE
chi
chi["Total"] = chi.sum(axis=1)
chi.loc['Total']= chi.sum()
chi
print(f"The Chi-squared test statistic is calculated as {chi['Total']['Total']}")
The Chi-Squared test statistic here is calculated as 24.57120296026715 which when rounded to 1 decimal place is the same as the Chi-squared value in the Wikipedia article of $\approx24.6$. [15]
The next step is to use scipy.stats to verify this value and calculate the associated $p$ value.
The scipy.stats module has a number of statistical tests. There is a chisquare function which is used to calculate a one-way chi-square test. This chi-square test function tests the null hypothesis that the categorical data has the given frequencies, and takes as parameters an array of observed frequencies or occurrences in each category, an array of expected frequencies in each category and the degrees of freedom.
There is another function, the chi2_contingency function which computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in a contingency table. It computes the expected frequencies based on the marginal sums under the assumption of independence.
The contingency.expected_freq(observed) function can be used to compute the expected frequencies from a contingency table. It takes as input parameters the n-dimensional contingency table of observed frequency values or counts and then computes the expected frequencies for the table based on the marginal sums, under the assumption that the groups associated with each dimension are independent.
The observed values must be passed into the function using a numpy array:
# observed is the multidimensional arrays of values from the original table.
observed = np.array([[90,60,104,95],[30,50,51,20],[30,40,45,35]])
print("The table of observed values: \n")
observed
The contingency.expected_freq(observed) function returns an array of expected frequencies, based on the marginal sums of the table. The expected values returned will be of type float. The shape of the expected values will be the same as that of the observed values.
Our table has 3 rows and 4 columns of observed values, so the array of expected values will have the same shape. Behind the scenes the expected values are calculated from the marginal sums of the table, in the same way as I showed above.
# to get an array of the expected frequencies as calculated by the function
from scipy.stats.contingency import expected_freq
print("The expected frequency table: \n")
expected_freq(observed)
# the shape of the observed array
observed.shape
This chi-squared test is a test for the independence of different categories of a population and should only be used when the dimension of observed is two or more. chi2_contingency computes the chi-square statistic: it first computes the expected frequencies and the degrees of freedom from the given contingency table. The degrees of freedom is calculated as dof = observed.size - sum(observed.shape) + observed.ndim - 1.
The function returns four pieces of output:
# import from scipy.stats
from scipy.stats import chi2_contingency
# Calculate the chi-squared test statistic, the p-value of the test, the degrees of freedom and the expected frequencies.
chi2_contingency(observed)
#print(f"The chi-squared statistic calculated using the scipy stats package is {chi2_contingency(obs)[0]} ")
print(f"The degrees of freedom is {observed.size - sum(observed.shape) + observed.ndim - 1}")
print(f"The chi-squared statistic calculated using the scipy stats package is {chi2_contingency(observed)[0]} \n")
print(f"The p-value is {chi2_contingency(observed)[1]} ")
The Chi-squared statistic calculated by scipy.stats is the same value as I manually calculated above using a Pandas dataframe and the actual formulas. It also agrees with the value reported in the Wikipedia article. [15]
The null hypothesis as stated in the article with the table is that each person's neighborhood of residence is independent of the person's occupational classification. The decision on whether or not to reject the null hypothesis is based on the evidence in the sample data. The result of the hypothesis test will either be:
In the example above first the population under study was defined as the hypothetical city of 1,000,000 residents with four neighbourhoods, A, B C and D.
A random sample of 650 residents were taken and their occupation recorded as "white collar", "blue collar", or "no collar".
The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification.
The evidence is evaluated using the Chi-squared statistical test, which tests for the independence of two categorical variables. A sample statistic is based on a sample taken from a population and corresponds to the population characteristic under investigation. It was calculated here as 24.5712, which when rounded is the same as the value on the Wikipedia page. The p-value was also calculated by scipy.stats as 0.0004098425861096696.
These values can be used to decide if the two categorical variables studied are independent or not. If the sample statistic is far enough away from the hypothesised value and in the direction of the alternative hypothesis $H_A$ then we can conclude that the null hypothesis $H_0$ is probably false. This would mean that the sample could not have come from a population with the hypothesised value and the null hypothesis is rejected. Otherwise we fail to reject the null hypothesis.
Any decision making procedure can lead to wrong decisions being made. A Type I Error is where the null hypothesis $H_0$ is rejected when it is true. A Type II Error is where we do not reject the null hypothesis when it is false. With statistical hypothesis testing the probability of making a Type I or a Type II error can be quantified and controlled. The probability of making a Type I error is known as the test's significance level and is referred to as $\alpha$ or alpha. The probability of a Type II error is known as $\beta$. The power of a test is $1-\beta$. Reducing the $\alpha$ value increases the $\beta$ value and vice versa. The only way to reduce $\beta$ without increasing $\alpha$ is by increasing the sample size. The value of $\alpha$ is usually set at 0.05 or 0.01. The null hypothesis is rejected if the sample statistic is far enough away from the hypothesised value.
When the observed and expected cell counts are compared, if the differences between them are too big to be explained by sampling variation then the null hypothesis is rejected in favour of the alternative hypothesis. The bigger the difference, the less likely a Type I error will arise. The definition of how big is based on $\alpha$. A test statistic is calculated which belongs to a known distribution, in this case the $\chi^2$ frequency distribution.
The null hypothesis is rejected if the chi-squared test statistic is greater than some critical value. The critical value is chosen in such a way that the probability of a Type I Error is equal to $\alpha$, and to do this you need to know the sampling distribution of $\chi^2$ when the null hypothesis is true. (The distribution is approximately a chi-squared distribution with the appropriate degrees of freedom.) The critical value divides the possible values of the test statistic into a critical rejection region and a non-critical non-rejection region. If the test statistic lies in the rejection region then the null hypothesis is rejected, otherwise we fail to reject the null hypothesis.
Large values of the chi-squared statistic suggest rejection of the null hypothesis in favour of the alternative hypothesis, because the overall difference between what was observed and what would have been expected under the null hypothesis is larger. Smaller values of the chi-squared test statistic are compatible with the null hypothesis.
The Chi-squared value is compared to the Chi-squared critical value, which takes into account the alpha level of significance and the degrees of freedom. If the chi-squared test statistic is greater than the critical value of chi-squared at that alpha level for the number of degrees of freedom, then the null hypothesis of independence is rejected and we can conclude that the data contains sufficient evidence of neighbourhood preference by occupational classification. Without computers you would have to refer to statistical tables such as the New Cambridge Statistical Tables [17] and compare your calculated test statistic with the critical values, using the alpha level and the degrees of freedom. The chi-squared critical value from page 41 of Table 8 in those tables is 12.59, and since the chi-squared test statistic is greater than the critical value we can reject the null hypothesis and conclude that the data contains sufficient evidence of neighbourhood preference by occupational classification.
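As a cross-check (my own addition, not part of the original calculation), the critical value and the p-value can also be obtained directly from the chi-squared distribution in scipy.stats rather than from printed tables, assuming the alpha level of 0.05 and the 6 degrees of freedom used above:

from scipy.stats import chi2

critical_value = chi2.ppf(1 - 0.05, df=6)  # upper 5% point of the chi-squared distribution with 6 dof
p_value = chi2.sf(24.5712, df=6)           # survival function = 1 - cdf, evaluated at the test statistic
print(critical_value)                      # approximately 12.59
print(p_value)                             # approximately 0.0004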
The chi2_contingency function from the SciPy stats package calculates and provides all of the information required to make the same decision given the same data. It does not return the critical value, but it does return the chi-squared test statistic, the p-value, the degrees of freedom and the expected frequencies.
The $\alpha$ value is the probability of rejecting the null hypothesis when it is true. It is generally chosen as 0.05 but a smaller alpha value of 0.01 can be used where more precision is required. If the p-value for the test is greater than the alpha value then the null hypothesis is not rejected.
The calculated p-value is 0.0004098425861096696 and this is less than the alpha level of 0.05 which means that the null hypothesis should be rejected in favour of the alternative hypothesis. We can conclude that the data supports the hypothesis that the choice of neighbourhood is not independent of occupational classification. The two categorical variables in this data are not independent. The data contains sufficient evidence of neighbourhood preference by occupational classification.
Note that a condition of the Chi-squared test is that the approximation is good when every expected cell count is greater than or equal to 5, and that the $n$ observed counts are a random sample from the population of interest.
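A quick check of that condition for this table (reusing the observed array and the expected_freq import from the cells above):

# every expected cell count should be at least 5 for the chi-squared approximation to be reliable
print((expected_freq(observed) >= 5).all())  # True for this table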
The standard deviation of an array of numbers x is calculated using numpy as np.sqrt(np.sum((x - np.mean(x))**2)/len(x)). However, Microsoft Excel has two different versions of the standard deviation calculation, STDDEV.P and STDDEV.S. The STDDEV.P function performs the above calculation, but in the STDDEV.S calculation the division is by len(x)-1 rather than len(x). Use numpy to perform a simulation demonstrating that the STDDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample.
Note that part of this task is to figure out the terminology in the previous sentence.

STDDEV.P and STDDEV.S functions

To begin I refer to the Microsoft documentation here on these two Excel functions.
The standard deviation is a measure of how widely values are dispersed from the average value (the mean).
According to the Microsoft documentation, the STDEV function estimates standard deviation based on a sample. This function has been replaced with one or more new functions that may provide improved accuracy and whose names better reflect their usage: these are the STDDEV.P and STDDEV.S functions.
STDEV.S assumes that its arguments are a sample of the population, and the standard deviation is calculated using the "n-1" method. If your data represents the entire population, then compute the standard deviation using STDEV.P.

The aim of statistics is to make statements about the population using a sample. A parameter is a population characteristic, while a sample statistic is any quantity computed from values in the sample. A sample of size $\text{n}$ is just one of many that can be drawn from a population of size $\text{N}$. While the value of a population parameter is fixed, the value of a statistic varies from sample to sample. It is a random variable and its probability distribution is known as its sampling distribution.
The population mean $\mu$ is the sum of every data point in the population divided by the number of observations ($N$ data points). The sample mean $\bar{x}$ is the sum of the observations in a sample divided by the number of observations $\text{n}$. The next thing to consider is the dispersion of the data points from the mean. The variance is the sum of the squared distances from the population mean divided by the total number of data points in the population, i.e. the average of the squared difference of each data point from the mean. The standard deviation, which is the square root of the variance, is the most commonly used measure of the dispersion of a data set. The larger the standard deviation, the greater the spread of the data.
A statistic is biased if, in the long run, it consistently over or underestimates the parameter it is estimating. More technically it is biased if its expected value is not equal to the parameter. A statistic is positively biased if it tends to overestimate the parameter; a statistic is negatively biased if it tends to underestimate the parameter. An unbiased statistic is not necessarily an accurate statistic. If a statistic is sometimes much too high and sometimes much too low, it can still be unbiased. It would be very imprecise, however. A slightly biased statistic that systematically results in very small overestimates of a parameter could be quite efficient. [20]
n-1 is usually used in the sample variance formula instead of n because dividing by n tends to produce an underestimate of the population variance; dividing by n-1 provides the appropriate correction for this tendency.
According to Wikipedia, using n-1 instead of n in the formula for the sample variance and sample standard deviation is known as Bessel's correction, named after Friedrich Bessel.
This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation. However, the correction often increases the mean squared error in these estimations. In estimating the population variance from a sample when the population mean is unknown, the uncorrected sample variance is the mean of the squares of deviations of sample values from the sample mean (i.e. using a multiplicative factor 1/n). In this case, the sample variance is a biased estimator of the population variance. Multiplying the uncorrected sample variance by the factor $\frac{n}{n-1}$ gives an unbiased estimator of the population variance.
Bessel's correction corresponds to the number of degrees of freedom in the residuals vector $x_1-\bar{x},...,x_n - \bar{x}$, where $\bar{x}$ is the sample mean. There are n independent observations in the sample but only n-1 independent residuals, because the residuals sum to zero. Generally, Bessel's correction is an approach to reduce the bias due to finite sample size.
So there is more than one way that the sample variance could be calculated. The biased sample variance $S_n$ is not an unbiased estimator of the population variance: from every one of the n data points in the sample you subtract the sample mean $\bar{x}$, square and sum these differences, and divide by the number of data points n. To get the unbiased estimate instead, you divide by n-1 rather than n. As n-1 is a smaller number than n, dividing by the smaller number gives a larger value for the sample variance. This unbiased estimate is usually denoted by $S_{n-1}$.
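A small sketch of the two versions of the sample variance calculation (the numbers are made up, for illustration only):
import numpy as np
sample = np.array([4.0, 7.0, 6.0, 3.0, 5.0])  # toy sample
n = len(sample)
xbar = sample.mean()
s2_biased = ((sample - xbar) ** 2).sum() / n          # S_n: divide by n
s2_unbiased = ((sample - xbar) ** 2).sum() / (n - 1)  # S_{n-1}: divide by n-1, always slightly larger
print(s2_biased, s2_unbiased)        # 2.0 2.5
print(s2_biased * n / (n - 1))       # multiplying by n/(n-1) (Bessel's correction) recovers S_{n-1}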
The population variance is calculated using the formula: $$\sigma^2=\frac{\sum(x_i-\mu)^2}{N}$$
The sample variance is calculated using the formula:
$$S^2=\frac{\sum(x_i-\bar{x})^2}{n-1}$$
The task requires using Python's NumPy package to perform a simulation demonstrating that the STDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample.
The STDEV.S function estimates standard deviation based on a sample; it assumes that its arguments are a sample of the population and therefore the standard deviation is calculated using the "n-1" method. If the data represented the entire population, then the standard deviation should be computed using STDEV.P.
Numpy has a function std for calculating the standard deviation of an array of numbers. Referring to the numpy.std documentation, Numpy's standard deviation is calculated as the square root of the average of the squared deviations from the mean, i.e. std = sqrt(mean(abs(x - x.mean())**2)), where the average squared deviation is typically calculated as x.sum() / N, with N = len(x). This makes it similar to Excel's STDEV.P function.
The std function can take some optional parameters, one of which is ddof, the delta degrees of freedom. When the default of zero is used, the divisor is N, but if this parameter is specified then the divisor becomes N - ddof. Therefore, if this parameter is set to 1, the formula changes from the population standard deviation to the sample standard deviation.
The Numpy documentation notes that, in standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population, while ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se. [21]
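A minimal illustration of the ddof parameter (the array is just a toy example):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.std(x))           # divisor N, like Excel's STDEV.P: sqrt(2) ≈ 1.414
print(np.std(x, ddof=1))   # divisor N-1, like Excel's STDEV.S: sqrt(2.5) ≈ 1.581
# the ddof=1 result matches the manual n-1 calculation
print(np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1)))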
Usually we are interested in measurements of a population but it is not practical to measure the whole population. Therefore we take representative samples from the population and compute statistics from the samples instead. A statistic computed from a sample is used as an estimator of the corresponding population parameter. The average of a single sample taken from a population is unlikely to be the actual true average for the population, but the process can be repeated many times. NumPy's random distribution functions can be used to imitate this process by simulating many samples of a set sample size from a distribution. The means and standard deviations from the samples can then be used to get the sampling distribution of the mean. While some random samples may have randomly big or randomly small values, overall you can take the average from each of the samples and relate these to the mean of the overall population they were taken from.
Below I will simulate a population using NumPy's random module, select some samples from the population, calculate some sample statistics, and then compare the sample statistics with the population parameters. I could use the standard normal distribution as the population; it has a mean of zero and a standard deviation and variance of 1. Alternatively, any normal distribution can be used as the population by specifying $\mu$ and $\sigma^2$, or any other distribution such as the Uniform distribution. In general you could take the average of the sample statistics from any population distribution and see how they relate to the population that the samples are drawn from. The mean is an estimator of the centre of the distribution. If all the possible samples were taken and the sample statistics were calculated for each sample, you could then take the mean of all the sample means and expect this to be close to the mean of the population and therefore a good estimator of the population mean. If you calculated the mean and variance for each of the samples and looked at how they were distributed, you would have their sampling distributions.
We would expect to see that sample statistics calculated from some samples would be close to the population parameters while others might vary somewhat. This depends largely on the size of the sample and the actual random values that were drawn from the population, which will vary from sample to sample. Some of the random samples will have means that are randomly quite big or small, or even more extreme. It depends on where exactly the samples came from in the population.
Given the background above, we should expect to see that using n instead of n-1 in the variance and standard deviation calculations underestimates the true variance and standard deviation, while using n-1 should give values closer to the true population parameters.
I will draw 1,000 samples, each containing 10 points, then calculate the sample mean, sample variance and sample standard deviation for each sample.
The variable samps will hold the 1,000 samples of 10 data points. The mean of samps will not be exactly 0, as it is 10,000 points sampled from the standard normal distribution with mean 0, but it should be fairly close.
The same analysis could easily be applied to a normal distribution with a specified mean and standard deviation by specifying the loc and scale parameters as well as the size, and substituting np.random.normal for np.random.standard_normal in the code below.
# import libraries using common alias names
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 8 # makes plots bigger
The data below is generated from the standard normal distribution, which has a mean of 0 and a variance of 1. In a standard normal distribution, 99.73% of the values fall between -3 and 3.
plt.rcParams['figure.figsize'] = 12, 8 # makes plots big
fig = sns.histplot(np.random.standard_normal(100000), label="$\mu=0,\ \sigma=1$")
#plt.legend(loc="best")
# note: plt.xlabel and plt.ylabel are functions and must be called, not assigned to
plt.xlabel('Value')
plt.ylabel('Count')
#plt.title("Histogram")
plt.suptitle('Standard Normal Distribution with $\mu$=0, $\sigma^2=1$')
plt.show()
Numpy.random's standard_normal function, numpy.random.standard_normal(size=None), samples from a standard Normal distribution (mean=0, stdev=1). The size parameter can be provided as an integer or tuple to return an array containing a sample of that size. In the code below 1,000 samples of size 10 are generated from a standard normal population. In a standard normal distribution, 99.73% of the values fall between -3 and 3. Without setting a random seed, each run of this code gives slightly different statistics.
(The same analysis could be applied to a normal distribution with a specified mean and standard deviation by specifying the loc and scale parameters. For example, in the code below use samps = np.random.normal(5, 2, size = (no_samples, sample_size)) to perform the same type of analysis on a normal distribution with a mean of 5 and a standard deviation of 2.)
import numpy as np
np.set_printoptions(formatter={'float': lambda x: "{0:6.3f}".format(x)})
sample_size = 10
no_samples = 1000
# size here is a tuple (no_samples, sample_size), giving no_samples samples of sample_size points each
samps = np.random.standard_normal((no_samples, sample_size))
# alternatively, to run the analysis on a normal distribution with mean 5 and standard deviation 2:
#samps = np.random.normal(5, 2, size = (no_samples, sample_size))
print(f"The mean of all the samples is {samps.mean()}")
print(f"The samples contain {samps.size} points in total")
print(samps.shape)
The 10,000 data points generated were put into 1,000 sub-arrays of size 10, representing 1,000 samples each containing 10 data points. The mean of each sample of 10 can be obtained by applying the mean() function (at axis level 1) or manually by summing each sample array of 10 data points and dividing by the number of data points in the sample (10).
Similarly the variances and standard deviations can be calculated using the formulas or by applying the NumPy var and std functions.
# the mean of all the sample means
sample_means = samps.mean(axis=1)
# samp.sum(axis=1) / sample_size
#print(sample_means)
# the mean of the sample means?
print(f"The mean of the sample means is {sample_means.mean()}")
print(f"The variance of the sample means is {sample_means.var()}")
print(f"The variance of the sample means is {sample_means.var(ddof=1)}")
print(f"The standard deviation of the sample means is {sample_means.std()}")
print(f"The standard deviation of the sample means is {sample_means.std(ddof=1)}")
# an array containing all the means of the samples
samps.mean(axis=1)
# manually calculating the mean of each sample using the formula
mean = samps.sum(axis=1) / sample_size
# array of the means of the samples
means = samps.mean(axis=1)
# array of the variances (biased using n)
vars_b = samps.var(axis=1)
# array of the variances (unbiased using n-1)
vars_ub = samps.var(axis=1, ddof=1)
# array of the standard deviations using biased (n)
stds_b = samps.std(axis=1)
# array of the sample standard deviations using unbiased (n-1)
stds_ub = samps.std(axis=1, ddof=1)
Here the sample variances and standard deviations are calculated manually using the formulas noted above.
# newaxis converts the array into column vector
diff_from_mean = (samps - mean[:, np.newaxis])**2
# Calculate the biased variance for each of the samples using n
variances_b = diff_from_mean.sum(axis=1) / (sample_size)
# Calculate the unbiased variance for each of the samples using n-1
variances_ub = diff_from_mean.sum(axis=1) / (sample_size-1)
# Calculate the biased standard deviations for each of the samples using n
standard_deviations_b = np.sqrt(variances_b)
# Calculate the unbiased standard deviations for each of the samples using n-1
standard_deviations_ub = np.sqrt(variances_ub)
Comparing the manual calculations with NumPy's var and std. (To run this cell, highlight the cell and press y.)
Using the formula for the biased variance gives a mean variance that is exactly the same as using the numpy var() function. Adjusting the variance formula to divide the squared differences from the mean by $n-1$ gives a mean variance that is very close to the true variance of the population.
Multiplying the uncorrected sample variance by the factor $\frac{n}{n-1}$ gives an unbiased estimator of the population variance.
# using the var() formula
print(f"The mean of the variances using np.vars() is {vars_b.mean()}")
# the mean of the biased variance formula
print(f"The mean of the biased variance (variance calculated using n) is {variances_b.mean()}")
# mean of the unbiased formula
print(f"The mean of the unbiased variances (variance calculate using n-1) is {variances_ub.mean()}")
# the actual population variance (if my sample of 10000 points is taken as the full population for this demo)
print(f"The actual population variance is {samps.var()}")
# using Bessel correction by multiplying the biased sample variance by n/n-1
print(f"Using Bessel's correction gives {variances_b.mean() * (sample_size / (sample_size - 1.0))}")
print(f"The difference between the population variance and the means of the biased variances is {samps.var() - variances_b.mean()}")
print(f"The difference between the population variance and the means of the unbiased variances is {samps.var() - variances_ub.mean()}")
# using the std() formula
print(f"The mean of the standard deviations using np.std() is {stds_b.mean()}")
# the mean of the biased standard deviation formula (calculated manually using n)
print(f"The mean of the biased standard deviations (calculated using n) is {standard_deviations_b.mean()}")
# mean of the unbiased formula (calculated manually using n-1)
print(f"The mean of the unbiased standard deviations (using n-1) is {standard_deviations_ub.mean()}")
# the actual population standard deviation (if my sample of 10000 points is taken as the full population for this demo)
print(f"The actual population standard deviation is {samps.std()}")
print(f"The difference between the population standard deviation and the means of the biased standard deviations is {samps.std() - stds_b.mean()}")
print(f"The difference between the population standard deviation and the means of the unbiased standard deviation is {samps.std() - stds_ub.mean()}")
The above comparisons show that using the unbiased variances and unbiased standard deviations on the samples drawn from the population gives a closer estimate of the true population parameters than the biased formulas.
If the sample size for each sample drawn from the population is made smaller, the difference between the population standard deviation and the mean of the sample standard deviations is more notable. Note that in the above I used a large sample drawn from the standard normal distribution as my population. The results could also be compared with the parameters of the standard normal distribution itself, which has a variance and standard deviation of 1.
Below I will plot the sample statistics from each of the samples and compare them to the population parameters. (This is something I saw on a Khan Academy video, but I don't know what application was actually used for plotting there; it didn't appear to be Python.) The sample mean and sample standard deviation for each of the samples are calculated and plotted on the graph. A vertical line represents the actual population mean while a horizontal line represents the population standard deviation. Khan Academy [22]
x = sample_means
y = stds_b
z = stds_ub
a = stds_b.mean()
# mean of the unbiased formula
b = stds_ub.mean()
%matplotlib inline
# set up the subplots and figure sizes
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.scatterplot(x =x,y=y, hue=y, ax=axes[0])
sns.scatterplot(x =x,y=z, hue=y, ax=axes[1])
axes[0].axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
axes[1].axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
axes[0].axhline(samps.std(), color='red', linewidth=2, linestyle=":")
axes[0].axhline(stds_b.mean(), color='green', linewidth=2)
axes[1].axhline(samps.std(), color='red', linewidth=3, linestyle=":")
axes[1].axhline(stds_ub.mean(), color='cyan', linewidth=2)
axes[0].set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
axes[1].set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
axes[0].set_title("Using biased standard deviations")
axes[1].set_title("Using unbiased standard deviations")
plt.show()
The Khan Academy video [22] also looked at the size of the samples. With a smaller sample size, you are much more likely to get a sample mean that is a bad estimate for the population mean and you are more likely to significantly underestimate the sample variance. Their demo showed that the cases where the sample variance is significantly underestimated are also disproportionately the cases where the means of those samples are far from the true population mean. For example, if you took many small samples of size 2 from the population, over many trials the mean of the biased sample variances approaches half of the true population variance. With a sample size of 3 it approaches two thirds of the true population variance, and for a sample size of 4 it approaches $\frac{3}{4}$ of the true population variance. This illustrates that when using the biased estimate, instead of approaching the population variance we approach $\frac{n-1}{n}$ times the population variance. Therefore, to get an unbiased estimate of the true population variance, multiply the biased estimate by $\frac{n}{n-1}$: $$\frac{(n-1)\sigma^2}{n}\cdot\frac{n}{n-1}=\sigma^2$$
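A small simulation along the same lines (my own sketch, not the Khan Academy demo) suggests the same pattern: the mean of the biased sample variances approaches $\frac{n-1}{n}$ times the population variance as the number of trials grows.
import numpy as np
rng = np.random.default_rng(0)
population_variance = 1.0  # standard normal population
for n in [2, 3, 4, 5]:
    samples = rng.standard_normal((100000, n))   # 100,000 samples of size n
    biased_vars = samples.var(axis=1)            # biased variance: divides by n
    ratio = biased_vars.mean() / population_variance
    print(f"sample size {n}: ratio ≈ {ratio:.3f}, expected (n-1)/n = {(n-1)/n:.3f}")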
Below I have plotted the means of the sample standard deviations for various sample sizes between 2 and 6. The plots on the left used n in the denominator while the plots on the right used n-1.
The green horizontal line on the plots on the left (representing the mean of the biased standard deviations) moves further away from the population standard deviation as the sample size gets smaller. When the sample size is only two, the green line is just over 0.5. For a sample size of 3, the mean of the biased standard deviations is above 0.6, and for a sample size of 4 it is approximately 0.8. Meanwhile the line representing the mean of the standard deviations calculated using n-1 does not vary as much.
The plots above show that the sample means are clustered around the population mean of zero, which is shown by the vertical blue dashed line. Some sample means are below, some are above and some are quite a bit away from the population mean; this depends on the actual values in each sample. Similarly with the standard deviations: the dashed red horizontal line represents the population standard deviation of 1, and while the sample standard deviations are clustered around it, some are above, some below and some more extreme than others. However, when the means of the sample standard deviations are plotted you can see that the mean of the unbiased standard deviations is closer to the actual standard deviation of the population than the mean of the biased sample standard deviations.
def plotx(s):
    sample_size = s
    no_samples = 1000
    # size here is a tuple (no_samples, sample_size), giving no_samples samples of sample_size points
    samps = np.random.standard_normal((no_samples, sample_size))
    # array of the means of the samples
    sample_means = samps.mean(axis=1)
    # array of the variances (biased, using n)
    vars_b = samps.var(axis=1)
    # array of the variances (unbiased, using n-1)
    vars_ub = samps.var(axis=1, ddof=1)
    # array of the standard deviations (biased, using n)
    stds_b = samps.std(axis=1)
    # array of the sample standard deviations (unbiased, using n-1)
    stds_ub = samps.std(axis=1, ddof=1)
    x = sample_means
    y = stds_b
    z = stds_ub
    # set up the subplots and figure sizes
    f, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.scatterplot(x=x, y=y, hue=y, ax=axes[0])
    sns.scatterplot(x=x, y=z, hue=z, ax=axes[1])
    axes[0].axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
    axes[1].axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
    axes[0].axhline(samps.std(), color='red', linewidth=2, linestyle=":")
    axes[0].axhline(stds_b.mean(), color='green', linewidth=2)
    axes[1].axhline(samps.std(), color='red', linewidth=3, linestyle=":")
    axes[1].axhline(stds_ub.mean(), color='cyan', linewidth=2)
    axes[0].set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
    axes[1].set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
    axes[0].set_title("Using biased standard deviations")
    axes[1].set_title("Using unbiased standard deviations")
    plt.show()

for i in range(2, 8):
    plotx(i)
If a sample is taken from this population, the sample mean could be very close to the population mean or it might not be, depending on which data points are actually in the sample and whether the sample points came from one end or the other of the underlying population. The mean of the sample will always fall within the range of the sample values, but the true population mean could lie outside of it. For this reason you are more likely to underestimate the true variance. The standard deviation is the square root of the variance. Dividing by n-1 instead of n, as shown above, gives a slightly larger sample variance, which is an unbiased estimate.
Use scikit-learn to apply k-means clustering to Fisher’s famous Iris data set. You will easily obtain a copy of the data set online. Explain in a Markdown cell how your code works and how accurate it might be, and then explain how your model could be used to make predictions of species of iris.
(I am not going to go into the details of the Iris dataset here and will just focus on the k-means clustering. The dataset was covered in another project, which I can refer to if needs be.)
The dataset is available from the UCI Machine Learning repository[23]
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
It consists of 150 instances with five attributes, with four of these being the measurements of the sepal and petals of each observation in the data set and the fifth being the class or species of Iris that each observation belongs to. It includes 50 plants each of three classes of Iris plant, where each class is a different type or species of Iris plant.
Clustering is an unsupervised learning method that is used to group sets of objects based on similar characteristics and to find meaningful patterns in the data. It is closely related to classification and can be used to find and classify groups of similar points in a dataset. Cluster analysis involves looking for groups of similar observations in data. The aim is to find clusters where the differences between observations within a cluster is much smaller than the distance between clusters.
In Unsupervised learning, the training data usually consists of a set of inputs without any corresponding target values or labels. One of the goals of unsupervised learning is to discover groups or clusters of similar examples within the data. It is also used to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
There are various clustering algorithms, one of which is the k-means algorithm. K-means is an iterative process. You begin by picking the number of clusters to use (k) and randomly generating k centre points. The cluster centres, referred to as centroids, are initialised with these k random values and the first set of k clusters is created from them: every point in the dataset is assigned to the cluster of its nearest centroid. The mean of all the points assigned to each cluster is then calculated and used to update the centroid.
The values for the cluster centres are then updated and the points are relabelled according to the centre each is closest to. This dual process of assigning points and updating centres is repeated until the algorithm converges.
Convergence happens when no more points move between clusters from one iteration to the next. Once this occurs, the estimates for the cluster centres no longer change and no more points are reallocated from one cluster to another.
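To make this assign-and-update loop concrete, here is a minimal from-scratch sketch in NumPy on made-up two-dimensional data (this is only an illustration of the idea, not the scikit-learn implementation used later):
import numpy as np
rng = np.random.default_rng(42)
# toy data: two well-separated blobs of points in 2D
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
k = 2
# initialise the centroids by picking k random points from the data
centroids = X_toy[rng.choice(len(X_toy), k, replace=False)]
for _ in range(100):
    # assignment step: label each point with the index of its nearest centroid
    distances = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: move each centroid to the mean of the points assigned to it
    new_centroids = np.array([X_toy[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):   # converged: no centroid moved
        break
    centroids = new_centroids
print(centroids)   # should end up close to the two blob centres, around (0, 0) and (3, 3)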
The scikit-learn cluster tutorial[24] explains how the k-means algorithm clusters data by trying to separate samples in $n$ groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.
The k-means algorithm divides a set of $N$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_j$ of the samples in the cluster. The means are commonly called the cluster “centroids” but they are not in general points from $X$ although they live in the same space.
The K-means algorithm aims to choose centroids that minimise the inertia or within-cluster sum-of-squares criterion:
$$\sum_{i=0}^{n}{\min\limits_{\mu_j \in C}(||x_i - \mu_j||^2)}$$
The $||.||^2$ refers to the squared Euclidean distance or "sum of squares". In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. [25] The SSE is the sum of squared errors, calculated after the centroids converge or match the previous iteration's assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. The objective of k-means is to minimize this value, which is a measure of error.
Inertia is a measure of how internally coherent clusters are.
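As a hedged sketch of what the inertia is (made-up data, with scikit-learn used only to fit the clusters), the inertia_ attribute should agree with the sum of squared distances computed manually:
import numpy as np
from sklearn.cluster import KMeans
rng = np.random.default_rng(0)
# toy data: three blobs of points in 2D
X_toy = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 2, 4)])
km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)
# inertia computed manually: sum of squared distances of each point to its assigned centroid
sse = ((X_toy - km_toy.cluster_centers_[km_toy.labels_]) ** 2).sum()
print(sse, km_toy.inertia_)   # the two values should agree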
The initial centroids are chosen randomly and therefore the algorithm can give different results every time it is run, but the differences should be very small. The k-means algorithm should be run several times to find the best clusters. sklearn.cluster.KMeans takes care of this for us: by default it runs the k-means algorithm 10 times with different centroid seeds and the final result is the best output of these consecutive runs in terms of inertia. The default can be changed using the n_init parameter.
The number of clusters to use must be decided. While we know the number of classes for the Iris dataset, in reality the number of clusters may not be known. A scatter plot of the data will give a good indication of the clusters. The metric used for k-means is the within-cluster sum of squares (inertia). This is plotted against the number of clusters k to give an elbow curve, and the optimum number of clusters is where the elbow of the curve occurs. I will show this below. You must also be able to assess the results of the clustering to see how well it is working, as the actual clusters are not usually known.
Cluster quality can be evaluated by computing the average silhouette score or the within-cluster sum-of-squares. K-means works best on data which can be represented as distinct, spherical, dense groups of points; in "fancy words", data that can be modelled using a Gaussian distribution. It is also assumed that the clusters are roughly the same size. [26]
The sklearn KMeans documentation includes an example of k-means clustering on the Iris dataset, which is built into the Scikit-learn package.
For this task I will download the csv file as this will be more realistic. Some preprocessing will be required to get it into the format required for using k-means. In unsupervised learning the data is generally not split into training and test sets.
The Iris dataset is commonly available in csv format containing 150 rows with 5 columns. The first 4 columns contain measurements for the 4 features while the 5th column contains the Iris class or label that each observation belongs to. To apply k-means, the data needs to be an array-like or sparse matrix with a shape of 150 (n_samples) by 4 (n_features).
X{array-like, sparse matrix} of shape (n_samples, n_features)
# first importing the following libraries
# Numerical arrays.
import numpy as np
import pandas as pd
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import csv
# Machine learning - KMeans.
import sklearn.cluster as skcl
# download the data
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# using the attribute information as the column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
# read into a pandas dataframe
iris = pd.read_csv(csv_url, names = col_names)
print(iris.head())
print(iris.tail())
print(iris.dtypes)
#iris.describe()
sns.set_palette("colorblind")
y = iris['Class']
y
First I will plot the data using Seaborn pairplot to visualise the relationship between the 4 features. The dataset contains 50 samples from each of the three species of Iris: Iris Setosa, Iris Virginica and Iris Versicolor. Each sample has 4 features which are the length and width measurements of the petals and sepals of each sample. The points are coloured by the Iris species class that they belong to. As the pairplot shows, the 3 classes of Iris are clearly identifiable when the petal length is plotted against the petal width. The Setosa class is nearly always identifiable by each of the 4 features. The most difficult to identify correctly would be the Versicolor and Virginica based on the sepal length and sepal width features. When sepal length is plotted against petal width, nearly all of the 150 observations can be correctly identified apart from a few Iris Virginicas which are similar to measurements for the Versicolor class.
sns.pairplot(iris, hue = "Class")
plt.show()
K-means is a distance-based algorithm and therefore differences in the magnitude of values can create a problem. Looking at the Iris data, there is not a huge variation in the magnitude of the values; the measurements are all in centimetres. Although all of the measurements are in single digits, there is still quite a bit of variation between the ranges of values. Petal width measurements range from 0.1 to 2.5cm, petal length from 1 to 6.9cm, sepal width from 2 to 4.4cm and sepal length from 4.3 to 7.9cm. The greatest differences in magnitude are between sepal length and petal width.
Feature scaling is the process of transforming numerical features to use the same scale. The scikit-learn preprocessing module can be used to bring all the variables to the same magnitude.
It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.[27]
The sklearn.preprocessing package can be used to standardise the dataset.
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.[28]
The Iris Class is a categorical variable with 3 possible values representing the species of Iris: Iris Setosa, Iris Versicolor and Iris Virginica. The sklearn.preprocessing package also has functionality for converting these values into integer codes.
The preprocessing.LabelEncoder function encodes target labels with value between 0 and n_classes-1. (This transformer should be used to encode target values, i.e. y, and not the input X.)
# the feature columns of the dataset
X = iris.iloc[:, 0:4]
X.describe()
The StandardScaler class and the scale function are equivalent. StandardScaler standardises features by removing the mean and scaling to unit variance. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation are then stored to be used on later data using transform.
Scale the data so that it has zero mean and unit variance:
# from sklearn import preprocessing
import sklearn.preprocessing as pre
# scaled the data
X_scaled = pre.scale(X)
X_scaled
# the mean should be 0
print(X_scaled.mean(axis=0))
# the standard deviations should be 1
print(X_scaled.std(axis=0))
X_scaled.std(axis=0)
X_scaled;
scaler = pre.StandardScaler().fit(X)
scaler
scaler.mean_
scaler.scale_
#X_scaled =
scaler.transform(X) == X_scaled;
Using the sklearn.preprocessing.LabelEncoder class to encode the categorical labels as integers 0, 1 and 2.
# get the class labels from the dataset (Versicolor, Setosa and Virginica)
# https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected
y = np.array(iris[['Class']]).ravel()
print(y.shape)
# set the label encoder
le = pre.LabelEncoder()
# fit the label encoder
le.fit(y)
# Fit label encoder and return encoded labels
yencoded=le.transform(y)
yencoded
#Transform labels back to original encoding.
le.inverse_transform(yencoded)
For this section I referred to several blog articles referenced below. [28-30]
There are two complementary methods that can be used to evaluate the appropriate number of clusters to use: the elbow method and the silhouette coefficient. To perform the elbow method you run the k-means algorithm several times, incrementing the number of clusters k with each iteration, and record the SSE for each iteration.
An elbow curve can help decide what the appropriate number of clusters to use is. The x-axis represents the number of clusters and the y-axis is an evaluation metric such as inertia, which is usually the within-cluster sum of squares. The model is trained using 1 or 2 clusters first, the inertia (within sum of squares) is calculated for that model and then plotted. Next the number of clusters is increased by 1, the model is trained again and the inertia value plotted.
There is a clear drop in inertia when the number of clusters goes from 1 to 2, a less sharp decrease as we go from 2 to 3, a small decrease from 3 to 4 clusters and a very small decrease from 4 to 5 clusters. After 5 clusters the decrease in inertia becomes roughly constant. The point where the inertia value starts to decrease at a constant, small rate can be chosen as the correct number of clusters for the data, because beyond that point adding more clusters no longer reduces the within-cluster sum of squares by much.
The computation cost must also be considered though. Increasing the number of clusters increases the computing cost.
The KMeans function has some parameters that can be set, including:
- n_init is the number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. The default is 10.
- max_iter is the maximum number of iterations of the k-means algorithm for a single run. The default is 300.
# fitting multiple k-means algorithms and storing the values in an empty list
Inertia = []
nclust = 8
# perform kmeans on various number of clusters and calculating the inertia
for cluster in range(1,nclust):
    kmeans = skcl.KMeans(n_clusters = cluster, init='k-means++', random_state = 0)
    kmeans.fit(X_scaled)
    Inertia.append(kmeans.inertia_)
# converting the results into a dataframe and plotting them
res = pd.DataFrame({'Cluster':range(1,nclust), 'Inertia':Inertia})
plt.figure(figsize=(12,6))
plt.plot(res['Cluster'], res['Inertia'], marker='o')
plt.title("Selecting the optimum number of clusters to use")
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters and thus provides a way to assess parameters like the number of clusters visually. This measure has a range of [-1, 1]. [32]
Larger values indicate that the samples are closer to their cluster than to the other clusters. The silhouette coefficient is a measure of cluster cohesion and separation which quantifies how well a data point fits into its assigned cluster based on how close the data point is to other points in the same cluster, and how far away the data point is from points in the other clusters.[27]
The average silhouette coefficient is summarised into one score using sklearn.metrics.silhouette_score.
Rukshan Pramoditha[33] has a very good tutorial on K-Means Clustering with Scikit-learn on Medium.com.
Here the silhouette_score function is imported from the sklearn.metrics module and applied to a varying number of clusters, in a similar manner to the elbow curve earlier.
# fitting multiple k-means algorithms and storing the values in an empty list
from sklearn.metrics import silhouette_score
silhouette_coefficients = []
nclust = 8
# apply kmeans using various numbers of cluster and calculate the silhouette score.
for cluster in range(2,nclust):
    Skmeans = skcl.KMeans(n_clusters = cluster, init='k-means++', random_state = 0)
    Skmeans.fit(X_scaled)
    score = silhouette_score(X_scaled, Skmeans.labels_)
    print(f"Silhouette score for {cluster} clusters is {score}")
    silhouette_coefficients.append(score)
# converting the results into a dataframe and plotting them
res = pd.DataFrame({'Cluster':range(2,nclust), 'Silhouette':silhouette_coefficients})
plt.figure(figsize=(12,6))
plt.plot(res['Cluster'], res['Silhouette'], marker='o')
plt.title("Sihouette Score to chose number of Clusters")
plt.xlabel= ('Number of clusters k')
plt.ylabel=('Silhouette Score')
plt.show()
I used both the elbow curve and the silhouette score on the standardised Iris dataset, and they gave different results. The elbow curve implies that the appropriate number of clusters is 3 while the silhouette scores suggest only two clusters.
From the various blog posts I saw, there is another library which can be used to very easily plot the silhouette diagram. This is the yellowbrick.cluster library, which extends the Scikit-Learn API to make model selection and hyperparameter tuning easier; it is used further below.
Now once we have decided on the number of clusters to use, the next step is to perform the k-means clustering on the data.
Several parameters can be set for the k-means algorithm in the KMeans estimator class of scikit-learn. They are all outlined in the sklearn KMeans documentation but here are a few.
- n_clusters, which defaults to 8.
- init is the initialisation method. If it is set to random it will choose n_clusters observations (rows) at random from the data for the initial centroids. Setting it to 'k-means++' selects initial cluster centres for k-means clustering in a smart way to speed up convergence.
- n_init sets the number of initialisations to perform, i.e. the number of times the k-means algorithm will be run with different centroid seeds. The default is to perform 10 runs. The final results will be the best output of n_init consecutive runs in terms of inertia.
- max_iter sets the maximum number of iterations for each initialisation of the k-means algorithm. The default is 300 iterations of k-means for a single run.
The KMeans class is instantiated and then fit to the scaled data. Here I explicitly set the initialisation to k-means++ rather than random; k-means++ is also the scikit-learn default.
# Perform kmeans fitting - this instantiates the KMeans class with these arguments.
kmeans = skcl.KMeans(n_clusters=3, random_state=0, init='k-means++').fit(X_scaled)
After calling the .fit() method we can get the statistics from the initialisation run with the lowest SSE as attributes of kmeans.
kmeans.inertia_
The coordinates of the cluster centres can be found using the cluster_centers_ attribute. It returns an array whose shape depends on the number of clusters used and the number of features; in this case we have 3 clusters and 4 features. For example, the first sub-array represents the coordinates of the centre of the first cluster (label 0). The final locations of the centroids are returned.
# get the cluster centres
centres = kmeans.cluster_centers_
centres
kmeans.n_iter_
The fit method computes the k-means clustering and returns the fitted estimator. The k-means clustering algorithm assigns each observation to a cluster. The labels can be accessed using the labels_ attribute on the kmeans object.
For the Iris dataset we already know that there are 3 classes or species of Iris: the Iris-setosa, the Iris-versicolor and the Iris-virginica. The first 50 observations in the dataset are classified as the Iris-setosa, the next 50 observations are classified as Iris-versicolor and the final 50 are classified as Iris-virginica. These were encoded earlier as labels 0, 1 and 2 respectively. The actual Iris Class / Species labels were not given to the k-means algorithm. The order of the labels assigned to the clusters might not match the actual labels (the encoded labels given earlier). For example the first bunch of labels assigned below are 1 even though the first groups of rows in the dataset were assigned 0 as the labels when encoded from Iris-setosa to a number. This is ok as the ordering of the cluster labels depends on the initialisation and the label for the cluster could change between runs.
kmeans.labels_
# as a reminder of the columns of the dataset
iris.columns.values
Here I am assigning the predicted labels to each observation in the (scaled) dataset. This will assign one of the 3 labels 0, 1 or 2 to each observation. The predicted labels can then be used to subset the dataset to pull out the rows of measurement values or subsetted further to pull out a single measurement variable such as Sepal_Length or Sepal_Width.
The predict method can then be called on the results of fit to compute cluster centres and predict the cluster index for each sample; it is used to predict the closest cluster each observation belongs to. It returns an ndarray of labels containing the index of the cluster each sample belongs to. The result of calling predict on the input data is the predicted labels for the clusters each observation might belong to.
# get the predicted labels for each sample point and store in y_labels
y_labels = kmeans.fit_predict(X_scaled)
# the predicted y labels (species)
y_labels
#kmeans.labels_ ==y_labels
# this gets the datapoints from the dataset that were assigned to each label
X_scaled[kmeans.labels_ == 0] #
# this will pull out the first column Sepal_Length for observations as label 0
X_scaled[kmeans.labels_ == 0][:,0]
# Sepal_Width column (index 1) for observations predicted as label 0
X_scaled[kmeans.labels_ == 0][:,1]
# pull out the sepal length measurements for the observations predicted as label 0.
#X_scaled[kmeans.labels_ == 0][:,0]
X_scaled[y_labels == 0, 0]
Next I will visualise the clusters that the observations have been assigned to using scatter plots. A scatter plot shows 2 variables plotted against each other, and there are 4 features in this Iris dataset ('Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width').
Note that the cluster labels 0, 1 and 2 may change when the clustering is rerun although the clusters themselves will not. Therefore the colours on the plot might not stay the same.
I placed the cluster plots beside the plots of the actual species to be able to compare the clusters found using k-means to the actual groups by Iris species. The original Iris dataset is plotted with its original scale on the axes, whereas the cluster plots use the scaled data, but the overall patterns remain the same.
The scatterplot with the datapoints coloured by their actual species shows that for most data points the Setosa (red) class could be clearly separated from the other two classes by a straight line. The separation between the other two species is not as clear though.
# the coordinates of the cluster centres
centres = kmeans.cluster_centers_
centres
# cluster centre for sepal length
kmeans.cluster_centers_[:, 0]
#kmeans.cluster_centers_[:,1]
# pulling out the different species into separate dataframes for plotting
iris_setosa = iris[iris['Class'] == "Iris-setosa"]
iris_versicolor = iris[iris['Class'] == "Iris-versicolor"]
iris_virginica = iris[iris['Class'] == "Iris-virginica"]
# Plot size.
plt.rcParams['figure.figsize'] = (14, 6)
plt.figure()
# subplot for the actual classes of species, set the colour, size etc
plt.subplot(121)
plt.scatter(iris_setosa['Sepal_Length'], iris_setosa['Sepal_Width'], s = 15, c = 'red', label = 'Setosa')
plt.scatter(iris_versicolor['Sepal_Length'], iris_versicolor['Sepal_Width'], s = 15, c = 'blue', label = 'Versicolor')
plt.scatter(iris_virginica['Sepal_Length'], iris_virginica['Sepal_Width'], s = 15, c = 'green', label = 'Virginica')
# turning off the axes scales
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.subplot(122)
# subplot for the clusters found with the final cluster centre marked
plt.scatter(X_scaled[y_labels == 0, 0], X_scaled[y_labels == 0, 1], s = 15, c = 'red')
plt.scatter(X_scaled[y_labels == 1, 0], X_scaled[y_labels == 1, 1], s = 15, c = 'blue')
plt.scatter(X_scaled[y_labels == 2, 0], X_scaled[y_labels == 2, 1], s = 15, c = 'green')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 40, c = 'orange', label = 'Centroids')
#plt.axis('off')
plt.suptitle("Comparison of Iris Species and Clusters for Sepal Length vs Sepal Width")
# https://stackoverflow.com/questions/37039685/hide-axis-values-but-keep-axis-tick-labels-in-matplotlib
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.show()
# Plot size.
plt.rcParams['figure.figsize'] = (14, 6)
plt.figure()
# subplot for the actual classes of species, set the colour, size etc
plt.subplot(121)
plt.scatter(iris_setosa['Sepal_Length'], iris_setosa['Petal_Length'], s = 15, c = 'r', label = 'Setosa')
plt.scatter(iris_versicolor['Sepal_Length'], iris_versicolor['Petal_Length'], s = 15, c = 'b', label = 'Versicolor')
plt.scatter(iris_virginica['Sepal_Length'], iris_virginica['Petal_Length'], s = 15, c = 'g', label = 'Virginica')
# turning off the axes scales
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.subplot(122)
# subplot for the clusters found with the final cluster centre marked
plt.scatter(X_scaled[y_labels == 0, 0], X_scaled[y_labels == 0, 2], s = 15, c = 'red')
plt.scatter(X_scaled[y_labels == 1, 0], X_scaled[y_labels == 1, 2], s = 15, c = 'blue')
plt.scatter(X_scaled[y_labels == 2, 0], X_scaled[y_labels == 2, 2], s = 15, c = 'green')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,2], s = 40, c = 'orange',label = 'Centroids')
plt.suptitle("Comparison of Iris Species and Clusters for Sepal Length vs Petal_Length")
# https://stackoverflow.com/questions/37039685/hide-axis-values-but-keep-axis-tick-labels-in-matplotlib
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.show()
# Plot size.
plt.rcParams['figure.figsize'] = (14, 6)
plt.figure()
plt.subplot(121)
plt.scatter(iris_setosa['Sepal_Width'], iris_setosa['Petal_Width'], s = 15, c = 'r', label = 'Setosa')
plt.scatter(iris_versicolor['Sepal_Width'], iris_versicolor['Petal_Width'], s = 15, c = 'b', label = 'Versicolour')
plt.scatter(iris_virginica['Sepal_Width'], iris_virginica['Petal_Width'], s = 15, c = 'g', label = 'Virginica')
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.subplot(122)
plt.scatter(X_scaled[y_labels == 0, 1], X_scaled[y_labels == 0, 3], s = 15, c = 'red')
plt.scatter(X_scaled[y_labels == 1, 1], X_scaled[y_labels == 1, 3], s = 15, c = 'blue')
plt.scatter(X_scaled[y_labels == 2, 1], X_scaled[y_labels == 2, 3], s = 15, c = 'green')
plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:,3], s = 30, c = 'orange', label = 'Centroids')
plt.suptitle("Comparison of Iris Species and Clusters for Sepal Width vs Petal Width")
# https://stackoverflow.com/questions/37039685/hide-axis-values-but-keep-axis-tick-labels-in-matplotlib
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.show()
Looking at the scatterplots of the points for the Iris dataset on the left, for most points the Setosa class can be clearly separated from the other two classes by a straight line. The separation between the other two species, Versicolor and Virginica, is not as clear though. The scatter plots showing the clusters on the right show that all the Setosas appear to have been correctly assigned to the same cluster, but there are a number of Versicolors and Virginicas which have been assigned to the wrong cluster.
# Plot size.
plt.rcParams['figure.figsize'] = (14, 6)
plt.figure()
plt.subplot(121)
plt.scatter(iris_setosa['Sepal_Width'], iris_setosa['Petal_Length'], s = 15, c = 'r', label = 'Setosa')
plt.scatter(iris_versicolor['Sepal_Width'], iris_versicolor['Petal_Length'], s = 15, c = 'b', label = 'Versicolour')
plt.scatter(iris_virginica['Sepal_Width'], iris_virginica['Petal_Length'], s = 15, c = 'g', label = 'Virginica')
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.subplot(122)
plt.scatter(X_scaled[y_labels == 0, 1], X_scaled[y_labels == 0, 2], s = 15, c = 'red')
plt.scatter(X_scaled[y_labels == 1, 1], X_scaled[y_labels == 1, 2], s = 15, c = 'blue')
plt.scatter(X_scaled[y_labels == 2, 1], X_scaled[y_labels == 2, 2], s = 15, c = 'green')
plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:,2], s = 30, c = 'yellow', label = 'Centroids')
#plt.axis('off')
plt.suptitle("Comparison of Iris Species and Clusters for Sepal Width vs Petal Length")
# https://stackoverflow.com/questions/37039685/hide-axis-values-but-keep-axis-tick-labels-in-matplotlib
plt.gca().axes.get_yaxis().set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.show()
I have explained above how the code works. The plots above give an indication of how accurate the clustering is as the clusters found can be compared with the actual species. I will comment further on accuracy below. Selecting the correct number of clusters is important in determining how accurate the results are.
In Unsupervised learning, the training data usually consists of a set of inputs without any corresponding target values or labels. One of the goals of unsupervised learning is to discover groups or clusters of similar examples within the data so it is very useful in exploratory analysis because it can automatically identify structure in data. Unsupervised learning such as clustering, differs from supervised learning as while we have the actual truth to compare the results to with supervised learning, we do not with unsupervised learning.
Therefore, the goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.[35]
The analysis above divided the data into clusters. If we were to try to predict the cluster for a new datapoint, its values would have to be scaled in the same way. Here I pull out some observations from the Iris dataset, scale them and then see what cluster they are predicted to belong to. The first 50 rows of the dataset are Setosas, the second 50 are Versicolors and the final 50 are Virginicas.
# selecting an observation from the dataset, scaling it and predicting cluster using k-means model
obs5 = iris.iloc[5, 0:4].values # Setosa
kmeans.predict(scaler.transform(obs5.reshape(1, -1))) # label 1
# selecting an observation from the dataset, scaling it and predicting cluster using k-means model
obs55 = iris.iloc[55, 0:4].values # Versicolor
kmeans.predict(scaler.transform(obs55.reshape(1,-1))) # label 2
obs140 = iris.iloc[140, 0:4].values # Virginica
kmeans.predict(scaler.transform(obs140.reshape(1,-1))) # label 2
obs20 = iris.iloc[20, 0:4].values # Setosa
kmeans.predict(scaler.transform(obs20.reshape(1,-1))) # label 1
obs120 = iris.iloc[120, 0:4].values # Virginica
kmeans.predict(scaler.transform(obs120.reshape(1,-1))) # label 0
It gets the above predictions correct but gets this one, observation 70, wrong.
obs70 = iris.iloc[70, 0:4].values # Versicolor
kmeans.predict(scaler.transform(obs70.reshape(1,-1))) # label 2
print(f"Kmeans score is {kmeans.score(X_scaled)}")
print(f"The inertia using 3 clusters is {kmeans.inertia_}")
print(f"The number of iterations was {kmeans.n_iter_}.")
X_scaled;
Clustering is an unsupervised learning method and therefore the data was not split into a training and test set. To evaluate performance of the model we can use the visualisation techniques above. With the Iris data set we have the actual labels and therefore can compare the actual groups to the clusters determined by the k-means algorithm. However this is not usually the case for clustering. It is mostly used when the clusters aren't known in advance to find patterns in the data that we might not know about already. The plots above showing the clusters as determined by the k-means clustering can be compared with the actual class species labels.
One of the goals of unsupervised learning is to discover groups or clusters of similar examples within the data. Unsupervised learning tends to be more subjective than supervised learning methods as there is no simple goal for the analysis, such as prediction of a response. Unsupervised learning is often performed as part of an exploratory data analysis. While it is possible to check results from supervised learning and to see how well a model predicts the response variable on observations not included in fitting the model, this is not the case in unsupervised learning and it is not so clear how to assess the results obtained compared to supervised learning.
One of the main challenges in unsupervised learning is to evaluate model performance. An inappropriate choice for k in k-means clustering can result in poor clustering performance. There are some standard methods to determine the optimal number of clusters k. These methods include both graphical and numeric techniques.[33]
The elbow and silhouette coefficients evaluate the performance of the clustering without the actual real labels, which in the case of the Iris dataset are the species that each of the 150 observations belongs to. With the Iris dataset you could compare every predicted label to the actual label to see how accurate the model is, although in reality when doing clustering you do not always know the actual labels. Below I will compare the actual Iris species labels to the cluster labels to see how well the k-means performed on the dataset.
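As a quick hedged sketch of such a comparison (assuming the fitted kmeans and the encoded labels yencoded from earlier are available), each cluster can be mapped to the most common true species within it and a simple accuracy computed:
import numpy as np
# map each cluster label to the most common encoded species within that cluster
mapped = np.empty_like(kmeans.labels_)
for cluster_label in np.unique(kmeans.labels_):
    in_cluster = kmeans.labels_ == cluster_label
    mapped[in_cluster] = np.bincount(yencoded[in_cluster]).argmax()
# proportion of observations whose cluster maps to their actual species
accuracy = (mapped == yencoded).mean()
print(f"Proportion of observations in the 'correct' cluster: {accuracy:.2f}")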
Here I refer to an article [36] on using the silhouette score to evaluate the quality of the clusters created using the k-means algorithm. The silhouette score is calculated for each sample data point by finding two distances: a, the mean distance between the sample and the other data points within its own cluster (the intra-cluster distance), and b, the mean distance between the sample and the points in the next nearest cluster (the inter-cluster distance).
The silhouette score S for each sample is then calculated as $S = \frac{b - a}{\max(a, b)}$. The values range from -1 to 1. A score of 1 means that the cluster is dense and well separated from other clusters, while a value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighbouring clusters. A negative score indicates that the samples might have been assigned to the wrong clusters.
The sklearn.metrics package has the following functions for evaluating silhouette scores.
The silhouette_score function measures the mean of the silhouette coefficient over all samples. It shows how dense and well separated the clusters are, taking into account the intra-cluster distance a between the sample and the other data points within the same cluster and the inter-cluster distance b between the sample and the next nearest cluster.
The silhouette_samples function provides the silhouette scores for each individual sample in the different clusters.
Silhouette plots can be used to select the optimal number of clusters K in k-means clustering. They are interpreted by looking at silhouette scores below the average silhouette score, wide variations in the size of the clusters and also the thickness of the silhouette plot.
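As a small hedged sketch (assuming the fitted kmeans and X_scaled from above), silhouette_samples can be used to look at the average score within each cluster directly:
import numpy as np
from sklearn.metrics import silhouette_samples
# per-sample silhouette coefficients for the current clustering
sample_scores = silhouette_samples(X_scaled, kmeans.labels_)
# average silhouette coefficient within each cluster
for cluster_label in np.unique(kmeans.labels_):
    cluster_mean = sample_scores[kmeans.labels_ == cluster_label].mean()
    print(f"Cluster {cluster_label}: mean silhouette {cluster_mean:.3f}")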
# calculate the silhouette score on the scaled data
score = silhouette_score(X_scaled, kmeans.labels_, metric='euclidean')
print('Silhouette Score: %.3f' % score)
The Yellowbrick machine learning visualisation library extends the Scikit-Learn API to make model selection and hyperparameter tuning easier. I first installed it using pip install yellowbrick.
# import the yellowbrick library
from yellowbrick.cluster import SilhouetteVisualizer
fig, ax = plt.subplots(2, 2, figsize=(15,8))
for i in [2, 3, 4, 5]:
    # Create a KMeans instance for each number of clusters
    km = skcl.KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    q, mod = divmod(i, 2)
    # Create a SilhouetteVisualizer instance with the KMeans instance
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    # Fit the visualizer to the data
    visualizer.fit(X)
From the above plots, either 2 or 3 clusters appears to be the optimal number. With 2 or 3 clusters the silhouette score for each cluster is above the average silhouette score and the fluctuations in size are smaller. The thickness of the silhouette plot for each cluster can help choose between 2 and 3 clusters: the thickness is more uniform with 3 clusters than with only 2 clusters, where one cluster is much thicker than the other.
We can see that k=4 or k=5 are not optimal, as some of the clusters have below-average silhouette scores and there are wide fluctuations in the thickness of the plots, particularly where k=5.
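As a numeric complement to the silhouette plots, the mean silhouette score can be computed for each candidate k. This is only a sketch and assumes the scaled data X_scaled used earlier:
# mean silhouette score for a range of k values on the scaled data
from sklearn.metrics import silhouette_score
import sklearn.cluster as skcl
for k in [2, 3, 4, 5]:
    km = skcl.KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    cluster_labels = km.fit_predict(X_scaled)
    print(k, silhouette_score(X_scaled, cluster_labels, metric='euclidean'))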
Scikit-learn has a metrics module that can be used for evaluating the performance of the clustering.
Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular, any evaluation metric should not take the absolute values of the cluster labels into account, but rather whether the clustering defines separations of the data similar to some ground truth set of classes, or satisfies some assumption such that members belonging to the same class are more similar than members of different classes according to some similarity metric.[37]
If we have the ground truth class assignments and the clustering algorithm assignment of the same samples, the adjusted Rand Index is a function that measures the similarity of the assignments.
The Rand index in data clustering is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements is the adjusted Rand index. From a mathematical standpoint the Rand index is related to accuracy, but it is applicable even when class labels are not used.
The Rand index measures the agreement between clusterings. It is a number between 0 and 1, where 0 represents little agreement between clusterings and 1 represents strong agreement. The adjusted Rand index adjusts for agreement due to chance; it can be negative but cannot be greater than 1.
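A small toy example (with made-up labels, not the Iris data) illustrates that the adjusted Rand index measures agreement between groupings rather than the label values themselves, and that an unrelated grouping scores around zero or below:
# toy example with made-up labels
from sklearn import metrics
true_toy = [0, 0, 1, 1, 2, 2]
pred_toy = [2, 2, 0, 0, 1, 1]  # the same grouping with different label names
print(metrics.adjusted_rand_score(true_toy, pred_toy))  # 1.0, perfect agreement
pred_bad = [0, 1, 0, 1, 0, 1]  # a grouping unrelated to the true labels
print(metrics.adjusted_rand_score(true_toy, pred_bad))  # negative, no better than chance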
The Species of Iris were encoded before performing K-means clustering. Iris-Setosa was encoded as 0, Iris-Versicolor as 1 and Iris-Virginica as 2. The resulting clusters were labelled 0, 1 and 2. However as mentioned earlier, the ordering of the cluster labels depends on the initialisation and the label for the cluster could change between runs of the k-means algorithm. The adjusted Rand Index function ignores the permutations and looks at the similarities of the two assignments. Perfect labelling is scored 1.0. The clustering here achieved a score of 0.62. This is as expected given that while the Setosas seem to have all been clustered together, the other two clusters contain a mixture of Versicolors and Virginicas.
# import metrics model
from sklearn import metrics
# the actual species labels
labels_true = yencoded
labels_true
# the assigned cluster labels
labels_pred = y_labels
labels_pred
# calculate the adjusted rand score.
metrics.adjusted_rand_score(labels_true, labels_pred)
The contingency matrix reports the intersection cardinality for every true/predicted cluster pair.[39] The first row of the matrix here shows the 50 data points in the first true class, the Iris Setosas, which have all been clustered together. The second and third rows show that most Versicolors and Virginicas have been clustered correctly, with a similar number of each clustered incorrectly: 11 Versicolors have been clustered with the Virginicas in cluster 0, while 14 Virginicas have been clustered with the Versicolors in cluster 2.
from sklearn.metrics.cluster import contingency_matrix
contingency_matrix(labels_true, labels_pred)
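To make the rows and columns of the contingency table easier to read, the same counts can be cross-tabulated with pandas. This is a sketch assuming the labels_true and labels_pred arrays defined above:
# the same contingency table with labelled rows and columns
import pandas as pd
pd.crosstab(pd.Series(labels_true, name='True species (encoded)'),
            pd.Series(labels_pred, name='Assigned cluster'))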
I had completed all the coding and write-up, bar a few small edits, before Task 4 was updated on 16th December 2020. I had interpreted the question on how accurate the k-means clustering is to mean how well the data points were clustered, and evaluated the clusters using various means such as the silhouette scores, Rand index and contingency tables. As k-means clustering is an unsupervised learning technique you usually do not have the actual labels, and it is not usually used for making predictions on new data. In supervised learning the algorithm learns from labelled training data and can be used to predict outcomes for unseen data, while unsupervised learning works on unlabelled data and is usually used for finding patterns in the data or discovering new features. For this reason I interpreted the prediction part of the task as written to mean how well the model predicted the clusters when compared to the actual groups of Iris species, as these are known for the Iris dataset. K-nearest neighbours, on the other hand, is a supervised learning algorithm that is used for classification and regression; it assumes that similar data points are close to each other.
KNN works by finding the distances between a query and all the examples in the data, selecting the specified number examples (K) closest to the query, then votes for the most frequent label (in the case of classification) or averages the labels (in the case of regression)[41]
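As a rough illustration of that description only (not the scikit-learn implementation used below), a k-nearest-neighbours classifier could be sketched as follows; knn_predict, X_train, y_train and query are hypothetical names introduced here:
# minimal sketch of the k-nearest-neighbours idea: find the k closest training
# points to the query and take a majority vote of their labels.
# X_train and y_train are assumed to be NumPy arrays; this is an illustration only.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]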
I am not going to get the chance to look at the k-nearest neighbours algorithm properly now, so I will just briefly use the example from the Jupyter learning notebook: first downloading the data again, then splitting the dataset into a training and test set with one third of the data held out for testing. The model is trained on the remaining two thirds of the data and uses the species or Class of Iris as the labels.
# first importing the following libraries
import numpy as np
import pandas as pd
# scikit-learn packages
import sklearn.neighbors as nei
import sklearn.model_selection as mod
# download the data
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# using the attribute information as the column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
# read into a pandas dataframe
iris = pd.read_csv(csv_url, names = col_names)
# Inputs and Outputs
inputs = iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']]
outputs = iris['Class']
# split the data into a training and testing dataset
inputs_train, inputs_test, outputs_train, outputs_test = mod.train_test_split(inputs, outputs, test_size=0.33)
Using the sklearn.neighbors.KNeighborsClassifier to build and fit the classifier:
# fit the classifier to the training dataset only
# A classifier model
knn = nei.KNeighborsClassifier(n_neighbors=5)
knn.fit(inputs_train, outputs_train)
obs5 = iris.iloc[5, 0:4].values # Setosa
obs55 = iris.iloc[55, 0:4].values # Versicolor
obs145 = iris.iloc[145, 0:4].values # Virginica
obs20 = iris.iloc[20, 0:4].values # Setosa
obs120 = iris.iloc[120, 0:4].values # Virginica
obs70 = iris.iloc[70, 0:4].values # Versicolor
print(knn.predict([obs5]))
print(knn.predict([obs55]))
print(knn.predict([obs145]))
print(knn.predict([obs20]))
print(knn.predict([obs120]))
print(knn.predict([obs70]))
# Evaluate how the model does
(knn.predict(inputs_test) == outputs_test).sum()
The model correctly predicted 47 of the 50 observations in the test set.
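For a fuller picture of performance than the raw count, scikit-learn's built-in scoring can be used. This is a sketch assuming the fitted knn model and the test split above:
# accuracy and a per-class breakdown for the fitted classifier
from sklearn import metrics
print(knn.score(inputs_test, outputs_test))  # mean accuracy on the test set
print(metrics.classification_report(outputs_test, knn.predict(inputs_test)))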
I didn't get time to go through the k-nearest neighbours algorithm in as much detail as I did with k-means, so I will come back to this at a later date. stats.stackexchange.com notes the key differences between the two algorithms as follows: [42]
K-means is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.
K-nearest neighbors is a classification (or regression) algorithm that in order to determine the classification of a point, combines the classification of the K nearest points. It is supervised because you are trying to classify a point based on the known classification of other points.
Here is the list of references used in this assignment. Any direct quotes are numbered. Other references were used to gain an understanding of the topics. All online resources were accessed between October 2020 and December 14th, 2020.
Python math module documentation at https://docs.python.org/3/library/math.html
[18] Website Introduction to Statistics - Onlinestatbook by David M. Lane, Online Edition, at http://onlinestatbook.com
Wikipedia contributors, "Chi-squared test — Wikipedia, the free encyclopedia," 2020. [Online; accessed 1-November-2020]. Available: https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096
Website Wikipedia: Pearson's Chi-squared Test at https://en.wikipedia.org
[33] Website K-means clustering with Scikit-learn by Rukshan Pramoditha at https://rukshanpramoditha.medium.com
[34] Website Scikit-learn k-means_silhouette_analysis at https://scikit-learn.org
[41] Website Machine Learning Basics with the K-Nearest Neighbors Algorithm
[42] Website Main differences between k-means and k-nearest neighbor at https://stats.stackexchange.com
Matplotlib. A comprehensive library for creating static, animated, and interactive visualizations in Python.
Website Latex in Jupyter Notebook at https://www.math.ubc.ca
Website Latex Tutorial at https://www.latex-tutorial.com
Some of the theory for Task 2 and Task 3 was based on a review of notes I took while studying statistics modules at UCD previously.