The purpose of this project is to investigate the Fisher Iris data set described above using Python code. Python is a high-level, interpreted, general-purpose programming language. The Python interpreter and its extensive standard library are freely available to all. Alongside the standard library, many third-party libraries extend Python and make it a powerful tool for data analytics and machine learning.
How to download this repository.
- Go to the URL for the repository on GitHub at https://github.com/angela1C/pands-project.git.
- Click the green Clone or download button.
Python 3
To be able to run this script, you need to have Python 3 installed. You can check this on the command line using python -V.
If you do not have Python 3 installed go to https://www.python.org/downloads/ and follow the instructions there.
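The same check can also be made from inside Python. The following is a minimal sketch using only the standard library; the helper name is_python3 is hypothetical and is not part of the project script.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3 or later."""
    return version_info[0] >= 3


# Show the full version string of the interpreter running this code.
print(sys.version)
print("Python 3 available:", is_python3())
```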
Python comes with a library of standard modules that can perform a wide range of tasks. These modules can be imported using the import statement. In addition to the standard modules, there are many third-party packages which enhance its functionality. I use some of these packages in this project, as outlined below, in particular the pandas package, which provides data structures and data analysis tools for the Python programming language. These packages can also be imported, but they first need to be installed on your system. See Installing Packages in the Python documentation.
pip is the package installer for Python and can be used to install packages from the PyPI repository of software for the Python programming language.
The pandas installation instructions recommend installing the package as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. With conda, pandas can be installed using conda install pandas.
The seaborn package can be installed using pip install seaborn or conda install seaborn. See the seaborn installation instructions.
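Before running the script, it can help to confirm that the third-party packages are importable. The sketch below uses importlib.util.find_spec from the standard library; the helper name missing_packages is hypothetical, and the list of required packages is an assumption based on the libraries named in this README.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Assumed list of packages, taken from the libraries named in this README.
required = ["pandas", "matplotlib", "seaborn"]
missing = missing_packages(required)

if missing:
    print("Install these before running the script:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any package reported as missing can then be installed with pip or conda as described above.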
The pandas library is the main Python library being used in this project. According to the pandas package overview:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is designed for working with data that is in a tabular format containing an ordered collection of columns where each column can have a different value type. This makes it ideal for exploring a structured tabular dataset such as Iris which contains several numerical columns and one categorical column.
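To illustrate how a DataFrame holds columns of different types, here is a small sketch. It builds a three-row table by hand rather than loading the project's data file, and the column names are assumptions chosen for illustration; they may differ from those used in project_iris.py.

```python
import pandas as pd

# Three hand-typed rows in the shape of the Iris data (column names assumed).
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Each column keeps its own type: floats for measurements, text for the class.
print(sample.dtypes)

# Select only the numeric columns and summarise them.
numeric = sample.select_dtypes(include="number")
print(numeric.describe())
```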
The seaborn library is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
The seaborn library is used for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures. It provides a high-level interface for creating nice-looking and informative plots. It has many useful features for examining relationships between multiple variables such as those in the Iris dataset.
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
How to run the Python code
To run the Python script, first navigate to the folder downloaded from this repository. The file project_iris.py is the main Python script for the project. (There are other .py files included in the repository for running smaller sections of the code independently.)
At the command line, enter python <program_name>, for example:
$ python project_iris.py
The Python program can also be run inside an IPython session using the %run command:
%run project_iris.py
I used ipython to run code interactively. This was very useful for testing sections of the script, rather than running the entire Python script after each change.
There are several plots produced by the script. I have saved these as .png files in the images folder of the repository. The plots can also be displayed on screen by uncommenting the plt.show() command. Some of these plots are shown in the README file.
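The pattern of saving a figure to a file and optionally showing it on screen can be sketched as follows. This is an illustration only, not the code from project_iris.py: the data points, the file name example_scatter.png and the temporary output folder are all assumptions, and the Agg backend is selected so that the example runs without opening a window.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Draw a small scatter plot from a few hand-typed values.
fig, ax = plt.subplots()
ax.scatter([1.4, 4.7, 6.0], [0.2, 1.4, 2.5])
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

# Save the figure as a .png file (here in the system temporary folder).
out_path = os.path.join(tempfile.gettempdir(), "example_scatter.png")
fig.savefig(out_path)

# plt.show()  # uncomment, and remove the Agg line above, to display the plot
plt.close(fig)
```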
The output of the Python script (excluding plot images) can be saved to a text file by appending > 'filename.txt' to the command, which redirects the output of the script to a file named 'filename.txt'.
For convenience, I have saved the output in this repository in a file called iris_output.txt.
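Printed output can also be sent to a file from within Python, as an alternative to shell redirection. The sketch below uses contextlib.redirect_stdout from the standard library; the report function is a hypothetical stand-in for the parts of a script that print results, and the file is written to the system temporary folder rather than to the repository.

```python
import contextlib
import os
import tempfile


def report():
    """Stand-in for a section of a script that prints its results."""
    print("first line of output")
    print("second line of output")


out_path = os.path.join(tempfile.gettempdir(), "filename.txt")

# Everything printed inside the with block is written to the file instead.
with open(out_path, "w") as handle, contextlib.redirect_stdout(handle):
    report()

# Read the file back to confirm what was saved.
with open(out_path) as handle:
    saved = handle.read()
```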
Loading Python libraries
The libraries mentioned above must first be imported before they can be used by the script.
The pandas library is imported at the very start of the script using import pandas as pd, where pd is a shorter alias that is used by convention to save having to write pandas each time it is used. Therefore, wherever pd is used in the script, it refers to the pandas library. Similarly, the seaborn library is imported using the alias sns and thereafter referred to as sns, and matplotlib.pyplot is imported as plt. Once these packages are imported, all of their functions can be used by the script.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Getting help in Python
To get help on any Python command, use the Python help function, as outlined in the Python help command document, with the name of the command in parentheses. For example, help(pd) will show help on the pandas package, while help(pd.DataFrame.describe) provides help on the describe method of the pandas DataFrame.
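The text that help displays comes from the object's docstring, which can also be read directly. The following is a small sketch using inspect.getdoc from the standard library. The method is passed as an object, without a trailing pair of parentheses, because calling pd.DataFrame.describe() with no DataFrame raises a TypeError.

```python
import inspect

import pandas as pd

# Fetch the docstring of the describe method without calling the method.
doc = inspect.getdoc(pd.DataFrame.describe)

# Print just the first line, which is a one-sentence summary.
print(doc.splitlines()[0])

# help(pd.DataFrame.describe)  # shows the full text interactively
```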
The documentation pages for each of the Python packages used in this project provide details of all the commands for that package. I found these resources valuable and referred to them extensively over the course of the project, both when looking for a function to do something in particular and when getting started with the packages, as the documentation is comprehensive and outlines the different functions of each package.
- Pandas.pydata documentation
- Matplotlib documentation
- Seaborn.pydata documentation
- Python 3 documentation
Version Control
- GitHub and Git are used to manage the project.