
Importing and viewing the Iris dataset using pandas



The Iris Data Set is available from the UC Irvine Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Iris in csv format as described above in section 2. I used Python’s pandas library to import the csv file.

Using pandas, tabular data can be imported as a DataFrame object. A pandas DataFrame represents a rectangular table of data containing an ordered collection of columns and each column can have a different value type. The Iris data set contains four numerical columns for the petal and sepal measurements and one categorical column for the class or type of iris.

The pandas read_csv function loads delimited data from a file, URL or file-like object, using the comma as the default delimiter, and creates a DataFrame object. Once a pandas DataFrame object is created, it has many attributes and methods that can be used on it.

The Iris data set can be read in directly from the URL at https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data, or alternatively it can be saved locally and read in by specifying the file path.
In the script, I read the csv file from the URL as part of the script itself. (The csv file containing the Iris data set is also saved into this project’s repository for convenience.)
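The two options above (read from the URL, or read a local copy) can be combined. The following is only a sketch: the function name load_iris and the idea of falling back to a local copy are my own illustration, not part of the project script. It relies on the fact that both a failed URL request and a missing file raise an OSError subclass.

```python
import pandas as pd


def load_iris(source, fallback):
    """Read the Iris csv from `source`; if that fails, read `fallback` instead.

    Both arguments can be a URL or a local file path (hypothetical helper).
    """
    try:
        # header=None because the raw file has no header row
        return pd.read_csv(source, header=None)
    except OSError:
        # urllib's URLError and FileNotFoundError are both OSError subclasses
        return pd.read_csv(fallback, header=None)
```

It could then be called with the UCI URL as source and the path of the copy saved in the repository as fallback.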

The raw csv file at the UCI Machine Learning Repository does not include the attribute information. However, this can be found on the data set's page under the section Iris Data Set: Attribute Information, which lists the 4 measurement attributes (sepal length in cm, sepal width in cm, petal length in cm, petal width in cm) and the three classes (Iris Setosa, Iris Versicolor and Iris Virginica).

As outlined in the documentation at pandas-csv-text-files, read_csv has various options for specifying what column names to use. Often the first row of a csv file contains the column names. If column names are not passed to read_csv, by default it looks at the first row of the data and infers the column names from that row. If the column names are in another row, you can specify this with an optional argument. In the Python script, I create a list containing the names to be used and then pass it to read_csv as the names argument.
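To see what the header options do without touching the network, here is a small sketch that reads two rows laid out like iris.data from an in-memory string. The variable names raw, inferred, numbered and named are just for illustration.

```python
import io

import pandas as pd

# Two rows in the same layout as iris.data (the file has no header row).
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']

# Default behaviour: the first row is used up as the header, leaving one data row.
inferred = pd.read_csv(io.StringIO(raw))

# header=None: no row is treated as a header and the columns are numbered 0 to 4.
numbered = pd.read_csv(io.StringIO(raw), header=None)

# names=...: the labels are supplied and both rows are kept as data.
named = pd.read_csv(io.StringIO(raw), names=col_names)

print(inferred.shape, numbered.shape, named.shape)  # (1, 5) (2, 5) (2, 5)
```

This is why the Iris file needs either header=None or names: with the defaults, the first flower would silently become the column labels.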

The data types for each column will be inferred by the read_csv function if they are not explicitly provided through its dtype argument. Row labels can be taken from a column of the file using the index_col argument and column labels can be supplied using names; otherwise a default integer index is created and the column labels are taken from the input data.
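As a rough illustration of inferred versus explicit data types, the sketch below reads two sample rows once with the defaults and once with the dtype argument of read_csv. Choosing 'category' for the class column is my own example, not something the project script does.

```python
import io

import pandas as pd

# Two rows in the same layout as iris.data, held in memory for the example.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']

# Let pandas infer the types: the four measurements become float64
# and the class column is read as text.
inferred = pd.read_csv(io.StringIO(raw), names=col_names)

# Provide the type of one column explicitly instead of relying on inference.
explicit = pd.read_csv(io.StringIO(raw), names=col_names,
                       dtype={'Class': 'category'})

print(inferred.dtypes)
print(explicit.dtypes)
```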

There are other optional parameters which can be set for the read_csv function and these can be found using ?pd.read_csv in IPython (or help(pd.read_csv) in plain Python) or, as mentioned above, on the documentation pages at pandas-csv-text-files. My Python code for reading in the Iris data set does the following:

  1. Create csv_url and assign to it the URL where the data set is available, 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'.
  2. Create a list of column names, col_names, using the iris attribute information.
  3. Create a pandas DataFrame object called iris.
      import pandas as pd

      csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
      # using the attribute information as the column names
      col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
      iris = pd.read_csv(csv_url, names = col_names)

Having loaded the iris data set, the resulting iris DataFrame can be viewed using the DataFrame methods head() and tail() to see the first 5 rows and the last 5 rows respectively. The data types can be checked to ensure they were correctly inferred using dtypes. Use the print function with these to print the results to the screen.

print(iris.head())
print(iris.tail())
print(iris.dtypes)
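Both head() and tail() also accept the number of rows to return. A small sketch, using seven rows in the layout of iris.data as a stand-in for the full file (the name sample is just for this example):

```python
import io

import pandas as pd

# Seven rows in the layout of iris.data, used instead of the full download.
raw = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
"""
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']
sample = pd.read_csv(io.StringIO(raw), names=col_names)

print(sample.shape)        # (7, 5): rows, columns
print(len(sample.head()))  # 5: the default number of rows
print(sample.head(3))      # the first 3 rows
print(sample.tail(2))      # the last 2 rows
```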

A DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type. A DataFrame has both a row index and a column index and can be thought of as a dict of Series, all sharing the same index. A column in a DataFrame can be retrieved as a Series. When the Iris data is read in, the columns of the resulting DataFrame have different dtypes.
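The "dict of Series" idea can be sketched directly. The values below are illustrative only and the variable names are my own:

```python
import pandas as pd

# Two Series that share the same default index (0, 1, 2).
sepal_length = pd.Series([5.1, 7.0, 6.3])
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

# Building a DataFrame from a dict of Series: the keys become the column names.
df = pd.DataFrame({'Sepal_Length': sepal_length, 'Species': species})

# Retrieving a column gives back a Series that carries the shared row index.
col = df['Sepal_Length']
print(type(col).__name__)          # Series
print(df.index.equals(col.index))  # True
print(df.dtypes)                   # one dtype per column
```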

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# header = None stops the first row of data being used as the column names
iris = pd.read_csv(csv_url, header = None)

The csv file at the UCI repository does not contain the variable names. They are located in a separate file.

col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']

# read in the dataset from the UCI Machine Learning Repository link and specify column names to use
# save as iris
iris = pd.read_csv(csv_url, names = col_names)
# The columns of the resulting DataFrame have different dtypes.
iris.dtypes

Viewing Data

You can view the top and bottom of the DataFrame, as well as its index and column names.

iris.head()
# View the index of the DataFrame
iris.index
RangeIndex(start=0, stop=150, step=1)
# View the columns of the DataFrame
iris.columns
Index(['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width',
       'Species'],
      dtype='object')
# sorting by an axis
iris.sort_index(axis=1, ascending=False).head(10)
# sorting by values
iris.sort_values(by='Petal_Width').head(10)
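The sorting calls above have a few options worth spelling out. This is a sketch on a tiny made-up DataFrame (the values and the names df, widest_first, by_both and cols_reversed are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Petal_Length': [1.4, 4.7, 6.0, 1.3],
    'Petal_Width': [0.2, 1.4, 2.5, 0.2],
})

# Sort by values in descending order: the widest petal comes first.
widest_first = df.sort_values(by='Petal_Width', ascending=False)
print(widest_first.index[0])  # 2

# Sort by width, using length to break ties between equal widths.
by_both = df.sort_values(by=['Petal_Width', 'Petal_Length'])
print(list(by_both.index))    # [3, 0, 1, 2]

# axis=1 sorts the column labels rather than the rows.
cols_reversed = df.sort_index(axis=1, ascending=False)
print(list(cols_reversed.columns))  # ['Petal_Width', 'Petal_Length']
```

Sorting returns a new DataFrame and the original row labels travel with the rows, which is why the index is no longer in order afterwards.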

Selection

Getting

You can select a single column, which yields a Series. Selecting via [] with a slice selects rows.

It is recommended to use the optimized pandas data access methods, .at, .iat, .loc and .iloc.

# Selecting a single column, which yields a Series, equivalent to iris.Sepal_Length
# Labels can be used to select a value or a set of values
iris['Sepal_Length'].head()


# Selecting via [], which slices the rows.
# Looking here at the first 5 rows
iris[0:5]

Selection by Label

# Selecting on a multi-axis by label
# (loc slices include the end label, so this returns rows 0 to 10 inclusive)
iris.loc[0:10, ['Sepal_Length', 'Petal_Length']]
# reduction in the dimensions of the returned object
iris.loc[0, ['Sepal_Length', 'Petal_Length']]
# a scalar value
iris.loc[0, 'Petal_Length']
# retrieve a column of the dataframe using a dict-like notation
iris['Petal_Length'].head()
# retrieve a column of data by attribute
iris.Sepal_Length.head()
# retrieve a single row as a Series using the loc attribute
iris.loc[0]

Selection by position

# Selection by position
iris.iloc[0:3, 0:4]
# getting a scalar value by position using iat
iris.iat[0,0]
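One difference between label-based and position-based selection is easy to miss: a loc slice includes its end label, while an iloc slice excludes its end position. A minimal sketch on a made-up four-row DataFrame (names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Sepal_Length': [5.1, 4.9, 4.7, 4.6],
                   'Petal_Length': [1.4, 1.4, 1.3, 1.5]})

# By label: rows labelled 0, 1 and 2 (the end label is included).
by_label = df.loc[0:2, 'Sepal_Length']

# By position: rows at positions 0 and 1 (the end position is excluded).
by_position = df.iloc[0:2, 0]

print(len(by_label), len(by_position))  # 3 2

# at and iat return a single scalar, by label and by position respectively.
print(df.at[0, 'Petal_Length'], df.iat[0, 1])  # 1.4 1.4
```

With the default integer index the labels and positions look the same, so the two forms only differ in how the end of the slice is treated.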

Tech used:
  • Python
  • pandas
  • seaborn