Pandas
By Angela C
March 3, 2021
Reading time: 6 minutes.
pandas is one of the most commonly used libraries in Python for data analysis and is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
The pandas user guide and the 10 minutes to pandas tutorial in particular is the starting point. The tutorial outlines how to create data structures in pandas, get data into and out of pandas, view data, edit and filter data, merge datasets as well as how to perform group operations and to reshape data. pandas also has time-series and plotting functionality. The two main data structures are a Series
which is a one-dimensional labeled array capable of holding any data type. A pandas ‘DataFrame’ is a 2-dimensional labeled data structure with columns of potentially different types. A DataFrame is like a spreadsheet or SQL table.
With pandas you can create DataFrames out of a dataset or subsets of a dataset. Data can be read in from a variety of file formats including CSV and text files, Excel, JSON, HTML, SQL etc.
Transformations can be applied to the rows and / or columns of a DataFrame.
-
DataFrame()
function- data
- index
- rows
- columns
-
pandas automatically creates an index.
- set index to None to stop this
-
shape()
to get the shape or size of the dataset -
index
to get the index of the dataframe -
read_csv
to read from a csv file -
read_excel
,read_json
,read_html
to read from other formats- can specify column
names
- can specify if the dataset has a
header
to use as the column names or not and which row to use - can specify how to parse dates using
parse_dates
argument - there are any other arguments can be provided
- can specify column
-
export data from a DataFrame to various file formats
to_csv
to_excel
to_json
to_html
- these functions can take arguments such as
encoding
for the encoding type such as “utf-8”sep
for symbols to separate on such as\t
write_excel
requires you to create an instance of the writer object first
-
Select components of the dataframe:
loc
to select rows and columns using labelsiloc
to select rows and columns using the index of the entries.- can select columns using their name
-
Add new rows and columns
- can use
loc
to add new rows and assign values to the new rows - can use
loc
to add new columns and provide the name of the new column
- can use
-
Add an index using
set_index()
-
Deleting rows and columns
drop()
function which takesaxis
andinplace
argumentsaxis
to indicate which axis to delete,0
for rows,1
for columns.
inplace
to indicate if the changes effects the existing dataframe or not. The default is not to change the existing dataframe.- can create a new dataframe with the changes
- or set
inplace=True
to make the changes apply to the existing dataframe.
-
Rename rows and columns and the index
.rename()
takes a dictionary with the old and new valuesinplace=True
to make the change in the existing dataframe
df = df.rename(columns = {'year': 'Year', 'maxtp' : 'max-tp', 'mintp' : 'min-tp'} )
-
Reset the index:
reset_index()
function to revert to the original index if you have made changes
-
Reshape dataframes:
pivot()
to create a new table with selected values and columnsstack()
stacks the columns so that the columns become part of multi-level indices.unstack()
melt()
-
Iterating over dataframe rows:
iterrows()
with afor
loopfor idx, row in dataframe.iterrows(): do something
Plotting with Pandas
-
import matplotlib library *
import matplotlib.pyplot as plt
plt.show()
to show the plot
-
Some examples:
- bar chart:
df.value_counts(sort=False).plot.bar()
to plot the values of a dataframedf
as a bar chart - histogram:
df.value_counts(sort=True).plot.hist(orientation='vertical')
to plot a histogram vertically. The default is horizontal. - area plot :
df.plot.area(stacked=False)
to show how values change with respect to other variables.stacked=True
to stack rather than overlap. - pie chart:
df.value_counts(sort=False).plot.pie()
- density plot:
df.value_counts.plot.kde()
to plot the probability distribution of a variable - scatter matrix: to generate a pair-wise scatter matrix for selected variables.
alpha
to set the transparencyfigsize
to set the size of the figurediagonals
to specify what should be shown on the diagonals ( instead of a variable plotted against itself)bootstrap_plot
- bar chart:
-
can set the relative size of the figure using the
figsize
argument.
Reshaping data, Grouping data, Aggregating data
In a long format dataframe, each row is a complete and independent representation. In a wide dataframe, categorical values are grouped.
pd.pivot_table()
and pd.pivot()
-
pd.pivot_table()
: To transform a long-format dataframe to wide format. Create a spreadsheet-style pivot table as a DataFrame. -
pd.pivot_table()
is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame. -
index
: the variables to remain untouched -
columns
: the variables to be spread across more columns -
values
: the numerical values to be aggregated or processed
The output of pivot_table()
is a DataFrame with a multi-index. This can be transformed to a regular index using reset_index()
and rename_axis()
methods.
Some columns might be better represented as column names instead of values.
The output of pivot()
is a DataFrame with a multi-index. This can be transformed to a regular index using reset_index()
and rename_axis()
methods.
.rename_axis()
sets the name of the axis for the index or columns.
df.pivot
: Pivot without aggregation that can handle non-numeric data
pd.pivot
to pivot a dataframe spreading rows into columns
Melting dataframes using the .melt
method
-
pd.melt()
to transform wide to long.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set. -
id_vars
are values to keep as rows, duplicated as needed. -
value_vars
are columns to be taken and made as values, melted into a new column. If thevalue_vars
is not specified, then all columns that are not included inid_vars
will be used asvalue_vars
. -
var_name
is optional
Group by
The groupby
method allows you to group rows of data together and aggregation functions to be callled on the grouped rows.
'by'
takes a list with the columns you are interested to group.
Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby
operation involves some combination of splitting the object, applying a function, and combining the results.
This can be used to group large amounts of data and compute operations on these groups.
The .agg()
method can be used to chain multiple methods to group by
- Use
.agg()
method to groupby and chain multiple methpds - provide the aggregation functions as arguments in a list
- custom functions can also be used when grouping and aggregating
- you can pass a tuple containing the name of the variable to aggregate and the aggregate function to use as well as the name of the resulting column
Stacking and Unstacking
.apply()
to apply a function along an axis of the dataframe
df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
Apply a function along an axis of the dataframe. The function can be a built-in function or a custom function. The function is applied / broadcast to the dataframe and assigned to a series or a dataframe.
-
func
: Function to apply to each column or row. -
axis
: the axis along which the function is applied. -
0 or ‘index’: apply function to each column.
-
1 or ‘columns’: apply function to each row.
-
raw
: bool, default False Determines if row or column is passed as a Series or ndarray object.