Python for Data Analysis
By Angela C
October 1, 2021
Reading time: 6 minutes.
Some notes on various Python libraries used for data analytics.
Some basic Python
- Variable assignment using the = operator.
- Calculations can be performed with variables.
- Data types such as strings (str), integers (int), floats (float) and booleans (bool).
- Many Python libraries, including pandas for data analysis, numpy for scientific computing, matplotlib and seaborn for 2-d plotting, scikit-learn for machine learning, plotly for interactive visualisations, dash (Plotly Dash) for dashboards, and many more.
Strings in Python
There are various methods for working with strings, including .upper(), .lower(), .title(), .replace(), .count(), .strip() etc.
See post on Strings in blog section. Also String partitioning
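As a quick illustration of the string methods listed above, here is a minimal sketch. The sample text is made up for the example.

```python
title = "  python for data analysis  "

# strip() removes leading and trailing whitespace
cleaned = title.strip()

print(cleaned.upper())                      # PYTHON FOR DATA ANALYSIS
print(cleaned.title())                      # Python For Data Analysis
print(cleaned.replace("python", "pandas"))  # pandas for data analysis
print(cleaned.count("a"))                   # 4

# Strings are immutable, so each method returns a new string
print(cleaned)                              # python for data analysis
```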
Lists
Subsets
- mylist[1]: select the item at index 1
- mylist[-2]: select the second-last item
Slices
- mylist[1:4]: select items from index 1 up to (but not including) index 4
- mylist[:3]: select items before index 3
- mylist[:]: a (shallow) copy of the list
Lists of lists can be subset
- mylist[0][2]: select the list at index 0, then from it select the item at index 2
- mylist[1][:3]: select the list at index 1, then from it select the items before index 3
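The indexing and slicing rules above can be tried out with a small sketch like this one. The list contents and the name nested are only for illustration.

```python
mylist = ["a", "b", "c", "d", "e"]

print(mylist[1])     # b
print(mylist[-2])    # d
print(mylist[1:4])   # ['b', 'c', 'd']
print(mylist[:3])    # ['a', 'b', 'c']

# A full slice makes a new list object with the same items
duplicate = mylist[:]
print(duplicate is mylist)  # False

# Subsetting a list of lists: first pick the inner list, then index into it
nested = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(nested[0][2])   # 3
print(nested[1][:3])  # [5, 6, 7]
```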
List operations
- list1 + list2: join two lists together into a new list
- list1 * 3: repeat a list three times
List methods
Lists are mutable. If you don’t want to change a list, make a copy first (for example with mylist.copy()) and work on that. Apart from index() and count(), which only return a value, all the methods below change the original list in place.
- mylist.index('a'): get the index of the first occurrence of an item in the list
- mylist.count('a'): count occurrences of an item in the list
- mylist.append('z'): append an item to the end of a list
- mylist.extend(['x', 'y']): append each item of an iterable to the end of the list
- mylist.remove('z'): remove the first occurrence of an item from a list
- del mylist[0:2]: remove the items before index 2 (at index 0 and 1)
- mylist.reverse(): reverse a list
- mylist.pop(): remove and return the last item from the list
- mylist.pop(-2): remove and return the second-last item
- mylist.insert(1, 'x'): insert an item at index 1. The index must be provided
- mylist.sort(): sort a list
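Here is a short sketch that walks through the list operations and methods in order, with the state of the list shown after each step. The starting values are arbitrary.

```python
list1 = ["a", "b", "c"]
list2 = ["d", "a"]

combined = list1 + list2   # ['a', 'b', 'c', 'd', 'a']
repeated = list2 * 3       # ['d', 'a', 'd', 'a', 'd', 'a']

# Work on a copy so that `combined` itself is left unchanged
mylist = combined.copy()

print(mylist.index("a"))   # 0 (first occurrence)
print(mylist.count("a"))   # 2

mylist.append("z")         # ['a', 'b', 'c', 'd', 'a', 'z']
mylist.extend(["x", "y"])  # ['a', 'b', 'c', 'd', 'a', 'z', 'x', 'y']
mylist.remove("z")         # ['a', 'b', 'c', 'd', 'a', 'x', 'y']
del mylist[0:2]            # ['c', 'd', 'a', 'x', 'y']
mylist.reverse()           # ['y', 'x', 'a', 'd', 'c']
last = mylist.pop()        # returns 'c', leaving ['y', 'x', 'a', 'd']
second_last = mylist.pop(-2)  # returns 'a', leaving ['y', 'x', 'd']
mylist.insert(1, "x")      # ['y', 'x', 'x', 'd']
mylist.sort()              # ['d', 'x', 'x', 'y']

print(mylist)
print(combined)            # unchanged: ['a', 'b', 'c', 'd', 'a']
```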
Some Python packages
Numpy
Numpy: A Python library for creating and manipulating vectors and matrices. It is the core library for scientific computing in Python. Numpy provides high-performance multi-dimensional array objects and the tools for working with these arrays.
Numpy is usually imported using the alias np.
import numpy as np
Arrays can be created using np.array()
a = np.array([1,3,5,7])
b = np.array([[3,6,8,9], [2,5,7,9]])
c = np.array([[1.3, 3.5], [3, 5.4]], dtype=float)
Placeholders can be used:
- np.zeros(): create an array of zeros
- np.ones(): create an array of ones
- np.arange(): create an array with start, stop and optional step parameters.
np.arange(2, 20, 2)
array([ 2, 4, 6, 8, 10, 12, 14, 16, 18])
- np.linspace(): create an array of evenly spaced values
np.linspace(0,5, 10)
array([0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ])
- np.full(): create a constant array
np.full((2,2),8)
array([[8, 8],
[8, 8]])
- np.eye(): create an identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
- np.random.random(): create an array of random values. See post on Numpy Random project.
np.random.random((2,3))
array([[0.6447133 , 0.30962173, 0.87798529],
[0.01284644, 0.03490406, 0.84898446]])
- np.empty(): create an array of uninitialized (arbitrary) data of the given shape, dtype, and order.
np.empty((2,4))
array([[3., 6., 8., 9.],
[2., 5., 7., 9.]])
Inspecting arrays:
- a.shape: the array dimensions
- len(a): the length of the array (the size of its first dimension)
- a.ndim: the number of array dimensions
- a.size: the number of elements
- a.dtype: the data type of the elements
- a.dtype.name: the name of the data type
- a.astype(int): convert to another data type
Note that shape, ndim, size and dtype are attributes rather than methods, so they are written without parentheses.
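A small sketch of inspecting an array, reusing the 2 by 4 array b created earlier in these notes:

```python
import numpy as np

b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

# These are attributes, so no parentheses
print(b.shape)   # (2, 4)
print(b.ndim)    # 2
print(b.size)    # 8
print(b.dtype)   # default integer type, which depends on the platform

# len() gives the size of the first dimension only
print(len(b))    # 2

# astype() returns a new array with the requested data type
c = b.astype(float)
print(c.dtype.name)  # float64
```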
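Numpy Data Types:
- np.int64: signed 64-bit integer
- np.float64: standard double-precision floating point (np.float32 is single precision)
- np.complex128: complex numbers represented by two 64-bit floats
- np.bool_: boolean True and False values
- np.object_: Python object type
- np.bytes_: fixed-length byte string type (formerly np.string_)
- np.str_: fixed-length unicode type (formerly np.unicode_)
The older aliases np.complex, np.object, np.string_ and np.unicode_ have been removed from recent Numpy releases, so the names above are the safer choice.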
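To make the sizes of these types concrete, here is a minimal sketch that checks how many bytes each element takes. The variable names are only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.float64)    # double precision
y = x.astype(np.float32)                     # single precision
z = np.array([1 + 2j], dtype=np.complex128)  # two 64-bit floats per element
flags = np.array([True, False], dtype=np.bool_)

# itemsize is the number of bytes used by one element
print(x.dtype.itemsize)   # 8
print(y.dtype.itemsize)   # 4
print(z.dtype.itemsize)   # 16
print(flags.dtype.name)   # bool
```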
Operations for performing maths on arrays:
- np.add(a, b): same as a + b
- np.subtract(a, b): same as a - b
- np.divide(a, b): same as a / b
- np.multiply(a, b): same as a * b
- np.exp(a): element-wise exponentials
- np.sqrt(b): element-wise square roots
- np.cos(a), np.sin(a) etc.: element-wise cosines, sines etc.
- np.log(a): element-wise natural logarithm
- a.dot(b): dot product of two arrays
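The sketch below shows that the function forms and the operators give the same element-wise results. The two small arrays are chosen so that the results are easy to check by hand.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([2.0, 2.0, 3.0])

print(np.add(a, b))       # [ 3.  6. 12.]
print(np.subtract(a, b))  # [-1.  2.  6.]
print(np.multiply(a, b))  # [ 2.  8. 27.]
print(np.divide(a, b))    # [0.5 2.  3. ]
print(np.sqrt(a))         # [1. 2. 3.]

# The dot product multiplies element-wise and then sums: 2 + 8 + 27
print(a.dot(b))           # 37.0
```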
Comparison operators
- a == b: element-wise comparison, resulting in an array of True and False values
- a > 2: element-wise comparison against a single value
- np.array_equal(a, b): array-wise comparison, resulting in a single True or False
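A minimal sketch of the difference between element-wise and array-wise comparison, using two made-up arrays of the same length:

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([1, 2, 5, 8])

# Element-wise: one boolean per position
print(a == b)   # [ True False  True False]
print(a > 2)    # [False  True  True  True]

# Array-wise: a single boolean for the whole comparison
print(np.array_equal(a, b))         # False
print(np.array_equal(a, a.copy()))  # True
```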
Aggregate functions
Array-wise aggregation examples:
- a.sum()
- a.min()
The axis can be specified:
- a.min(axis=0): minimum value of each column (aggregating down the rows)
- a.max(axis=1): maximum value of each row (aggregating across the columns)
- a.cumsum(): cumulative sum
- a.mean() or np.mean(a): mean
- a.std() or np.std(a): standard deviation
- np.median(a): median
- np.corrcoef(a): correlation coefficients
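The axis argument is easy to get backwards, so here is a small sketch on the 2 by 4 array b from earlier. With axis=0 the result has one value per column, and with axis=1 one value per row.

```python
import numpy as np

b = np.array([[3, 6, 8, 9],
              [2, 5, 7, 9]])

print(b.sum())        # 49
print(b.min())        # 2

# axis=0 collapses the rows, leaving one value per column
print(b.min(axis=0))  # [2 5 7 9]

# axis=1 collapses the columns, leaving one value per row
print(b.max(axis=1))  # [9 9]

# Without an axis, cumsum works on the flattened array
print(b.cumsum())     # [ 3  9 17 26 28 33 40 49]

print(b.mean())       # 6.125
print(np.median(b))   # 6.5
```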
Copying arrays
- np.copy(a) or a.copy(): create a copy of the array with its own data
- a.view(): create a view of the array that shares the same data
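The difference between a copy and a view shows up as soon as the original array is modified. A minimal sketch, using np.shares_memory (not mentioned in the notes above) to confirm which arrays share data:

```python
import numpy as np

a = np.array([1, 3, 5, 7])
v = a.view()   # new array object looking at the same data
c = a.copy()   # new array object with its own data

a[0] = 100

print(v[0])    # 100, the view sees the change
print(c[0])    # 1, the copy does not

print(np.shares_memory(a, v))  # True
print(np.shares_memory(a, c))  # False
```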
Sorting arrays
- a.sort(): sort an array in place (along the last axis by default)
- a.sort(axis=0) or a.sort(axis=1): sort along the given axis
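A short sketch of sorting along each axis. It uses np.sort, which is an addition to the notes above: it returns a sorted copy, whereas the sort method changes the array in place.

```python
import numpy as np

m = np.array([[9, 2, 7],
              [1, 8, 3]])

# np.sort returns a new array and leaves m untouched
by_row = np.sort(m, axis=1)      # sort within each row
by_column = np.sort(m, axis=0)   # sort within each column

print(by_row)      # [[2 7 9]
                   #  [1 3 8]]
print(by_column)   # [[1 2 3]
                   #  [9 8 7]]

# The method form sorts in place, along the last axis by default
m.sort()
print(m)           # [[2 7 9]
                   #  [1 3 8]]
```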
Subsetting, slicing, indexing
Subsetting and slicing using [].
- a[:3]: select elements up to index 3 (at index 0, 1 and 2)
- a[1, 2]: select the element at row 1, column 2. This is the same as a[1][2]
- a[:, 1]: select all rows at column 1
- a[::-1]: reverse an array
- a[a > 2]: boolean indexing, selecting the elements greater than 2
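The selections above can be sketched on the one-dimensional array a and the two-dimensional array b created earlier in these notes:

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([[3, 6, 8, 9],
              [2, 5, 7, 9]])

print(a[:3])     # [1 3 5]
print(a[::-1])   # [7 5 3 1]

# Row 1, column 2 of a 2-d array, written two equivalent ways
print(b[1, 2])   # 7
print(b[1][2])   # 7

# All rows, column 1
print(b[:, 1])   # [6 5]

# Boolean indexing keeps the elements where the condition is True
print(a[a > 2])  # [3 5 7]
```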
Array Manipulation
Transposing an array
- np.transpose(a): same as a.T
- a.ravel(): flatten an array
Changing array shape
- a.reshape(2, 3): reshape an array with 6 elements to a 2 by 3. If the size of one dimension is unknown, pass -1 for it and Numpy works it out from the number of elements, for example a.reshape(2, -1).
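A minimal sketch of reshaping, transposing and flattening, starting from a made-up array of six elements:

```python
import numpy as np

a = np.arange(6)        # [0 1 2 3 4 5]

r = a.reshape(2, 3)
print(r)                # [[0 1 2]
                        #  [3 4 5]]

# -1 asks Numpy to infer that dimension from the number of elements
inferred = a.reshape(2, -1)
print(inferred.shape)   # (2, 3)

# Transposing swaps rows and columns
print(r.T)              # [[0 3]
                        #  [1 4]
                        #  [2 5]]
print(np.array_equal(np.transpose(r), r.T))  # True

# ravel() flattens back to one dimension
print(r.ravel())        # [0 1 2 3 4 5]
```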
Adding or removing elements
- a.resize(): change the shape of the array in place (np.resize() returns a new array with the specified shape)
- np.append(a, b): append items to an array
- np.insert(): insert items into an array
- np.delete(): delete items from an array
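Here is a short sketch of these functions on a small array. It uses the function form np.resize, which returns a new array, and the values are arbitrary. Each call returns a new array and leaves a itself unchanged.

```python
import numpy as np

a = np.array([1, 3, 5, 7])

appended = np.append(a, [9, 11])   # [ 1  3  5  7  9 11]
inserted = np.insert(a, 1, 2)      # insert the value 2 before index 1: [1 2 3 5 7]
deleted = np.delete(a, [0])        # drop the item at index 0: [3 5 7]

# np.resize repeats the data if the new shape needs more elements
resized = np.resize(a, (2, 3))     # [[1 3 5]
                                   #  [7 1 3]]

print(appended, inserted, deleted)
print(resized)
print(a)                           # [1 3 5 7]
```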
Combining arrays
- np.concatenate((a, b), axis=0): join arrays along an existing axis
- np.vstack((a, b)): stack arrays vertically (row-wise)
- np.hstack((a, b)): stack arrays horizontally (column-wise)
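Splitting arrays
- np.hsplit(a, 2): split an array horizontally (column-wise) into 2 equal parts. Use np.hsplit(a, [2]) to split at index 2
- np.vsplit(a, 2): split an array vertically (row-wise) into 2 equal parts. Use np.vsplit(a, [2]) to split at index 2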
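Combining and splitting are roughly inverse operations, which the sketch below tries to show with two made-up 2 by 2 arrays. Note that an integer second argument to np.hsplit or np.vsplit means a number of equal parts, while a list means the indices to split at.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

v = np.vstack((x, y))   # 4 rows, 2 columns
h = np.hstack((x, y))   # 2 rows, 4 columns
print(np.array_equal(np.concatenate((x, y), axis=0), v))  # True
print(h)                # [[1 2 5 6]
                        #  [3 4 7 8]]

# An integer splits into that many equal parts
left, right = np.hsplit(h, 2)
top, bottom = np.vsplit(v, 2)
print(np.array_equal(left, x), np.array_equal(right, y))  # True True
print(np.array_equal(top, x), np.array_equal(bottom, y))  # True True

# A list splits at the given column index instead
first, rest = np.hsplit(h, [1])
print(first.shape, rest.shape)   # (2, 1) (2, 3)
```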
Pandas - for data wrangling
In a long format dataframe, each row is a complete and independent representation. In a wide dataframe, categorical values are grouped.
pd.pivot_table() and pd.pivot()
- pd.pivot_table(): transforms a long-format dataframe to wide format, creating a spreadsheet-style pivot table as a DataFrame.
- pd.pivot_table() is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
- index: the variables to remain untouched
- columns: the variables to be spread across more columns
- values: the numerical values to be aggregated or processed
When several keys are passed, the output of pivot_table() is a DataFrame with a multi-index. This can be transformed to a regular index using the reset_index() and rename_axis() methods.
Some columns might be better represented as column names instead of values.
- df.pivot(): pivot without aggregation, which can also handle non-numeric data. It spreads rows into columns and is also available as pd.pivot().
The output of pivot() can be flattened to a regular index in the same way, using the reset_index() and rename_axis() methods.
- .rename_axis(): sets the name of the axis for the index or columns.
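A minimal sketch of going from long to wide format and then back to a flat index. The city, year and sales data is made up for illustration, and aggfunc="sum" is just one possible choice of aggregation.

```python
import pandas as pd

# Long format: one row per city and year (hypothetical data)
long_df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Wide format: one row per city, one column per year
wide = long_df.pivot_table(index="city", columns="year",
                           values="sales", aggfunc="sum")
print(wide)

# Move "city" back into a regular column and drop the "year" label
# that pivot_table leaves on the columns axis
flat = wide.reset_index().rename_axis(None, axis=1)
print(list(flat.columns))   # ['city', 2020, 2021]

# pivot() reshapes without aggregating, so each city/year pair must be unique
wide_no_agg = long_df.pivot(index="city", columns="year", values="sales")
print(wide_no_agg.loc["Cork", 2020])   # 7
```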
Melting dataframes using the .melt() method
- pd.melt(): transforms wide to long. Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
- id_vars: the columns to keep as identifiers, with their values duplicated across rows as needed.
- value_vars: the columns to be melted, with their names going into one new column and their values into another. If value_vars is not specified, then all columns that are not included in id_vars will be used as value_vars.
- var_name: optional name for the new column that holds the melted column names.
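A small sketch of melting a wide dataframe into long format. The data and the column names y2020 and y2021 are made up, and value_name (not covered in the notes above) names the column that receives the values.

```python
import pandas as pd

# Wide format: one column per year (hypothetical data)
wide_df = pd.DataFrame({
    "city": ["Dublin", "Cork"],
    "y2020": [10, 7],
    "y2021": [12, 9],
})

long_df = wide_df.melt(
    id_vars="city",                 # kept as an identifier column
    value_vars=["y2020", "y2021"],  # columns to unpivot
    var_name="year",                # new column holding the old column names
    value_name="sales",             # new column holding the values
)

print(long_df)
print(long_df.shape)                # (4, 3)
print(long_df["sales"].tolist())    # [10, 7, 12, 9]
```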
Group by
The groupby method allows you to group rows of data together and call aggregation functions on the grouped rows.
- by: takes a column name or a list of the columns you want to group by.
Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
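The split, apply and combine steps can be sketched as follows, again with made-up city and sales data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Split the rows by city, sum the sales in each group, combine into a Series
totals = df.groupby(by="city")["sales"].sum()
print(totals["Cork"])     # 16
print(totals["Dublin"])   # 22

# Several aggregations at once give one column per function
summary = df.groupby(by=["city"])["sales"].agg(["mean", "max"])
print(summary)
```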