Python for Data Analysis

By Angela C

October 1, 2021 in Python numpy

Some notes on various Python libraries used for data analytics.

Some basic Python

Variable assignment uses the = operator.
Calculations can be performed with variables.
Data types include strings (str), integers (int), floating-point numbers (float) and booleans (bool).
Many Python libraries including pandas for data analysis, numpy for scientific computing, matplotlib and seaborn for 2-d plotting, scikit-learn for machine learning, plotly for interactive visualisations, dash plotly for dashboards and many more.
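
As a quick illustration of the basics above, here is a minimal sketch. The variable names and values are made up for the example.

```python
# Assign values to variables with the = operator
price = 4.5        # float
quantity = 3       # int
item = "coffee"    # str
in_stock = True    # bool

# Calculations can be performed with variables
total = price * quantity

print(total)                  # 13.5
print(type(item).__name__)    # str
print(type(in_stock).__name__)  # bool
```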

Strings in Python

There are various methods for working with strings including .upper(), .lower() and .title(), .replace(), .count(), .strip() etc.
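
A short sketch of some of these string methods, using a made-up example string. Strings are immutable, so each method returns a new string rather than changing the original.

```python
s = "  python for data analysis  "

cleaned = s.strip()                      # remove leading and trailing whitespace
print(cleaned.upper())                   # PYTHON FOR DATA ANALYSIS
print(cleaned.title())                   # Python For Data Analysis
print(cleaned.replace("data", "text"))   # python for text analysis
print(cleaned.count("a"))                # 4
```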

See the post on Strings in the blog section, and also the post on String partitioning.

Lists

Subsets

mylist[1] select item at index 1
mylist[-2] select second last item

Slices

mylist[1:4] select items from index 1 up to (but not including) index 4
mylist[:3] select items before index 3
mylist[:] a copy of the list

Lists of lists can be subset

mylist[0][2] select the list at index 0, then from that select the item at index 2
mylist[1][:3] select the list at index 1, then from that select the items before index 3
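
The subsetting and slicing rules above can be sketched as follows. The lists here are made up for illustration, and the name nested is used instead of list to avoid shadowing Python's built-in list type.

```python
mylist = ['a', 'b', 'c', 'd', 'e']

print(mylist[1])     # b
print(mylist[-2])    # d
print(mylist[1:4])   # ['b', 'c', 'd']
print(mylist[:3])    # ['a', 'b', 'c']

# A full slice creates a new (shallow) copy of the list
list_copy = mylist[:]

# Subsetting a list of lists: first pick the inner list, then index into it
nested = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(nested[0][2])    # 3
print(nested[1][:3])   # [5, 6, 7]
```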

List Operations

list1 + list2 to add (concatenate) lists together
list1*3 to repeat a list 3 times
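
A small sketch of these two operations with made-up lists. Both create a new list and leave the originals unchanged.

```python
list1 = [1, 2]
list2 = [3, 4]

combined = list1 + list2   # concatenation
repeated = list1 * 3       # repetition

print(combined)   # [1, 2, 3, 4]
print(repeated)   # [1, 2, 1, 2, 1, 2]
```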

List methods

Lists are mutable. If you don’t want to change a list, work on a copy instead (for example mylist[:] or mylist.copy()). Apart from index() and count(), which only return a value, the operations below change the original list in place.

mylist.index('a') to get the index of the first occurrence of an item in the list
mylist.count('a') to count occurrences of an item in the list
mylist.append('z') to append an item to the end of a list
mylist.extend(['x', 'y']) to append each item of another list (or any iterable) to the end of the list
mylist.remove('z') to remove an item from a list (first occurrence)
del(mylist[0:2]) to remove the items before index 2
mylist.reverse() to reverse a list
mylist.pop() to remove and return the last item from the list
mylist.pop(-2) to remove and return the second last item
mylist.insert(1, 'x') to insert an item at index 1. The index must be provided
mylist.sort() to sort a list
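
Here is a minimal sketch that walks through several of these list methods on a made-up list, with the state of the list shown in the comments after each step.

```python
mylist = ['c', 'a', 'b', 'a']

print(mylist.index('a'))   # 1 (index of the first 'a')
print(mylist.count('a'))   # 2

mylist.append('z')          # ['c', 'a', 'b', 'a', 'z']
mylist.extend(['x', 'y'])   # ['c', 'a', 'b', 'a', 'z', 'x', 'y']
mylist.remove('a')          # ['c', 'b', 'a', 'z', 'x', 'y']
last = mylist.pop()         # returns 'y', list is now ['c', 'b', 'a', 'z', 'x']
mylist.insert(1, 'q')       # ['c', 'q', 'b', 'a', 'z', 'x']
mylist.sort()               # ['a', 'b', 'c', 'q', 'x', 'z']
mylist.reverse()            # ['z', 'x', 'q', 'c', 'b', 'a']

print(last)     # y
print(mylist)   # ['z', 'x', 'q', 'c', 'b', 'a']
```

To keep the original list unchanged, the built-in sorted(mylist) returns a new sorted list instead of sorting in place.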

Some Python packages

Numpy

Numpy: A Python library for creating and manipulating vectors and matrices. It is the core library for scientific computing in Python. Numpy provides high-performance multi-dimensional array objects and the tools for working with these arrays.

Numpy is usually imported using the alias np.

import numpy as np

Arrays can be created using np.array()

a = np.array([1,3,5,7])
b = np.array([[3,6,8,9], [2,5,7,9]])
c = np.array([[1.3, 3.5], [3, 5.4]], dtype=float)

Placeholders can be used:

np.zeros to create an array of zeros
np.ones to create an array of ones
np.arange() to create an array with start, stop and optional step parameters.

np.arange(2, 20, 2)

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])

np.linspace() to create an array of evenly spaced values

np.linspace(0,5, 10)

array([0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
       2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ])

np.full() to create a constant array

np.full((2,2),8)

array([[8, 8],
       [8, 8]])

np.eye() to create an identity matrix

np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

np.random.random() to create an array of random values.

See post on Numpy Random project.

np.random.random((2,3))

array([[0.6447133 , 0.30962173, 0.87798529],
       [0.01284644, 0.03490406, 0.84898446]])

np.empty() to create an array of uninitialized (arbitrary) data of the given shape, dtype, and order.

np.empty((2,4))

array([[3., 6., 8., 9.],
       [2., 5., 7., 9.]])

Inspecting arrays:

a.shape for array dimensions,
len(a) for the length of the array (the size of its first dimension)
a.ndim for the number of array dimensions
a.size for the number of elements
a.dtype for the data type of the elements
a.dtype.name for the name of the data type.
a.astype(int) to convert to another data type.
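
A short sketch of inspecting an array, reusing the 2-d array b created earlier. Note that shape, ndim, size and dtype are attributes, so they are accessed without parentheses, while astype() is a method.

```python
import numpy as np

b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

print(b.shape)   # (2, 4): 2 rows, 4 columns
print(len(b))    # 2: the size of the first dimension
print(b.ndim)    # 2
print(b.size)    # 8

# astype() returns a new array with the requested data type
c = b.astype(float)
print(c.dtype.name)   # float64
```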

Numpy Data Types:

np.int64 signed 64-bit integer
np.float64 standard double precision floating point (np.float32 is single precision)
np.complex128 complex numbers represented by two 64-bit floats
np.bool_ for boolean True and False values
np.object_ for Python object type
np.bytes_ for fixed-length byte string type (named np.string_ before NumPy 2.0)
np.str_ for fixed-length unicode type (named np.unicode_ before NumPy 2.0)

Operations for performing maths on arrays:

np.add(a,b) same as a + b
np.subtract(a,b) same as a - b
np.divide(a,b) same as a / b
np.multiply(a,b) same as a * b
np.exp(a) for exponentials
np.sqrt(b) for square roots
np.cos(a), np.sin(a) etc for element wise cosines, sines etc
np.log(a) element wise natural logarithm
a.dot(b) for dot product of two arrays
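
A minimal sketch of these element-wise operations on two small made-up arrays (the subtraction function is spelled np.subtract):

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

total = np.add(a, b)          # same as a + b, values 2, 6, 12
diff = np.subtract(a, b)      # same as a - b, values 0, 2, 6
ratio = np.divide(a, b)       # same as a / b, values 1, 2, 3
product = np.multiply(a, b)   # same as a * b, values 1, 8, 27
roots = np.sqrt(a)            # element-wise square roots, values 1, 2, 3
dot = a.dot(b)                # 1*1 + 4*2 + 9*3 = 36

print(total, diff, ratio, product, roots, dot)
```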

Comparison operators

a == b element-wise comparison, results in an array of Trues and Falses
a > 2 element-wise comparison
np.array_equal(a,b) array-wise comparison
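
A small sketch of the difference between element-wise and array-wise comparison, using made-up arrays. The element-wise test uses the == operator (a single = is assignment), and the array-wise function is np.array_equal.

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([1, 2, 5, 8])

equal_mask = a == b           # element-wise: True, False, True, False
greater_mask = a > 2          # element-wise: False, True, True, True
same = np.array_equal(a, b)   # array-wise: a single True or False

print(equal_mask)
print(greater_mask)
print(same)   # False
```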

Aggregate functions

Array-wise aggregation examples:

a.sum()
a.min()

The axis can be specified:

a.min(axis=0) minimum value of each column (reducing along the rows)
a.max(axis=1) maximum value of each row (reducing along the columns)
a.cumsum() for cumulative sum
a.mean() or np.mean(a) for mean
a.std() or np.std(a) for standard deviation
np.median(a) for median
np.corrcoef(a) for correlation coefficients
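
The aggregations above can be sketched on the 2-d array b from earlier. In a 2-d array, axis=0 reduces down the rows (one result per column) and axis=1 reduces across the columns (one result per row).

```python
import numpy as np

b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

print(b.sum())         # 49
print(b.min())         # 2
print(b.min(axis=0))   # one minimum per column: 2, 5, 7, 9
print(b.max(axis=1))   # one maximum per row: 9, 9
print(b.cumsum())      # running total over the flattened array, ending at 49
print(b.mean())        # 6.125
print(np.median(b))    # 6.5
```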

Copying arrays

np.copy(a) to copy an array or a.copy() to create a deep copy
a.view() to create a view of the array with same data
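
A minimal sketch of the difference between a copy and a view: a view shares the original array's data, so changes to the original show up in the view, while a copy has its own data.

```python
import numpy as np

a = np.array([1, 3, 5, 7])

v = a.view()   # new array object looking at the same data
c = a.copy()   # independent copy of the data

a[0] = 100     # modify the original

print(v[0])    # 100: the view sees the change
print(c[0])    # 1: the copy does not
```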

Sorting arrays

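a.sort() to sort an array in place (np.sort(a) returns a sorted copy instead)
a.sort(axis=0) to sort along the first axis (down each column)
a.sort(axis=1) to sort along the second axis (across each row)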

Subsetting, slicing, indexing

Subsetting and slicing using [].

a[:3] to select elements up to index 3 (at index 0, 1 and 2)
a[1,2] to select the element at row 1, column 2. This is the same as a[1][2]
a[:,1] to select all rows in column 1
a[::-1] to reverse an array
a[a>2] boolean indexing, to select the elements greater than 2
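
A short sketch of these indexing patterns, using the 1-d array a and the 2-d array b created earlier:

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

first_three = a[:3]    # items at index 0, 1 and 2: 1, 3, 5
reversed_a = a[::-1]   # 7, 5, 3, 1
over_two = a[a > 2]    # boolean indexing: 3, 5, 7

element = b[1, 2]      # row 1, column 2: 7 (same as b[1][2])
column = b[:, 1]       # all rows, column 1: 6, 5

print(first_three, reversed_a, over_two, element, column)
```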

Array Manipulation

Transposing an array

np.transpose(a) same as a.T

Changing array shape

a.ravel() to flatten an array
a.reshape(2,3) to reshape an array with 6 elements to a 2 by 3. If the size of one dimension is unknown, pass -1 for it and NumPy will infer it from the number of elements, for example a.reshape(2,-1).
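
A minimal sketch of reshaping, transposing and flattening, using a made-up array of 6 elements:

```python
import numpy as np

x = np.arange(6)             # 0, 1, 2, 3, 4, 5

m = x.reshape(2, 3)          # 2 rows, 3 columns
inferred = x.reshape(2, -1)  # -1 is inferred as 3
t = m.T                      # transpose: 3 rows, 2 columns
flat = m.ravel()             # back to one dimension

print(m.shape)          # (2, 3)
print(inferred.shape)   # (2, 3)
print(t.shape)          # (3, 2)
print(flat.shape)       # (6,)
```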

Adding or removing elements

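a.resize((2,3)) to change the shape of an array in place. np.resize(a, (2,3)) returns a new array with the specified shape
np.append(a,b) to append items to an array (returns a new array)
np.insert(a, 1, 5) to insert items into an array (here the value 5 before index 1)
np.delete(a, [1]) to delete items from an array (here the item at index 1)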
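
A small sketch of these functions on a made-up array. The np.append, np.insert, np.delete and np.resize functions all return new arrays and leave the original unchanged. When the new shape is larger than the original, np.resize fills it by repeating the original values.

```python
import numpy as np

a = np.array([1, 3, 5, 7])

appended = np.append(a, [9, 11])   # 1, 3, 5, 7, 9, 11
inserted = np.insert(a, 1, 5)      # 1, 5, 3, 5, 7
deleted = np.delete(a, [1])        # 1, 5, 7
resized = np.resize(a, (2, 3))     # [[1, 3, 5], [7, 1, 3]]

print(appended, inserted, deleted, resized, sep="\n")
print(a)   # the original is unchanged
```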

Combining arrays

np.concatenate((a,b), axis=0) to join arrays along an existing axis
np.vstack((a,b)) to stack arrays vertically (row-wise)
np.hstack((a,b)) to stack arrays horizontally (column wise)
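
A minimal sketch of combining two made-up 1-d arrays. For 1-d inputs, np.hstack gives the same result as np.concatenate, while np.vstack stacks them as rows of a 2-d array.

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([2, 4, 6, 8])

joined = np.concatenate((a, b), axis=0)   # 8 elements in one dimension
stacked_v = np.vstack((a, b))             # shape (2, 4): each array becomes a row
stacked_h = np.hstack((a, b))             # shape (8,)

print(joined)
print(stacked_v.shape)   # (2, 4)
print(stacked_h.shape)   # (8,)
```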

Splitting arrays

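np.hsplit(a, 2) to split an array horizontally (column-wise) into 2 equal parts. Use np.hsplit(a, [2]) to split at index 2
np.vsplit(a, 2) to split an array vertically (row-wise) into 2 equal parts. Use np.vsplit(a, [2]) to split at index 2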
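
A short sketch of splitting the 2-d array b from earlier. An integer second argument means "this many equal parts", while a list means "split at these indices".

```python
import numpy as np

b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

# Integer: split into 2 equal parts
left, right = np.hsplit(b, 2)    # two arrays of shape (2, 2)
top, bottom = np.vsplit(b, 2)    # two arrays of shape (1, 4)

# List: split at the given column index
first_col, rest = np.hsplit(b, [1])   # shapes (2, 1) and (2, 3)

print(left, right, sep="\n")
print(top, bottom, sep="\n")
print(first_col.shape, rest.shape)
```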

Pandas - for data wrangling

In a long format dataframe, each row holds a single observation: one value together with the identifiers that describe it. In a wide format dataframe, the values of a categorical variable are spread out as separate columns.

`pd.pivot_table()` and `pd.pivot()`

pd.pivot_table(): To transform a long-format dataframe to wide format. Create a spreadsheet-style pivot table as a DataFrame.
pd.pivot_table() is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
index: the variables to remain untouched
columns: the variables to be spread across more columns
values: the numerical values to be aggregated or processed

The output of pivot_table() can have a multi-index on its rows or columns. This can be transformed to a regular index using the reset_index() and rename_axis() methods.

Some columns might be better represented as column names instead of values, and pivot() reshapes them this way. Its output can be flattened with the same reset_index() and rename_axis() methods.

.rename_axis() sets the name of the axis for the index or columns.

df.pivot : Pivot without aggregation that can handle non-numeric data

pd.pivot to pivot a dataframe spreading rows into columns
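
A minimal sketch of going from long to wide format. The dataframe, its column names (city, year, sales) and the choice of aggfunc="sum" are made up for illustration.

```python
import pandas as pd

# Long format: one row per (city, year) observation
long_df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Wide format: one row per city, one column per year
wide = pd.pivot_table(long_df, index="city", columns="year",
                      values="sales", aggfunc="sum")

# Turn the index back into a regular column and drop the "year" axis name
flat = wide.reset_index().rename_axis(None, axis=1)

# pivot() does the same reshaping without aggregating
wide_no_agg = long_df.pivot(index="city", columns="year", values="sales")

print(wide)
print(flat)
```

Because pivot() does not aggregate, it raises an error if an index/column pair appears more than once, whereas pivot_table() combines the duplicates using aggfunc.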

Melting dataframes using the `.melt` method

pd.melt() to transform wide to long.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
id_vars are values to keep as rows, duplicated as needed.
value_vars are columns to be taken and made as values, melted into a new column. If the value_vars is not specified, then all columns that are not included in id_vars will be used as value_vars.
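var_name is optional and sets the name of the new column that holds the former column names. value_name does the same for the new values column.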
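
A minimal sketch of melting a wide dataframe into long format. The dataframe and the column names (city, year, sales) are made up for illustration.

```python
import pandas as pd

# Wide format: one column per year
wide_df = pd.DataFrame({
    "city": ["Dublin", "Cork"],
    "2020": [10, 7],
    "2021": [12, 9],
})

# Keep "city" as an identifier and melt the year columns into rows
long_df = pd.melt(wide_df, id_vars="city", value_vars=["2020", "2021"],
                  var_name="year", value_name="sales")

print(long_df)
print(long_df.shape)   # (4, 3)
```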

Group by

The groupby method allows you to group rows of data together and call aggregation functions on the grouped rows. The 'by' parameter takes a column name, or a list of the columns you want to group by.

Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups
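
A minimal sketch of the split, apply and combine steps with groupby. The dataframe and its column names (city, year, sales) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Split the rows by city, then sum the sales within each group
totals = df.groupby(by=["city"])["sales"].sum()

# Several aggregations can be applied at once with agg()
summary = df.groupby("city")["sales"].agg(["mean", "max"])

print(totals)    # Cork 16, Dublin 22
print(summary)
```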