String partitioning a column of a pandas dataframe
By Angela C
September 15, 2021
Reading time: 1 minutes.
I have a dataframe of URLs to datasets available from the PxStat database of the Central Statistics Office (CSO). The URLs follow the same format. The dataset code can be extracted using string partitioning.
pandas str.partition
method splits a string into three parts using the given separator. This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
This is especially useful when splitting a URL into parts.
For example I have a dataframe containing URLs to datasets from the CSO’s PxStat database and want to create a file name for the CSV dataset retrieved from each URL.
Each URL follows the same format:
“https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/DHA09/CSV/1.0/en"
df['url'].str.partition('/CSV/')
will split the string into 3 parts, the part before ‘/CSV/’ in position 0, ‘/CSV/’ in position 1 and the part after ‘/CSV/’ in position 2.
“https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/DHA09", “/CSV/” and “1.0/en”
I want to extract the 5-character dataset code, the “DHA09” part of the URL. Each of the three parts can be retrieving using indexing (from 0 to 2)
To further split the first part of the URL, call str.partition
again.
df['url'].str.partition('/CSV/')[0]\ .str.partition('https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/')
I can create a new column containing the dataset code in the dataframe with the results. This is then useful for automating the process with a function of reading in the CSV files and saving them locally with the file name based on the dataset code.
df[‘dataset_code’] = df[‘url’].str.partition('/CSV/')[0]
.str.partition(‘https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/')`