Introduction to Pandas

Aug 08, 2018

Even with NumPy, manipulating data can still be a pain. For example, the rows and columns of a NumPy array can't be labelled. Pandas has many goodies, but one of the most important is that it lets us represent data as a labelled table (not unlike how we would see it in Excel). This makes everything a lot more intuitive.

We first import the libraries. NumPy is imported as well, since we will usually use NumPy functions alongside Pandas.

import numpy as np
import pandas as pd

Series

We start with series before moving on to dataframes. A dataframe is akin to a table, and a series is akin to a single column in that table.

Series will be used a lot when we do time series analysis.

To create a series:

s1 = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
s1

Output

a    0.041447
b    0.275973
c    0.469297
d    0.897424
dtype: float64

To access a specific element in the series:

s1['c']

Output

0.46929705786676734
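As a small sketch of the other ways a series can be indexed, the example below uses fixed values instead of random ones (an assumption made only so the results are reproducible). The names `s`, `by_label`, `by_position`, `subset` and `label_slice` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Fixed values (an assumption for reproducibility) with the same labels as s1.
s = pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])

by_label = s['c']          # a single label returns a scalar
by_position = s.iloc[2]    # position 2 is the third element
subset = s[['a', 'd']]     # a list of labels returns a smaller series
label_slice = s['b':'c']   # label slices include both endpoints

print(by_label, by_position)
print(subset)
print(label_slice)
```

Note that slicing by label includes the end label, unlike ordinary Python slicing by position.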

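We can also create a series from a dict. Here the value for 'b' is the string '2' rather than a number, so the values are of mixed types and the resulting dtype is object.

s2 = pd.Series({'a': 1, 'b': '2', 'c': 3})
s2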

Output

a    1
b    2
c    3
dtype: object
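If the string was unintentional, one possible way to get a numeric series back is `pd.to_numeric`. This is only a sketch; the names `mixed` and `numeric` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Same values as s2: the string '2' forces the generic object dtype.
mixed = pd.Series({'a': 1, 'b': '2', 'c': 3})
print(mixed.dtype)  # object

# pd.to_numeric parses values that look like numbers and returns a numeric series.
numeric = pd.to_numeric(mixed)
print(numeric.dtype)
print(numeric.sum())
```

With a numeric dtype, arithmetic such as `sum()` works on all the values.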

To check whether any entries are null:

pd.isnull(s2)

Output

a    False
b    False
c    False
dtype: bool
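Since s2 has no missing entries, the check above returns False everywhere. The following sketch uses a hypothetical series `s` with one missing value (`np.nan`) to show what a positive result looks like, along with two common ways of handling it.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series({'a': 1.0, 'b': np.nan, 'c': 3.0})

mask = s.isnull()            # True where the value is missing
n_missing = int(mask.sum())  # True counts as 1, so the sum is the number of missing entries

cleaned = s.dropna()         # drop the missing entries
filled = s.fillna(0.0)       # or replace them with a chosen value

print(mask)
print(n_missing)
```

Both `dropna` and `fillna` return new series and leave `s` unchanged.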

A series can also be created from a single value, which is repeated for every label in the index:

s3 = pd.Series(100, index=['x', 'y', 'z'])
s3

Output

x    100
y    100
z    100
dtype: int64

We can also create a series from a NumPy array:

a = np.array([2,3,4])
p = pd.Series(a, index=['a', 'b', 'c'])
p

Output

a    2
b    3
c    4
dtype: int64

Extending what we do for series to a dataframe is easy.

data = {'Age': [1, 2, 3, 4, 5],
        'Income': ['H', 'M', 'L', 'M', 'L'],
        'Gender': ['M', 'F', 'M', 'F', 'M']}
profile = pd.DataFrame(data)
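To see what was built, a quick sketch of inspecting the dataframe is shown below. It repeats the same `data` dict so that it runs on its own.

```python
import pandas as pd

data = {'Age': [1, 2, 3, 4, 5],
        'Income': ['H', 'M', 'L', 'M', 'L'],
        'Gender': ['M', 'F', 'M', 'F', 'M']}
profile = pd.DataFrame(data)

# Each dict key becomes a column; rows get a default integer index 0..4.
print(profile.shape)           # (5, 3)
print(list(profile.columns))   # ['Age', 'Income', 'Gender']
print(list(profile.index))     # [0, 1, 2, 3, 4]
print(profile.dtypes)          # one dtype per column
```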

We can also set the column order and the index labels explicitly:

profile_2 = pd.DataFrame(data, columns=['Income', 'Age', 'Gender'], index=['A','B','C','D','E'])

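A dataframe can also be built from a list of rows, with the column names passed separately:

cohort = pd.DataFrame([['John', 20, 'Dentist'],
                       ['Peter', 30, 'Doctor'],
                       ['Dirk', 40, 'Teacher']],
                      columns=['Name', 'Age', 'Job'])
cohort

Selecting a single column returns a series:

cohort['Name']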

Output

0     John
1    Peter
2     Dirk
Name: Name, dtype: object
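As a short sketch of related selections: a list of column names returns a dataframe rather than a series, and a boolean condition selects rows. The variable names `names`, `subset` and `over_25` are illustrative only.

```python
import pandas as pd

cohort = pd.DataFrame([['John', 20, 'Dentist'],
                       ['Peter', 30, 'Doctor'],
                       ['Dirk', 40, 'Teacher']],
                      columns=['Name', 'Age', 'Job'])

names = cohort['Name']               # one column name -> Series
subset = cohort[['Name', 'Job']]     # list of column names -> DataFrame
over_25 = cohort[cohort['Age'] > 25] # boolean mask -> rows where the condition holds

print(type(names).__name__, type(subset).__name__)
print(over_25)
```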

Elements can be accessed with the following indexers:

loc works on labels in the index.
iloc works on the positions in the index (so it only takes integers).
ix tried to behave like loc and fell back to behaving like iloc if the label was not in the index. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead.

cohort.loc[1]

Output

Name     Peter
Age         30
Job     Doctor
Name: 1, dtype: object

cohort.iloc[2, 1]

Output

40
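With the default integer index, labels and positions coincide, so the difference between loc and iloc is easy to miss. The sketch below uses a hypothetical variant, `by_name`, that uses the names as row labels to make the difference visible.

```python
import pandas as pd

cohort = pd.DataFrame([['John', 20, 'Dentist'],
                       ['Peter', 30, 'Doctor'],
                       ['Dirk', 40, 'Teacher']],
                      columns=['Name', 'Age', 'Job'])

# Hypothetical variant of cohort that uses the names as row labels.
by_name = cohort.set_index('Name')

age_by_label = by_name.loc['Peter', 'Age']   # row label, column label
age_by_position = by_name.iloc[1, 0]         # second row, first remaining column (Age)

label_rows = by_name.loc['John':'Peter']     # label slice includes 'Peter'
position_rows = by_name.iloc[0:1]            # position slice excludes position 1

print(age_by_label, age_by_position)
print(label_rows)
print(position_rows)
```

The two slices look similar but return two rows and one row respectively, because loc includes the end label while iloc follows the usual half-open Python convention.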

Pandas also allows you to save the table directly to a CSV file.

cohort.to_csv('cohort.csv')

And read from a CSV file.

cohort2 = pd.read_csv('cohort.csv')
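One detail worth knowing: by default `to_csv` also writes the index as an unnamed first column, so reading the file back gives an extra column called `Unnamed: 0`. The sketch below shows two possible ways around this. It writes to a temporary directory (an assumption, to avoid leaving files behind), and the names `with_extra`, `restored` and `plain` are illustrative.

```python
import os
import tempfile

import pandas as pd

cohort = pd.DataFrame([['John', 20, 'Dentist'],
                       ['Peter', 30, 'Doctor'],
                       ['Dirk', 40, 'Teacher']],
                      columns=['Name', 'Age', 'Job'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cohort.csv')

    # Default behaviour: the index is written as an unnamed first column.
    cohort.to_csv(path)
    with_extra = pd.read_csv(path)

    # Option 1: tell read_csv that the first column is the index.
    restored = pd.read_csv(path, index_col=0)

    # Option 2: do not write the index in the first place.
    cohort.to_csv(path, index=False)
    plain = pd.read_csv(path)

print(list(with_extra.columns))
print(list(restored.columns))
print(list(plain.columns))
```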

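Going back to the series s2, reindex returns a new series aligned to the given labels. Labels that are not in the original index (here 'e') are filled with fill_value.

s2.reindex(['a', 'e', 'b'], fill_value='MISSING')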

Output

a          1
e    MISSING
b          2
dtype: object
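As a sketch of what happens without `fill_value`, the example below uses a hypothetical all-integer series `s`. Missing labels become NaN by default, and the original series is left unchanged.

```python
import pandas as pd

# Hypothetical integer series, similar to s2 but without the string value.
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Without fill_value, the new label 'e' gets NaN, which makes the result float.
default_fill = s.reindex(['a', 'e', 'b'])

# With a numeric fill_value, the integer dtype can be kept.
zero_fill = s.reindex(['a', 'e', 'b'], fill_value=0)

print(default_fill)
print(zero_fill)
print(list(s.index))  # reindex returned new series; s still has a, b, c
```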

To access the first 2 rows:

cohort.head(2)

Or the last 2 rows:

cohort.tail(2)

The Jupyter notebook with the code, and some other useful tips is here
