Introduction to Pandas
With NumPy, manipulating data can sometimes still be a pain. For example, the rows and columns of a NumPy matrix can’t be labelled. The Pandas library has many goodies, but one of the most important is that it lets us represent data in a table format (not unlike how we would see it in Excel). This makes everything a lot more intuitive.
We first import the libraries. NumPy is also imported, as we will usually use NumPy functions alongside Pandas.
import numpy as np
import pandas as pd
Series
We first start with series before we go to dataframes. A dataframe is basically akin to a table, and a series akin to a column in the table.
Series will be used a lot when we do time series analysis.
To create a series.
s1 = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
s1
Output
a 0.041447
b 0.275973
c 0.469297
d 0.897424
dtype: float64
To access a specific element in the series.
s1['c']
Output
0.46929705786676734
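As a minimal sketch of the lookup above, the snippet below uses fixed values instead of random ones (an assumption made only so the results are reproducible) and shows label-based access next to position-based access with `iloc`.

```python
import pandas as pd

# Fixed values instead of np.random.rand so the results are reproducible.
s = pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])

by_label = s['c']         # look up by index label
by_position = s.iloc[2]   # look up by position (third element)
subset = s[['a', 'd']]    # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```

Both lookups return the same element here because the label 'c' sits at position 2.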
We can also create a Series from a dict.
s2 = pd.Series({'a':1, 'b':'2', 'c':3})
s2
Output
a 1
b 2
c 3
dtype: object
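The dtype above is object because the dict mixes integers with the string '2'. The sketch below contrasts that with an all-integer dict. The names `mixed`, `numeric` and `converted` are illustrative only, and `pd.to_numeric` is shown as one possible way to turn the values into numbers.

```python
import pandas as pd

mixed = pd.Series({'a': 1, 'b': '2', 'c': 3})    # '2' is a string
numeric = pd.Series({'a': 1, 'b': 2, 'c': 3})    # all integers

print(mixed.dtype)     # object
print(numeric.dtype)   # int64

# pd.to_numeric parses the string '2' into a number.
converted = pd.to_numeric(mixed)
print(converted)
```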
To check if there are any entries that are null.
pd.isnull(s2)
Output
a False
b False
c False
dtype: bool
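Since s2 has no missing entries, every value above is False. As a small sketch, the hypothetical series below contains missing values so that the mask has something to flag.

```python
import numpy as np
import pandas as pd

# A hypothetical series with two missing entries (NaN and None).
s = pd.Series({'a': 1.0, 'b': np.nan, 'c': None})

mask = pd.isnull(s)           # True where the entry is missing
n_missing = int(mask.sum())   # each True counts as 1

print(mask)
print(n_missing)
```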
To create a series from a scalar, which is repeated for every label in the index.
s3 = pd.Series(100, index=['x', 'y', 'z'])
s3
Output
x 100
y 100
z 100
dtype: int64
Creating a series from an array
a = np.array([2,3,4])
p = pd.Series(a, index=['a', 'b', 'c'])
p
Output
a 2
b 3
c 4
dtype: int64
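A short sketch of how the array and the index relate: the labels and values can be read back from the series, and the index needs exactly one label per array element. The variable names `labels`, `values` and `mismatch_raised` are illustrative only.

```python
import numpy as np
import pandas as pd

a = np.array([2, 3, 4])
p = pd.Series(a, index=['a', 'b', 'c'])

labels = list(p.index)   # the index labels
values = p.to_numpy()    # the data, back as a NumPy array

# An index with the wrong number of labels raises ValueError.
try:
    pd.Series(a, index=['a', 'b'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(labels, values, mismatch_raised)
```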
Extending what we do for series to a dataframe is easy.
data = {'Age': [1, 2, 3, 4, 5],
        'Income': ['H', 'M', 'L', 'M', 'L'],
        'Gender': ['M', 'F', 'M', 'F', 'M']}
profile = pd.DataFrame(data)
Setting columns and indices
profile_2 = pd.DataFrame(data, columns=['Income', 'Age', 'Gender'], index=['A', 'B', 'C', 'D', 'E'])
cohort = pd.DataFrame([['John', 20, 'Dentist'],
                       ['Peter', 30, 'Doctor'],
                       ['Dirk', 40, 'Teacher']],
                      columns=['Name', 'Age', 'Job'])
cohort
cohort['Name']
Output
0 John
1 Peter
2 Dirk
Name: Name, dtype: object
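The output above is a series, which matches the earlier remark that a series is akin to a column in the table. As a minimal sketch, selecting one column gives a Series, while selecting a list of columns gives a smaller DataFrame. The names `names` and `subset` are illustrative only.

```python
import pandas as pd

cohort = pd.DataFrame([['John', 20, 'Dentist'],
                       ['Peter', 30, 'Doctor'],
                       ['Dirk', 40, 'Teacher']],
                      columns=['Name', 'Age', 'Job'])

names = cohort['Name']             # one column -> Series
subset = cohort[['Name', 'Age']]   # list of columns -> DataFrame

print(type(names).__name__)    # Series
print(type(subset).__name__)   # DataFrame
print(subset.shape)            # (3, 2)
```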
Elements can be accessed in the following ways:
loc works on labels in the index.
iloc works on the positions in the index (so it only takes integers).
ix used to behave like loc and fall back to behaving like iloc if the label was not in the index. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead.
cohort.loc[1]
Output
Name Peter
Age 30
Job Doctor
Name: 1, dtype: object
cohort.iloc[2, 1]
Output
40
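With the default integer index of cohort, labels and positions happen to coincide, so the difference between loc and iloc is easy to miss. The sketch below rebuilds the letter-indexed profile_2 table from earlier to make it visible. The result variable names are illustrative only.

```python
import pandas as pd

data = {'Age': [1, 2, 3, 4, 5],
        'Income': ['H', 'M', 'L', 'M', 'L'],
        'Gender': ['M', 'F', 'M', 'F', 'M']}
profile_2 = pd.DataFrame(data, columns=['Income', 'Age', 'Gender'],
                         index=['A', 'B', 'C', 'D', 'E'])

by_label = profile_2.loc['C', 'Age']   # row label 'C', column label 'Age'
by_position = profile_2.iloc[2, 1]     # third row, second column

label_slice = profile_2.loc['B':'D']   # label slices include the end label
pos_slice = profile_2.iloc[1:3]        # position slices exclude the end

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Both single-element lookups return the same cell, but the two slices differ in length because label slices are inclusive at both ends.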
Pandas also allows you to save the table directly to CSV.
cohort.to_csv('cohort.csv')
And read from a CSV file.
cohort2 = pd.read_csv('cohort.csv')
To reindex a series, filling any labels that are missing with a default value.
s2.reindex(['a', 'e', 'b'], fill_value='MISSING')
Output
a 1
e MISSING
b 2
dtype: object
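A small sketch of what reindex does with a label that is not in the original series: with fill_value it gets that value, and without it pandas uses NaN. The names `filled` and `default` are illustrative only.

```python
import pandas as pd

s2 = pd.Series({'a': 1, 'b': '2', 'c': 3})

filled = s2.reindex(['a', 'e', 'b'], fill_value='MISSING')
default = s2.reindex(['a', 'e', 'b'])   # no fill_value -> NaN for 'e'

print(filled['e'])               # MISSING
print(pd.isnull(default['e']))   # True
print(list(s2.index))            # the original is unchanged
```

Note that reindex returns a new series rather than modifying s2 in place.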
To access the first 2 rows
cohort.head(2)
Or the last 2 rows
cohort.tail(2)
The Jupyter notebook with the code, and some other useful tips, is here


