Introduction to Pandas - Python Data Analysis Library#

Pandas is an external library for data analysis in Python.

It provides two basic data structures:

  • Series : a one-dimensional labeled array holding any type of data, even mixed data types.

  • DataFrame : a two-dimensional data structure that holds labeled data of any type.

In the following notebook, you will see some of the many functionalities of Pandas. This is not a data science course, so we recommend the Pandas documentation for an overview of all the features.

Read more in the Pandas documentation

# First we need to import the pandas library as "pd"
import pandas as pd

# Numpy is still a very useful library, so we will also import that
import numpy as np

The Pandas Series#

# Creating a series directly from numbers (all the same type),
# note the "dtype: int64" at the end of the printed output
a_series = pd.Series([1, 2, 5, 6, 8])
print("A series")
print(a_series)
A series
0    1
1    2
2    5
3    6
4    8
dtype: int64
# Creating a series directly from data with mixed types,
# note the "dtype: object" at the end of the printed output
a_series = pd.Series([1, "mixed data", 5.5, np.nan, 8])
print("A series")
print(a_series)
A series
0             1
1    mixed data
2           5.5
3           NaN
4             8
dtype: object
# Creating a series using numpy functions 
a_series = pd.Series(np.sin(np.linspace(1,6,10)))
print("A series")
print(a_series)
A series
0    0.841471
1    0.999884
2    0.857547
3    0.457273
4   -0.080542
5   -0.594131
6   -0.929015
7   -0.984464
8   -0.743802
9   -0.279415
dtype: float64

Functions to inspect the data#

Pandas offers a lot of functions to inspect or run data analysis directly on the Series or DataFrame (which we will see later in the notebook)
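
Besides head() and describe(), a Series offers several other quick inspection helpers. A minimal sketch (not run in the original notebook):

# A small sketch of a few more inspection helpers
a_series = pd.Series(np.linspace(0, 10, 11))
print(a_series.tail(3))   # the last n elements (default n=5)
print(a_series.shape)     # the number of elements as a tuple, here (11,)
print(a_series.dtype)     # the data type of the values, here float64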

# Creating a series using numpy functions 
a_series = pd.Series(np.linspace(0,10,11))
print("Print first 'n' elements using head(n=5). If no argument is given, default is 5 ")
print(a_series.head())

print()
print("describe() shows the basic statistics of the data")
print("but describe only gives numerical statistics, if all numbers are numeric")
a_series.describe()
head(n) prints the first 'n' elements. If no argument is given, the default is n=5
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

describe() shows the basic statistics of the data
but it only gives these numerical statistics if all the values are numeric
count    11.000000
mean      5.000000
std       3.316625
min       0.000000
25%       2.500000
50%       5.000000
75%       7.500000
max      10.000000
dtype: float64
# You can also describe mixed data, or a series of text
the_hobbit_beginning = pd.Series(["In","a","hole","in","the","ground","there","lived","a","hobbit.","Not","a","nasty,","dirty,","wet","hole,","filled","with","the","ends","of","worms","and","an","oozy","smell,","nor","yet","a","dry,","bare,","sandy","hole","with","nothing","in","it","to","sit","down","on","or","to","eat:","it","was","a","hobbit-hole,","and","that","means","comfort"])
the_hobbit_beginning.describe()
count     52
unique    41
top        a
freq       5
dtype: object
# or you can count words:
the_hobbit_beginning.value_counts()
a               5
hole            2
in              2
the             2
and             2
it              2
with            2
to              2
In              1
sit             1
bare,           1
sandy           1
nothing         1
on              1
down            1
yet             1
or              1
eat:            1
was             1
hobbit-hole,    1
that            1
means           1
dry,            1
an              1
nor             1
dirty,          1
ground          1
there           1
lived           1
hobbit.         1
Not             1
nasty,          1
wet             1
smell,          1
hole,           1
filled          1
ends            1
of              1
worms           1
oozy            1
comfort         1
Name: count, dtype: int64
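
The counts above treat "hole" and "hole," as different words. As a small sketch (not part of the original notebook), the string methods under .str can normalise the text before counting, here by lower-casing and stripping punctuation from the ends of each word:

# A sketch: normalise case and strip leading/trailing punctuation before counting
cleaned = the_hobbit_beginning.str.lower().str.strip(".,:")
print(cleaned.value_counts().head())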

Pandas DataFrame : The workhorse of Pandas#

A Pandas DataFrame is a two-dimensional, labeled data structure that, again, allows mixed data types. It offers a lot of functions to describe, sort, search and analyse data. One advantage of the DataFrame is that the columns are named, just as we know them from an Excel spreadsheet.

You can read more about DataFrames here
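
The cell below builds a DataFrame from a dictionary of arrays. As an alternative, sketched here with made-up names and values, a DataFrame can also be created from a list of records (one dictionary per row):

# A sketch: creating a DataFrame from a list of records (one dict per row)
# the "name"/"score" fields are purely illustrative
records = [
    {"name": "Alice", "score": 9.5},
    {"name": "Bob", "score": 7.0},
]
small_df = pd.DataFrame(records)
print(small_df)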

# Let us start by creating a dataframe manually. 
x = np.linspace(0, 10, 11)
squares = x*x
cos = np.cos(x)
sin = np.sin(x)

# Collect all the data in a dictionary (the keys become the column names)
data_dict = {"x" : x, "squares" : squares, "cos"  : cos, "sin" : sin}

# and create the dataframe
df = pd.DataFrame(data_dict)

# Let us see the values we just added
df.head()
     x  squares       cos       sin
0  0.0      0.0  1.000000  0.000000
1  1.0      1.0  0.540302  0.841471
2  2.0      4.0 -0.416147  0.909297
3  3.0      9.0 -0.989992  0.141120
4  4.0     16.0 -0.653644 -0.756802

CSV dataset#

Let us take a look at a dataset from Kaggle: Used car sales data

In the following cells, we will investigate the “cars_df” dataframe, so remember to run the next cell before skipping ahead.

# Start by reading the csv file (note: pandas' read_csv is much faster than NumPy's text readers)
cars_df = pd.read_csv("cars.csv")

# Let us investigate the data a bit
cars_df.head()
Car_ID Brand Model Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats Price
0 1 Toyota Corolla 2018 50000 Petrol Manual First 15 1498 108 5 800000
1 2 Honda Civic 2019 40000 Petrol Automatic Second 17 1597 140 5 1000000
2 3 Ford Mustang 2017 20000 Petrol Automatic First 10 4951 395 4 2500000
3 4 Maruti Swift 2020 30000 Diesel Manual Third 23 1248 74 5 600000
4 5 Hyundai Sonata 2016 60000 Diesel Automatic Second 18 1999 194 5 850000
# We can also inspect what data types our dataframe holds
cars_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Car_ID             100 non-null    int64 
 1   Brand              100 non-null    object
 2   Model              100 non-null    object
 3   Year               100 non-null    int64 
 4   Kilometers_Driven  100 non-null    int64 
 5   Fuel_Type          100 non-null    object
 6   Transmission       100 non-null    object
 7   Owner_Type         100 non-null    object
 8   Mileage            100 non-null    int64 
 9   Engine             100 non-null    int64 
 10  Power              100 non-null    int64 
 11  Seats              100 non-null    int64 
 12  Price              100 non-null    int64 
dtypes: int64(8), object(5)
memory usage: 10.3+ KB
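
info() already lists the non-null counts per column. As a small sketch, isna().sum() gives the number of missing values per column directly, and shape gives the overall dimensions:

# A sketch: count missing values per column and check the overall shape
print(cars_df.isna().sum())
print(cars_df.shape)   # (rows, columns), here (100, 13)
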
# Let us start by checking what "Brand"s there are in this dataset

# We can index a column directly as an attribute of the dataframe, here the "Brand"
# and then use the unique() method
cars_df.Brand.unique()
array(['Toyota', 'Honda', 'Ford', 'Maruti', 'Hyundai', 'Tata', 'Mahindra',
       'Volkswagen', 'Audi', 'BMW', 'Mercedes'], dtype=object)
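
Besides listing the unique brands, it is often useful to see how many cars there are of each brand. A minimal sketch using value_counts() and nunique():

# A sketch: how many cars of each brand, and how many distinct brands there are
print(cars_df.Brand.value_counts())
print(cars_df.Brand.nunique())   # the number of distinct brands, here 11
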
# If we then want to find the price range for each of these brands,
# we can group the rows by Brand, extract the "Price" column and describe it statistically
cars_df.groupby("Brand")["Price"].describe()
count mean std min 25% 50% 75% max
Brand
Audi 10.0 2.570000e+06 516505.351161 1900000.0 2050000.0 2600000.0 3000000.0 3200000.0
BMW 10.0 3.030000e+06 302030.167735 2700000.0 2800000.0 2900000.0 3200000.0 3500000.0
Ford 11.0 1.468182e+06 881269.745104 550000.0 675000.0 1500000.0 2250000.0 2700000.0
Honda 6.0 8.083333e+05 120069.424362 650000.0 750000.0 800000.0 850000.0 1000000.0
Hyundai 11.0 7.136364e+05 173336.247062 450000.0 550000.0 800000.0 850000.0 850000.0
Mahindra 5.0 9.400000e+05 250998.007960 700000.0 700000.0 900000.0 1200000.0 1200000.0
Maruti 6.0 7.083333e+05 80104.098938 600000.0 700000.0 700000.0 700000.0 850000.0
Mercedes 10.0 2.880000e+06 625033.332444 2300000.0 2425000.0 2700000.0 2900000.0 4000000.0
Tata 11.0 7.954545e+05 402774.468813 500000.0 500000.0 600000.0 1025000.0 1600000.0
Toyota 10.0 1.490000e+06 678560.568000 650000.0 950000.0 1400000.0 1800000.0 2500000.0
Volkswagen 10.0 1.115000e+06 559290.224799 500000.0 650000.0 1125000.0 1600000.0 1800000.0
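
If only a few specific statistics are needed instead of the full describe() table, groupby() can also be combined with agg(). A small sketch:

# A sketch: pick out specific statistics per brand instead of the full describe()
cars_df.groupby("Brand")["Price"].agg(["min", "median", "max"])
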
# Now we can try to create a correlation matrix, that is, a matrix that indicates whether or not
# variables are correlated. Intuitively, we would expect a negative correlation between how old a car is
# and what sales value it has, meaning the higher the age, the lower the price.

# Read more at e.g. : https://builtin.com/data-science/correlation-matrix 
# Remember we have mixed data types, so we must provide the "numeric_only=True" argument 
cars_df.corr(numeric_only=True)
Car_ID Year Kilometers_Driven Mileage Engine Power Seats Price
Car_ID 1.000000 0.059904 -0.227442 0.026140 -0.027965 0.027637 -0.021582 0.037105
Year 0.059904 1.000000 -0.741176 0.213177 -0.355122 -0.249446 -0.252598 -0.232687
Kilometers_Driven -0.227442 -0.741176 1.000000 -0.104437 0.112340 -0.026732 0.396443 -0.051104
Mileage 0.026140 0.213177 -0.104437 1.000000 -0.680949 -0.648894 -0.194581 -0.595252
Engine -0.027965 -0.355122 0.112340 -0.680949 1.000000 0.805709 0.179179 0.714465
Power 0.027637 -0.249446 -0.026732 -0.648894 0.805709 1.000000 -0.102867 0.856620
Seats -0.021582 -0.252598 0.396443 -0.194581 0.179179 -0.102867 1.000000 -0.000027
Price 0.037105 -0.232687 -0.051104 -0.595252 0.714465 0.856620 -0.000027 1.000000

As seen from the matrix above, there is a clear positive correlation between Engine/Power and Price, and a negative correlation for Year. But Year is the absolute purchase year, not the age. So let us add a column for age, where “now” is 2024

# There are two ways to do this: you can a) provide a named function or b) give a "lambda" expression.
# We will go with the lambda. axis=1 means apply the function to each row

# column "age" is now the result of each row's year being subtracted from 2024
cars_df['age'] = cars_df.apply(lambda row: 2024 - row.Year, axis=1)

# And now let us run the same correlation matrix
cars_df.corr(numeric_only=True)
Car_ID Year Kilometers_Driven Mileage Engine Power Seats Price age
Car_ID 1.000000 0.059904 -0.227442 0.026140 -0.027965 0.027637 -0.021582 0.037105 -0.059904
Year 0.059904 1.000000 -0.741176 0.213177 -0.355122 -0.249446 -0.252598 -0.232687 -1.000000
Kilometers_Driven -0.227442 -0.741176 1.000000 -0.104437 0.112340 -0.026732 0.396443 -0.051104 0.741176
Mileage 0.026140 0.213177 -0.104437 1.000000 -0.680949 -0.648894 -0.194581 -0.595252 -0.213177
Engine -0.027965 -0.355122 0.112340 -0.680949 1.000000 0.805709 0.179179 0.714465 0.355122
Power 0.027637 -0.249446 -0.026732 -0.648894 0.805709 1.000000 -0.102867 0.856620 0.249446
Seats -0.021582 -0.252598 0.396443 -0.194581 0.179179 -0.102867 1.000000 -0.000027 0.252598
Price 0.037105 -0.232687 -0.051104 -0.595252 0.714465 0.856620 -0.000027 1.000000 0.232687
age -0.059904 -1.000000 0.741176 -0.213177 0.355122 0.249446 0.252598 0.232687 1.000000
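
A side note on the apply/lambda approach above: because the calculation is simple arithmetic, a vectorised expression produces the same "age" column and is typically faster. A minimal sketch:

# A sketch: the same "age" column without apply(), using vectorised arithmetic
cars_df["age"] = 2024 - cars_df["Year"]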

In this dataset, age does not directly impact the price. But age has a strong correlation with kilometers driven, which makes sense, and a small negative correlation with Mileage, which also makes sense, as newer cars tend to have better mileage.

Sorry - not a data science course, but still fun! :-)

Querying data from a dataframe#

Let us look into some of the querying methods of Pandas DataFrames, still using the cars dataset. Here querying refers to selecting a subset of the dataframe.

# A subset of all the cars could be to find all "Fords"
ford_df = cars_df[cars_df.Brand == "Ford"]
ford_df.head()
Car_ID Brand Model Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats Price age
2 3 Ford Mustang 2017 20000 Petrol Automatic First 10 4951 395 4 2500000 7
11 12 Ford Endeavour 2017 35000 Diesel Automatic Second 12 2198 158 7 2000000 7
21 22 Ford Figo 2020 15000 Petrol Manual Third 18 1194 94 5 550000 4
30 31 Ford Aspire 2019 26000 Petrol Manual Third 20 1194 94 5 600000 5
40 41 Ford Ranger 2017 38000 Diesel Manual Second 12 2198 158 5 1500000 7
# or let us only see cars sold before 2020
older_cars_df = cars_df[cars_df.Year < 2020]
older_cars_df.head()
Car_ID Brand Model Year Kilometers_Driven Fuel_Type Transmission Owner_Type Mileage Engine Power Seats Price age
0 1 Toyota Corolla 2018 50000 Petrol Manual First 15 1498 108 5 800000 6
1 2 Honda Civic 2019 40000 Petrol Automatic Second 17 1597 140 5 1000000 5
2 3 Ford Mustang 2017 20000 Petrol Automatic First 10 4951 395 4 2500000 7
4 5 Hyundai Sonata 2016 60000 Diesel Automatic Second 18 1999 194 5 850000 8
5 6 Tata Nexon 2019 35000 Petrol Manual First 17 1198 108 5 750000 5
# It is also possible to extract specific columns from the dataset
# Note the double brackets [[ ]] when selecting multiple columns; a single column only needs ["Model"]
cars_df[["Model", "Price"]]
            Model    Price
0         Corolla   800000
1           Civic  1000000
2         Mustang  2500000
3           Swift   600000
4          Sonata   850000
..            ...      ...
95        C-Class  2900000
96  Innova Crysta  1400000
97       EcoSport   750000
98          Verna   850000
99         Altroz   600000

100 rows × 2 columns
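
Conditions can also be combined with & (and) and | (or), each wrapped in parentheses, and the same kind of selection can be written with the query() method. A small sketch:

# A sketch: combining conditions with & (and); note the parentheses around each condition
diesel_fords = cars_df[(cars_df.Brand == "Ford") & (cars_df.Fuel_Type == "Diesel")]
print(diesel_fords[["Model", "Year", "Price"]])

# the same style of selection written with the query() method
older_fords = cars_df.query('Brand == "Ford" and Year < 2020')
print(older_fords[["Model", "Year", "Price"]])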

Plotting data from dataframes#

Let us finally see how we can plot data directly from dataframes.

Note that we are using matplotlib as the plotting library in this case. Pandas has its own slim built-in plotting interface. However, the following example with multiple labels in a scatter plot is not straightforward in pure Pandas.
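
For a quick single-colour plot, though, the built-in plotting is enough. A minimal sketch (it uses matplotlib under the hood):

# A sketch: a quick scatter plot using pandas' built-in plotting
cars_df.plot.scatter(x="Kilometers_Driven", y="Price", figsize=(12, 6))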

import matplotlib.pyplot as plt
# We want to plot the kilometers driven on the X-axis and the Price on the Y-axis grouped by each Brand (color)

# Create the plot and axis
fig, ax = plt.subplots(figsize=(12,6))

# Loop over the groups of data
for name, group in cars_df.groupby('Brand'):
    ax.scatter(group.Kilometers_Driven, group.Price, label=name)

# Add legend and labels
plt.legend()
plt.xlabel("Kilometers driven")
plt.ylabel("Price")
Text(0, 0.5, 'Price')
[Figure: scatter plot of Price vs. kilometers driven, coloured by Brand]

Time series data : The weather in Stockholm#

Pandas has a lot of neat functions to work with time series data. Below we will revisit the weather data from the matplotlib workbook.

# Importing os to help with paths
import os
# First we need to read in the data. But since we have a 'date' field, we can tell pandas to treat it as such.
# The first column is a date:
date_cols = ['date']

# Using date_cols, we tell pandas to convert those columns into datetimes via the 'parse_dates' argument
weather_df = pd.read_csv(os.path.join("..","matplotlib", "stockholm_daily_mean_avg_std_temperature.csv"), parse_dates=date_cols)

# Let us investigate the data a bit  - and we can see that the date field is now a datetime
weather_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45132 entries, 0 to 45131
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        45132 non-null  datetime64[ns]
 1   mean        45132 non-null  float64       
 2   avg_toy     45132 non-null  float64       
 3   stddev_toy  45132 non-null  float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 1.4 MB
# Let's extract 2000 to 2010
# 'date' is the column name
sub_df = weather_df[weather_df['date'].between('2000-01-01', '2010-01-01')]
sub_df.head()
            date  mean  avg_toy  stddev_toy
36524 2000-01-01  -2.3    -2.23        4.96
36525 2000-01-02   1.3    -1.78        4.57
36526 2000-01-03   0.8    -1.93        4.55
36527 2000-01-04   3.5    -2.35        4.83
36528 2000-01-05  -0.6    -2.78        5.00
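
Since the date column is now a datetime, the .dt accessor offers another way to make the same kind of selection. A minimal sketch picking out the years 2000 to 2009 directly:

# A sketch: selecting by year using the .dt accessor on the datetime column
sub_df_alt = weather_df[weather_df['date'].dt.year.between(2000, 2009)]
print(sub_df_alt.head())
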
# Now let's plot the mean temperature
plt.plot(sub_df['date'], sub_df['mean'])
plt.xlabel("Date")
plt.ylabel("Mean temperature [C]")
Text(0, 0.5, 'Mean temperature [C]')
[Figure: mean daily temperature in Stockholm, 2000-2010]
# We can also create a rolling mean of e.g. 7 days.

# each row in the data frame is one day, so that is just a rolling mean of 7 rows:
sub_df.loc[:,'weekly_mean'] = sub_df['mean'].rolling(7).mean()

# Now let's see what data we have
sub_df.info()

# and plot it using the mean
plt.plot(sub_df['date'], sub_df['weekly_mean'])
plt.xlabel("Date")
plt.ylabel("Mean temperature [C]")
<class 'pandas.core.frame.DataFrame'>
Index: 3654 entries, 36524 to 40177
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         3654 non-null   datetime64[ns]
 1   mean         3654 non-null   float64       
 2   avg_toy      3654 non-null   float64       
 3   stddev_toy   3654 non-null   float64       
 4   weekly_mean  3648 non-null   float64       
dtypes: datetime64[ns](1), float64(4)
memory usage: 171.3 KB
/var/folders/0t/hldwbyfj7ls81tng6gt60p7r0000gn/T/ipykernel_98543/1679289622.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df.loc[:,'weekly_mean'] = sub_df['mean'].rolling(7).mean()
Text(0, 0.5, 'Mean temperature [C]')
[Figure: 7-day rolling mean temperature in Stockholm, 2000-2010]
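
The SettingWithCopyWarning above appears because sub_df is a slice of weather_df, so Pandas cannot tell whether the assignment is meant to modify the original dataframe. A common way to avoid it, sketched below, is to take an explicit copy when creating the subset:

# A sketch: taking an explicit copy of the slice avoids the SettingWithCopyWarning
sub_df = weather_df[weather_df['date'].between('2000-01-01', '2010-01-01')].copy()
sub_df['weekly_mean'] = sub_df['mean'].rolling(7).mean()
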
# Imagine if we instead want to see the yearly mean of all the data
weather_df.loc[:,'yearly_mean'] = weather_df['mean'].rolling(365).mean()
weather_df.loc[:,'ten_year_mean'] = weather_df['mean'].rolling(3650).mean()

plt.plot(weather_df['date'], weather_df['yearly_mean'], 'orange', label="Yearly running average")
plt.plot(weather_df['date'], weather_df['ten_year_mean'], 'b', label="Ten year running average")
plt.xlabel("Date")
plt.ylabel("Mean temperature [C]")
plt.legend(loc="upper left")
<matplotlib.legend.Legend at 0x7f77bb66edc0>
[Figure: yearly and ten-year running average temperature in Stockholm]
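
The rolling windows above use a fixed number of rows (365 and 3650 days) as an approximation of a year and a decade. As a final sketch, resample() can compute calendar-exact yearly means when the date column is used as the index:

# A sketch: calendar-year mean temperatures using resample() on a datetime index
yearly_mean = weather_df.set_index('date')['mean'].resample('YS').mean()

plt.plot(yearly_mean.index, yearly_mean.values, label="Calendar-year mean")
plt.xlabel("Date")
plt.ylabel("Mean temperature [C]")
plt.legend(loc="upper left")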