Pandas

Pandas is the Python library for tabular data manipulation. It’s built on top of NumPy and provides a structure called a DataFrame, a table with rows and columns where column names act as labels. Think of a DataFrame as a spreadsheet living inside a Python program.

The conventional import:

import pandas as pd

Reading data into a DataFrame is one function call per format:

df = pd.read_csv("my_data.csv")           # CSV
df = pd.read_json("my_data.json")         # JSON
df = pd.read_excel("my_data.xlsx")        # Excel
df = pd.read_sql(query, connection)       # SQL database
df = pd.read_hdf("my_data.h5", "key")     # HDF5
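To try a reader without any file on disk, here is a minimal sketch that feeds pd.read_csv an in-memory buffer. The CSV text and its Name/Height columns are made up for illustration and stand in for "my_data.csv".

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "my_data.csv"
csv_text = "Name,Height\nAda,5.5\nGrace,6.0\nLinus,5.9\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # dtype inferred for each column
```

The other readers follow the same pattern, though some need extra packages installed (for example an Excel engine for read_excel, or PyTables for read_hdf).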

Three patterns cover most of what we do with DataFrames in practice:

Column access by name. Returns a one-dimensional Pandas Series:

df['Name']

Position-based indexing with .iloc. NumPy-style slicing on numeric row and column positions:

df.iloc[0:2, 0]          # rows 0 and 1, column 0
df.iloc[:, -1]           # all rows, last column

Conditional filtering with .loc and a Boolean expression:

df.loc[df['Height'] > 5.8, :]    # rows where Height > 5.8, all columns

The expression df['Height'] > 5.8 is evaluated element-wise and produces a Boolean Series with one True/False value per row; .loc keeps the rows where it’s True.
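The three access patterns can be tried side by side on a small table. This is a sketch on invented data: the names and heights are assumptions, chosen only so that the Height > 5.8 filter keeps some rows and drops others.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "Name": ["Ada", "Grace", "Linus"],
    "Height": [5.5, 6.0, 5.9],
})

# 1. Column access by name: a one-dimensional Series
names = df["Name"]

# 2. Position-based indexing with .iloc
first_two = df.iloc[0:2, 0]   # rows 0 and 1, column 0
last_col = df.iloc[:, -1]     # all rows, last column (Height)

# 3. Conditional filtering with .loc and a Boolean Series
mask = df["Height"] > 5.8     # one True/False per row
tall = df.loc[mask, :]        # rows where the mask is True

print(mask.tolist())
print(tall)
```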

Pandas also provides the rolling method for windowed computations (moving averages, rolling features), and fillna, interpolate, and dropna for handling missing data. The pd.merge() and pd.concat() functions handle table joins and concatenation.
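A short sketch of these helpers on toy data. The readings and sensors tables, their column names, and the window size of 2 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with one missing value
readings = pd.DataFrame({
    "sensor_id": [1, 1, 1, 1],
    "value": [1.0, np.nan, 3.0, 4.0],
})

# Missing-data handling
filled = readings["value"].fillna(0.0)           # replace NaN with a constant
interpolated = readings["value"].interpolate()   # linear interpolation by default
dropped = readings.dropna()                      # drop rows containing NaN

# Windowed computation: moving average over 2 consecutive values.
# The first entry is NaN because its window is incomplete.
moving_avg = interpolated.rolling(window=2).mean()

# Join: attach a location to each reading via the shared sensor_id column
sensors = pd.DataFrame({"sensor_id": [1], "location": ["roof"]})
joined = pd.merge(readings, sensors, on="sensor_id", how="left")

# Concatenation: stack two tables vertically and renumber the index
stacked = pd.concat([readings, readings], ignore_index=True)

print(moving_avg.tolist())
print(joined)
```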

Pandas pairs naturally with Matplotlib (DataFrames have a .plot() method that wraps Matplotlib calls), scikit-learn (most sklearn estimators accept DataFrames directly), and NumPy (columns are backed by NumPy arrays by default, though recent pandas versions also support PyArrow-backed dtypes). It’s the lingua franca for tabular data in the Python data-science stack.
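The NumPy side of that pairing can be seen directly. A minimal sketch, assuming a default NumPy-backed float column; the Height values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"Height": [5.5, 6.0, 5.9]})

# Pull the column out as a plain NumPy array
heights = df["Height"].to_numpy()
print(type(heights), heights.dtype)

# NumPy functions also accept a Series directly
mean_height = np.mean(df["Height"])
print(mean_height)
```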

Idriss Rami — Notes
