In [1]:
import pandas as pd
# From a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
df
Out[1]:
|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
1.2 Reading Data from Files¶
Pandas is commonly used to read CSV, Excel, or JSON files. The calls in the next cell are commented out because they expect data files on disk; a self-contained round-trip sketch follows the cell.
In [2]:
# Example (uncomment if you have a file)
# df_csv = pd.read_csv('data.csv')
# df_excel = pd.read_excel('data.xlsx')
# df_json = pd.read_json('data.json')
pass
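If you don't have a file on hand, one way to try these readers is to write the example DataFrame out first and read it back. A minimal sketch, assuming the df from the first cell and write access to the working directory (the file name sample_data.csv is arbitrary):

# Write the example DataFrame to CSV, then read it back
df.to_csv("sample_data.csv", index=False)   # index=False skips writing the row index
df_csv = pd.read_csv("sample_data.csv")     # load it into a new DataFrame
print(df_csv)

read_excel() and read_json() work the same way once the corresponding files exist (reading .xlsx files typically also needs an engine such as openpyxl installed).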
1.3 Basic Inspection¶
Methods for quickly assessing your DataFrame’s shape and contents.
In [3]:
print(df.head()) # First few rows
print(df.tail()) # Last few rows
print(df.info()) # Data types and null counts
print(df.describe()) # Statistical summary
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Age     3 non-null      int64
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
None
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
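Alongside these methods, a couple of quick attribute checks are often useful; a small optional addition using the same df:

print(df.shape)    # (rows, columns) -> (3, 3) here
print(df.dtypes)   # dtype of each column
print(df.columns)  # the column labels

2.1 Indexing and Selection¶
Use loc for label-based indexing and iloc for integer position-based indexing, as shown in the next cell.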
In [4]:
# Label-based indexing
print("Label-based indexing:")
print(df.loc[0, "Name"])
print()
# Integer-based indexing
print("Integer-based indexing:")
print(df.iloc[1, 2])
Label-based indexing:
Alice

Integer-based indexing:
Los Angeles
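loc and iloc are not the only selection tools; boolean masks are another common way to pick rows. A minimal sketch on the same df:

# Boolean mask: keep only rows where Age is greater than 28
older = df[df["Age"] > 28]
print(older)   # Bob and Charlie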
2.2 Merging and Joining¶
You can combine DataFrames in various ways using merge(), join(), or concat(). The cell below demonstrates merge(); a brief concat() sketch follows its output.
In [5]:
data_extra = {
    "Name": ["Alice", "Bob"],
    "Salary": [70000, 80000]
}
df_extra = pd.DataFrame(data_extra)
merged_df = pd.merge(df, df_extra, on="Name", how="left")
merged_df
Out[5]:
|   | Name | Age | City | Salary |
|---|---|---|---|---|
| 0 | Alice | 25 | New York | 70000.0 |
| 1 | Bob | 30 | Los Angeles | 80000.0 |
| 2 | Charlie | 35 | Chicago | NaN |
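The cell above only exercises merge(). As a rough sketch of concat(), which stacks DataFrames along an axis (the df_more row below is invented for illustration):

# Stack rows from two DataFrames that share the same columns
df_more = pd.DataFrame({
    "Name": ["Diana"],
    "Age": [28],
    "City": ["Houston"]
})
stacked = pd.concat([df, df_more], ignore_index=True)  # 4 rows, index reset to 0..3
print(stacked)
# DataFrame.join() is similar to merge() but aligns on the index by default.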
2.3 GroupBy and Aggregation¶
Group data by categories and apply aggregate functions like sum, mean, or count. The cell below sums sales per product; a sketch using several aggregates at once follows its output.
In [6]:
# Example data
df_sales = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B"],
    "Sales": [100, 150, 200, 120, 180],
    "Region": ["North", "South", "North", "South", "North"]
})
grouped = df_sales.groupby("Product").agg({"Sales": "sum"})
grouped
Out[6]:
| Product | Sales |
|---|---|
| A | 250 |
| B | 500 |
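agg() also accepts several functions at once. A small sketch grouping the same df_sales by Region instead:

# Total, average, and count of Sales per Region
by_region = df_sales.groupby("Region").agg({"Sales": ["sum", "mean", "count"]})
print(by_region)
# North: sum 480, mean 160, count 3; South: sum 270, mean 135, count 2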
2.4 Handling Missing Data¶
Missing data is common in real datasets. Pandas provides methods like dropna() and fillna(); a sketch of filling with per-column means follows the example below.
In [7]:
df_missing = pd.DataFrame({
    "Col1": [1, None, 3],
    "Col2": [None, 5, 6]
})
print(df_missing)
df_dropped = df_missing.dropna()
print("\nAfter dropna:\n", df_dropped)
df_filled = df_missing.fillna(0)
print("\nAfter fillna(0):\n", df_filled)
   Col1  Col2
0   1.0   NaN
1   NaN   5.0
2   3.0   6.0

After dropna:
    Col1  Col2
2   3.0   6.0

After fillna(0):
    Col1  Col2
0   1.0   0.0
1   0.0   5.0
2   3.0   6.0
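Instead of a constant, you can also fill with a per-column statistic. A minimal sketch on the same df_missing:

# Fill each column's missing values with that column's mean
df_mean_filled = df_missing.fillna(df_missing.mean())
print(df_mean_filled)   # Col1's NaN becomes 2.0, Col2's NaN becomes 5.5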
3. Exercises ¶
Exercise 1: Data Cleaning¶
- Create a DataFrame with columns Name, Age, City, and some missing values.
- Drop rows with missing values.
- Fill missing values in Age with the mean age.
In [8]:
# Your code here
import numpy as np

# 1) Create a DataFrame with some genuinely missing values (np.nan / None,
#    not the string "NaN", which pandas would treat as ordinary text)
df_ex = pd.DataFrame({
    "Name": ["Tom", "Jane", "Steve", np.nan],
    "Age": [25, None, 30, 22],
    "City": ["Boston", None, "Seattle", None]
})
# 2) Drop rows with missing values
# 3) Fill missing Age values with the mean age
Exercise 2: GroupBy and Aggregation¶
Using the df_sales DataFrame shown earlier (or create your own):

- Group by Region.
- Calculate the average sales per region.
- Print the results.
In [9]:
# Your code here
Exercise 3: Merging DataFrames¶
- Create two DataFrames df1 and df2 with a common column (e.g., id).
- Perform a left merge on id.
- Perform an inner merge on id.
In [10]:
# Your code here
4. Real-World Applications ¶
ETL (Extract, Transform, Load)¶
- Data scientists use pandas to extract data from various sources (databases, APIs, files), transform it (cleaning, feature engineering), and load it into analytics tools.
Exploratory Data Analysis (EDA)¶
- Pandas is essential for quick EDA: summarizing datasets, detecting outliers, and so on; a small outlier-check sketch follows.
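For example, a quick quantile-based outlier check takes only a few lines. The numbers below are made up, and the 1.5 × IQR threshold is just the usual rule of thumb:

# Flag values outside 1.5 * IQR as potential outliers
s = pd.Series([10, 12, 11, 13, 95])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)   # flags 95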
Time-Series Analysis¶
- Pandas offers specialized support for time-series data, making it popular in finance and IoT data processing; a rough resampling sketch follows.
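A rough illustration of that support, using invented daily values resampled to monthly means:

# Build a small daily series and resample it to monthly averages
rng = pd.date_range("2023-01-01", periods=60, freq="D")
ts = pd.Series(range(60), index=rng)
monthly = ts.resample("MS").mean()   # "MS" = month-start buckets
print(monthly)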
These are just a few examples—pandas is central to nearly every data-related task in Python!