In [1]:
import pandas as pd
# From a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
df
Out[1]:
|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
1.2 Reading Data from Files¶
Pandas is commonly used to read CSV, Excel, or JSON files. The calls in the next cell are commented out because they expect data files on disk; a self-contained round-trip sketch follows the cell.
In [2]:
# Example (uncomment if you have a file)
# df_csv = pd.read_csv('data.csv')
# df_excel = pd.read_excel('data.xlsx')
# df_json = pd.read_json('data.json')
pass
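If you don't have a file on hand, one way to try these readers is to write the example DataFrame out first and read it back. A minimal sketch, assuming the df from the first cell and write access to the working directory (the file name sample_data.csv is arbitrary):

# Write the example DataFrame to CSV, then read it back
df.to_csv("sample_data.csv", index=False)   # index=False skips writing the row index
df_csv = pd.read_csv("sample_data.csv")     # load it into a new DataFrame
print(df_csv)

read_excel() and read_json() work the same way once the corresponding files exist (reading .xlsx files typically also needs an engine such as openpyxl installed).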
1.3 Basic Inspection¶
Methods for quickly assessing your DataFrame’s shape and contents.
In [3]:
print(df.head()) # First few rows
print(df.tail()) # Last few rows
print(df.info()) # Data types and null counts
print(df.describe()) # Statistical summary
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Age     3 non-null      int64
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
None
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
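Alongside these methods, a couple of quick attribute checks are often useful; a small optional addition using the same df:

print(df.shape)    # (rows, columns) -> (3, 3) here
print(df.dtypes)   # dtype of each column
print(df.columns)  # the column labels

2.1 Indexing and Selection¶
Use loc for label-based indexing and iloc for integer position-based indexing, as shown in the next cell.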
In [4]:
# Label-based indexing
print("Label-based indexing:")
print(df.loc[0, "Name"])
print()
# Integer-based indexing
print("Integer-based indexing:")
print(df.iloc[1, 2])
Label-based indexing:
Alice

Integer-based indexing:
Los Angeles
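loc and iloc are not the only selection tools; boolean masks are another common way to pick rows. A minimal sketch on the same df:

# Boolean mask: keep only rows where Age is greater than 28
older = df[df["Age"] > 28]
print(older)   # Bob and Charlie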
2.2 Merging and Joining¶
You can combine DataFrames in various ways using merge(), join(), or concat(). The cell below demonstrates merge(); a brief concat() sketch follows its output.
In [5]:
data_extra = {
    "Name": ["Alice", "Bob"],
    "Salary": [70000, 80000]
}
df_extra = pd.DataFrame(data_extra)
merged_df = pd.merge(df, df_extra, on="Name", how="left")
merged_df
Out[5]:
|   | Name | Age | City | Salary |
|---|---|---|---|---|
| 0 | Alice | 25 | New York | 70000.0 |
| 1 | Bob | 30 | Los Angeles | 80000.0 |
| 2 | Charlie | 35 | Chicago | NaN |
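The cell above only exercises merge(). As a rough sketch of concat(), which stacks DataFrames along an axis (the df_more row below is invented for illustration):

# Stack rows from two DataFrames that share the same columns
df_more = pd.DataFrame({
    "Name": ["Diana"],
    "Age": [28],
    "City": ["Houston"]
})
stacked = pd.concat([df, df_more], ignore_index=True)  # 4 rows, index reset to 0..3
print(stacked)
# DataFrame.join() is similar to merge() but aligns on the index by default.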
2.3 GroupBy and Aggregation¶
Group data by categories and apply aggregate functions like sum, mean, or count. The cell below sums sales per product; a sketch using several aggregates at once follows its output.
In [6]:
# Example data
df_sales = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B"],
    "Sales": [100, 150, 200, 120, 180],
    "Region": ["North", "South", "North", "South", "North"]
})
grouped = df_sales.groupby("Product").agg({"Sales": "sum"})
grouped
Out[6]:
| Product | Sales |
|---|---|
| A | 250 |
| B | 500 |
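agg() also accepts several functions at once. A small sketch grouping the same df_sales by Region instead:

# Total, average, and count of Sales per Region
by_region = df_sales.groupby("Region").agg({"Sales": ["sum", "mean", "count"]})
print(by_region)
# North: sum 480, mean 160, count 3; South: sum 270, mean 135, count 2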
2.4 Handling Missing Data¶
Missing data is common in real datasets. Pandas provides methods like dropna() and fillna(); a sketch of filling with per-column means follows the example below.
In [7]:
df_missing = pd.DataFrame({
    "Col1": [1, None, 3],
    "Col2": [None, 5, 6]
})
print(df_missing)
df_dropped = df_missing.dropna()
print("\nAfter dropna:\n", df_dropped)
df_filled = df_missing.fillna(0)
print("\nAfter fillna(0):\n", df_filled)
   Col1  Col2
0   1.0   NaN
1   NaN   5.0
2   3.0   6.0

After dropna:
    Col1  Col2
2   3.0   6.0

After fillna(0):
    Col1  Col2
0   1.0   0.0
1   0.0   5.0
2   3.0   6.0
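Instead of a constant, you can also fill with a per-column statistic. A minimal sketch on the same df_missing:

# Fill each column's missing values with that column's mean
df_mean_filled = df_missing.fillna(df_missing.mean())
print(df_mean_filled)   # Col1's NaN becomes 2.0, Col2's NaN becomes 5.5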
3. Exercises ¶
Exercise 1: Data Cleaning¶
- Create a DataFrame with columns Name, Age, City, and some missing values.
- Drop rows with missing values.
- Fill missing values in Age with the mean age.
In [8]:
# Your code here
import numpy as np

# 1) Create a DataFrame with some genuinely missing values (np.nan / None,
#    not the string "NaN", which pandas would treat as ordinary text)
df_ex = pd.DataFrame({
    "Name": ["Tom", "Jane", "Steve", np.nan],
    "Age": [25, None, 30, 22],
    "City": ["Boston", None, "Seattle", None]
})
# 2) Drop rows with missing values
# 3) Fill missing Age values with the mean age
Exercise 2: GroupBy and Aggregation¶
Using the df_sales DataFrame shown earlier (or create your own):

- Group by Region.
- Calculate the average sales per region.
- Print the results.
In [9]:
# Your code here
Exercise 3: Merging DataFrames¶
- Create two DataFrames df1 and df2 with a common column (e.g., id).
- Perform a left merge on id.
- Perform an inner merge on id.
In [10]:
# Your code here
4. Real-World Applications ¶
ETL (Extract, Transform, Load)¶
- Data scientists use pandas to extract data from various sources (databases, APIs, files), transform it (cleaning, feature engineering), and load it into analytics tools.
Exploratory Data Analysis (EDA)¶
- Pandas is essential for quick EDA: summarizing datasets, detecting outliers, and so on; a small outlier-check sketch follows.
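For example, a quick quantile-based outlier check takes only a few lines. The numbers below are made up, and the 1.5 × IQR threshold is just the usual rule of thumb:

# Flag values outside 1.5 * IQR as potential outliers
s = pd.Series([10, 12, 11, 13, 95])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)   # flags 95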
Time-Series Analysis¶
- Pandas offers specialized support for time-series data, making it popular in finance and IoT data processing; a rough resampling sketch follows.
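A rough illustration of that support, using invented daily values resampled to monthly means:

# Build a small daily series and resample it to monthly averages
rng = pd.date_range("2023-01-01", periods=60, freq="D")
ts = pd.Series(range(60), index=rng)
monthly = ts.resample("MS").mean()   # "MS" = month-start buckets
print(monthly)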
These are just a few examples—pandas is central to nearly every data-related task in Python!