⚡ Reducing BERT with Pruna: Efficiency Meets Performance¶
In this session, we explore model compression and efficiency using the Pruna library. A GPU is recommended for this session.
Introduction¶
🌍 What is Pruna?¶
Pruna is an open-source Python library for neural network compression and acceleration. It enables you to:
- 🔧 Reduce model size by pruning unnecessary weights and neurons,
- ⚡ Accelerate inference by creating smaller, faster models,
- ♻️ Lower carbon footprint and memory usage, making NLP more sustainable.
Pruna supports various pruning strategies (like structured and unstructured pruning) and allows you to analyze trade-offs between performance and resource efficiency.
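As a minimal illustration of the idea behind unstructured pruning (this sketch uses plain PyTorch's pruning utility, not Pruna's own API), magnitude pruning simply zeroes out the smallest weights of a layer:

import torch
import torch.nn.utils.prune as prune

# Toy layer standing in for one transformer weight matrix
layer = torch.nn.Linear(768, 768)

# Zero out the 30% of weights with the smallest absolute value (L1 criterion)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fraction of weights that are now exactly zero (~0.30)
sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.2f}")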
🧪 What We'll Do in This Notebook¶
| Step | Description |
|---|---|
| 1. Fine-tune a BERT model | Train bert-base-uncased on the AG News dataset for news classification. |
| 2. Compress with Pruna | Apply Pruna's pruning techniques to the fine-tuned model. |
| 3. Evaluate Metrics | Measure inference speed, RAM usage, carbon footprint, and macro F1 / Precision / Recall. |
| 4. Compare Trade-offs | Identify how much we can compress the model while maintaining good accuracy. |
🎯 Learning Objectives¶
- Understand how pruning can reduce transformer models' size and compute.
- Measure resource-related impacts (like carbon footprint) of pruning.
- Learn to balance model performance and sustainability in NLP.
Let’s get started by loading the AG News dataset and preparing our fine-tuning pipeline!
from datasets import load_dataset
from collections import Counter
# Load AG News dataset
dataset = load_dataset("ag_news")
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(10000))
# Show train/test sizes
print("Train size:", len(dataset["train"]))
print("Test size:", len(dataset["test"]))
# Show label distribution
print("\nLabel distribution:")
label_names = dataset["train"].features["label"].names
label_counts = Counter(dataset["train"]["label"])
for i, label_name in enumerate(label_names):
    count = label_counts[i]
    print(f"{label_name}: {count}")
Train size: 10000
Test size: 7600

Label distribution:
World: 2530
Sports: 2528
Business: 2407
Sci/Tech: 2535
BERT Model on AG News¶
🏋️♂️ 1️⃣ Fine-Tune BERT on AG News¶
- We start by fine-tuning the bert-base-uncased model on the AG News dataset.
- This gives us a baseline model optimized for news topic classification.
🧪 2️⃣ Evaluate Core Performance Metrics¶
Once we have a fine-tuned model, we’ll evaluate the following traditional NLP metrics:
✅ F1 Score (macro) – Overall balance across classes
✅ Precision (macro) – How precise the predictions are
✅ Recall (macro) – How well the model captures all relevant samples
These help us ensure our model is performant before we start optimizing for size and speed.
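As a small illustration with made-up labels, "macro" averaging computes each metric per class and then takes the unweighted mean, so minority classes count as much as majority ones:

from sklearn.metrics import f1_score, precision_score, recall_score

# Toy example: 4 classes, a couple of mistakes
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]

print("F1 (macro):       ", f1_score(y_true, y_pred, average="macro"))
print("Precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("Recall (macro):   ", recall_score(y_true, y_pred, average="macro"))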
🌱 3️⃣ Analyze Environmental and Efficiency Metrics¶
Next, we’ll use Pruna to prune and compress the BERT model, making it faster and smaller. We’ll then evaluate:
- 💨 Inference Speed – How quickly the model predicts on new data.
- 💻 RAM Usage – Memory footprint during inference.
- 🌍 Carbon Footprint – CO2-equivalent emissions per request.
These metrics are essential for deploying sustainable NLP models in real-world applications.
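For the carbon footprint we rely on codecarbon later in this notebook; the basic pattern is simply to wrap a workload between start() and stop(). A minimal sketch (the workload here is just a placeholder loop):

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="demo", measure_power_secs=1)
tracker.start()
_ = sum(i * i for i in range(10_000_000))  # placeholder workload
emissions = tracker.stop()  # estimated kg CO2eq for the tracked span
print(f"Estimated emissions: {emissions:.6f} kg CO2eq")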
🚀 Goal¶
By the end of this notebook, you’ll learn how to:
- Build a high-quality news classifier with BERT,
- Compress it with Pruna for real-world deployment,
- And measure the full spectrum of trade-offs: accuracy, speed, carbon impact, and memory.
Let’s get started!
from transformers import AutoTokenizer
# Load tokenizer for BERT-base-uncased
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Tokenization function for the dataset
def tokenize_batch(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)
# Apply tokenization to train and test splits
tokenized_dataset = dataset.map(tokenize_batch, batched=True)
# Set format for PyTorch
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
# Quick check
print(tokenized_dataset["train"][0])
{'label': tensor(0), 'input_ids': tensor([ 101, 7269, 11498, 2135, 6924, 2011, 9326, 4559, 10134, 2031,
2716, 2116, 4865, 1998, 3655, 1999, 7269, 2000, 1037, 9190,
1010, 1996, 2154, 2044, 2324, 2111, 2351, 1999, 18217, 2012,
1037, 2576, 8320, 1012, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])}
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np
# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)
# Define compute_metrics for macro evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_score(labels, predictions, average='macro')
    recall = recall_score(labels, predictions, average='macro')
    f1 = f1_score(labels, predictions, average='macro')
    acc = accuracy_score(labels, predictions)
    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }
# Training setup
training_args = TrainingArguments(
    output_dir="./bert_agnews",
    eval_strategy="epoch",
    save_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=200,
    report_to="none",
    seed=42,
    gradient_accumulation_steps=2
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
# 🚀 Train
trainer.train()
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_7030/1089717668.py:43: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(
| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.323253 | 0.895921 | 0.895657 | 0.898335 | 0.895921 |
| 2 | No log | 0.286806 | 0.902368 | 0.901983 | 0.908402 | 0.902368 |
| 3 | 0.375100 | 0.249602 | 0.919474 | 0.919110 | 0.919976 | 0.919474 |
| 4 | 0.375100 | 0.255818 | 0.918553 | 0.918654 | 0.918782 | 0.918553 |
| 5 | 0.375100 | 0.257930 | 0.922237 | 0.922311 | 0.922678 | 0.922237 |
| 6 | 0.126100 | 0.279408 | 0.920526 | 0.920448 | 0.920480 | 0.920526 |
| 7 | 0.126100 | 0.312654 | 0.915132 | 0.914818 | 0.915857 | 0.915132 |
| 8 | 0.056300 | 0.316371 | 0.918947 | 0.918810 | 0.918995 | 0.918947 |
| 9 | 0.056300 | 0.329148 | 0.919342 | 0.919221 | 0.919192 | 0.919342 |
TrainOutput(global_step=780, training_loss=0.1495015780131022, metrics={'train_runtime': 870.928, 'train_samples_per_second': 114.82, 'train_steps_per_second': 0.896, 'total_flos': 6501064694611968.0, 'train_loss': 0.1495015780131022, 'epoch': 9.878980891719745})
import time
import psutil
import os
import gc
import logging
from codecarbon import EmissionsTracker
from tqdm import tqdm
import torch
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
logging.getLogger("codecarbon").disabled = True
# Improved helper function to measure inference speed and memory usage
def measure_inference_metrics(model, dataset, device="cpu", batch_size=32):
    model.to(device)
    model.eval()

    # Clear memory and get baseline measurements
    gc.collect()
    if torch.cuda.is_available() and device != "cpu":
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats(device)
        gpu_memory_baseline = torch.cuda.memory_allocated(device) / 1e6  # MB

    process = psutil.Process(os.getpid())
    cpu_memory_baseline = process.memory_info().rss / 1e6  # MB

    # Start tracking
    tracker = EmissionsTracker(project_name="bert_agnews", measure_power_secs=1)
    tracker.start()
    start_time = time.time()
    total_samples = len(dataset)

    # Collect all predictions and true labels for macro metrics
    all_predictions = []
    all_labels = []

    # Track peak memory during inference
    peak_cpu_memory = cpu_memory_baseline

    # Evaluate in batches
    for i in tqdm(range(0, total_samples, batch_size)):
        batch = dataset[i: i + batch_size]
        inputs = {
            "input_ids": batch["input_ids"].to(device, dtype=torch.long),
            "attention_mask": batch["attention_mask"].to(device, dtype=torch.long)
        }
        labels = batch["label"].to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # Handle different output formats from compressed models
        if hasattr(outputs, 'logits'):
            logits = outputs.logits
        elif isinstance(outputs, torch.Tensor):
            logits = outputs
        elif isinstance(outputs, (tuple, list)) and len(outputs) > 0:
            logits = outputs[0]
        else:
            raise ValueError(f"Unexpected output format: {type(outputs)}")

        preds = torch.argmax(logits, dim=-1)

        # Collect predictions and labels for metric calculation
        all_predictions.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

        # Track peak CPU memory during inference
        current_cpu_memory = process.memory_info().rss / 1e6
        peak_cpu_memory = max(peak_cpu_memory, current_cpu_memory)

    end_time = time.time()
    emissions: float = tracker.stop()

    # Calculate macro metrics
    f1_macro = f1_score(all_labels, all_predictions, average='macro')
    precision_macro = precision_score(all_labels, all_predictions, average='macro')
    recall_macro = recall_score(all_labels, all_predictions, average='macro')

    # Final memory measurements
    cpu_memory_final = process.memory_info().rss / 1e6  # MB
    cpu_memory_used = peak_cpu_memory - cpu_memory_baseline

    metrics = {
        "inference_speed (samples/sec)": total_samples / (end_time - start_time),
        "cpu_memory_used (MB)": cpu_memory_used,
        "cpu_memory_peak (MB)": peak_cpu_memory,
        "carbon_footprint (kg CO2eq)": emissions,
        "f1_macro": f1_macro,
        "precision_macro": precision_macro,
        "recall_macro": recall_macro
    }

    # Add GPU memory metrics if using CUDA
    if torch.cuda.is_available() and device != "cpu":
        gpu_memory_peak = torch.cuda.max_memory_allocated(device) / 1e6  # MB
        gpu_memory_current = torch.cuda.memory_allocated(device) / 1e6  # MB
        gpu_memory_used = gpu_memory_peak - gpu_memory_baseline
        metrics.update({
            "gpu_memory_used (MB)": gpu_memory_used,
            "gpu_memory_peak (MB)": gpu_memory_peak,
            "gpu_memory_current (MB)": gpu_memory_current
        })

    return metrics
# Evaluate on test set with improved memory tracking
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
print("📊 Evaluating Original Model...")
original_metrics = measure_inference_metrics(
model,
tokenized_dataset["test"],
device=device,
batch_size=32
)
original_eval = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
# Remove duplicate metrics from inference
for metric in ['f1', 'precision', 'recall']:
if f"{metric}_macro" in original_metrics and f"eval_{metric}" in original_eval:
del original_metrics[f"{metric}_macro"]
all_metrics = {**original_eval, **original_metrics}
# Display all metrics with better formatting
print("\n" + "="*50)
print("BERT MODEL PERFORMANCE METRICS")
print("="*50)
print("\n📊 Classification Metrics (Macro):")
for k, v in all_metrics.items():
    if any(metric in k.lower() for metric in ['f1', 'precision', 'recall']):
        print(f" {k}: {v:.4f}" if isinstance(v, float) else f" {k}: {v}")
print("\n⚡ Performance Metrics:")
for k, v in all_metrics.items():
    if 'speed' in k.lower():
        print(f" {k}: {v:.2f}")
print("\n💾 Memory Usage:")
for k, v in all_metrics.items():
    if 'memory' in k.lower():
        print(f" {k}: {v:.2f} MB")
print("\n🌍 Environmental Impact:")
for k, v in all_metrics.items():
    if 'carbon' in k.lower():
        print(f" {k}: {v:.6f}")
Using device: cuda
📊 Evaluating Original Model...
100%|██████████| 238/238 [00:18<00:00, 12.86it/s]
==================================================
BERT MODEL PERFORMANCE METRICS
==================================================

📊 Classification Metrics (Macro):
 eval_f1: 0.9192
 eval_precision: 0.9192
 eval_recall: 0.9193

⚡ Performance Metrics:
 inference_speed (samples/sec): 410.47

💾 Memory Usage:
 cpu_memory_used (MB): 0.00 MB
 cpu_memory_peak (MB): 2494.89 MB
 gpu_memory_used (MB): 140.58 MB
 gpu_memory_peak (MB): 2160.91 MB
 gpu_memory_current (MB): 2020.37 MB

🌍 Environmental Impact:
 carbon_footprint (kg CO2eq): 0.004168
📊 Performance and Efficiency Metrics for BERT on AG News¶
After fine-tuning bert-base-uncased on the AG News dataset, here’s a summary of the model’s performance and resource usage:
📈 Classification Performance (Macro)¶
- F1 Score: 0.9192
- Precision: 0.9192
- Recall: 0.9193
✅ This shows the BERT model performs extremely well on news topic classification!
⚡ Inference Performance¶
- Inference Speed: ~410 samples/second
- CPU Memory Peak: ~2495 MB
- GPU Memory Peak: ~2161 MB
- GPU Memory Used: ~140 MB (allocated during active inference)
These are reasonable for a BERT-base model — showing high throughput and moderate GPU usage.
🌍 Environmental Impact¶
The carbon footprint per individual inference request can be calculated as:
$$\text{CO2eq per request} = \frac{\text{Total Emissions}}{\text{Total Test Samples}}$$
With our results:
- Total emissions: 0.004168 kg CO2eq
- Test samples: 7,600
$$\text{CO2eq per request} = \frac{0.004168}{7600} \approx \boxed{0.000000548 \text{ kg CO2eq/request}}$$
✈️ Real-World Comparison: Flights and Mass Usage¶
- One-way flight Madrid → NYC: ~480 kg CO2eq per passenger
- Inference requests equivalent to 1 flight:
$$\frac{480 \text{ kg CO2eq}}{0.000000548 \text{ kg CO2eq/request}} \approx \boxed{876,000,000 \text{ requests}}$$
📊 That's roughly 876 million inference requests to equal one transatlantic flight!
👥 1 Million Users, 10 Requests/Day¶
Daily Usage Calculation¶
Assumptions:
- 👥 Users: 1,000,000
- 🔄 Requests per user per day: 10
Daily calculations:
- Total daily requests: $1,000,000 \times 10 = 10,000,000 \text{ requests/day}$
- Daily CO2 emissions: $10,000,000 \times 0.000000548 \approx \boxed{5.48 \text{ kg CO2eq/day}}$
Annual Environmental Impact¶
- Daily flight equivalent: $\frac{5.48}{480} \approx 0.0114 \text{ flights/day}$
- Annual flight equivalent: $0.0114 \times 365 \approx \boxed{4 \text{ flights/year}}$
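A quick sanity check of the arithmetic above, starting from the measured values earlier in this notebook (small differences from the rounded figures in the text are expected):

# Values taken from the measurements above
total_emissions_kg = 0.004168   # kg CO2eq for the full test set
test_samples = 7600
flight_kg = 480                 # one-way Madrid -> NYC, per passenger

per_request = total_emissions_kg / test_samples
requests_per_flight = flight_kg / per_request

daily_requests = 1_000_000 * 10
daily_kg = daily_requests * per_request
annual_flights = daily_kg / flight_kg * 365

print(f"CO2eq per request:        {per_request:.9f} kg")
print(f"Requests per flight:      {requests_per_flight:,.0f}")
print(f"Daily emissions:          {daily_kg:.2f} kg")
print(f"Annual flight equivalent: {annual_flights:.1f}")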
💡 Takeaway¶
| Metric | Value | Interpretation |
|---|---|---|
| 🔬 Per Request | 0.000000548 kg CO2eq | Tiny individual impact |
| 🌐 At Web Scale | 5.48 kg CO2eq/day | Significant cumulative impact |
| ✈️ Annual Equivalent | ~4 transatlantic flights | Meaningful environmental cost |
🔋 Even a single NLP model can produce measurable CO2eq over time, especially at web-scale.
🌍 This reinforces the importance of using pruning and efficient architectures like those we’ll explore with Pruna in the next steps!
Let’s move on to see how we can reduce this carbon footprint while maintaining performance.
🔧 Pruna Compression Strategies¶
Pruna provides several modular compression techniques that can be combined or used individually:
| Strategy | What it does |
|---|---|
| Batcher | Reduces redundant computations across batched inputs. |
| Pruner | Removes unnecessary weights and neurons, shrinking the model. |
| Quantizer | Reduces the precision of weights and activations to use less memory. |
| Cacher | Optimizes repeated computations by caching intermediate results. |
| Recoverer | Tries to recover performance after aggressive pruning or quantization. |
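As a preview of the API we use below, strategies are selected on a SmashConfig and applied with smash(). The sketch combines a pruner and a quantizer on a copy of our fine-tuned model; it assumes a CUDA device, and not every algorithm combination is compatible with every model:

import copy
from pruna import SmashConfig, smash

smash_config = SmashConfig(batch_size=32, device="cuda")
smash_config["pruner"] = "torch_unstructured"  # drop low-magnitude weights
smash_config["quantizer"] = "half"             # store weights in FP16

# Work on a copy so the original fine-tuned model stays untouched
compressed = smash(model=copy.deepcopy(model), smash_config=smash_config)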
🎯 Our Goal¶
We’ll apply each of these strategies individually to our fine-tuned BERT model on AG News and compare:
✅ Classification Metrics (F1, Precision, Recall)
✅ Inference Speed
✅ RAM/GPU Usage
✅ Carbon Footprint
This will help us quantify trade-offs:
- 🟢 How much speedup and memory saving do we get?
- 🔴 How much (if any) accuracy loss?
- 🌍 How much carbon reduction?
Let’s dive in!
import copy
import torch
from pruna import SmashConfig, smash
# Create a calibration dataset from our tokenized data
def create_calibration_dataset():
"""Create a small calibration dataset for algorithms that require it"""
# Use a subset of training data for calibration
cal_size = min(1000, len(dataset["train"]))
cal_train_dataset = dataset["train"].shuffle(seed=42).select(range(cal_size))
cal_test_dataset = dataset["test"].shuffle(seed=42).select(range(cal_size))
cal_val_dataset = dataset["test"].shuffle(seed=43).select(range(cal_size, cal_size*2))
return cal_train_dataset, cal_test_dataset, cal_val_dataset
# Define compression strategies with requirements
compression_strategies = [
    {
        "name": "unstructured_pruning",
        "config_fn": lambda: {
            "pruner": "torch_unstructured"
        },
        "requires": [],
        "description": "Unstructured magnitude pruning (BERT compatible)"
    },
    {
        "name": "half_precision",
        "config_fn": lambda: {
            "quantizer": "half"
        },
        "requires": [],
        "description": "Half precision (FP16) quantization"
    },
    {
        "name": "dynamic_quantization",
        "config_fn": lambda: {
            "quantizer": "torch_dynamic"
        },
        "requires": [],
        "description": "Dynamic quantization (runtime)"
    },
    {
        "name": "llm_int8_quantization",
        "config_fn": lambda: {
            "quantizer": "llm_int8"
        },
        "requires": [],
        "description": "LLM-specific INT8 quantization"
    }
]
results_by_strategy = {}
original_all_metrics = {**original_eval, **original_metrics}
results_by_strategy["original"] = original_all_metrics
print(f"✅ Original Model:")
print(f" F1: {original_eval.get('eval_f1', 'N/A'):.4f}")
print(f" Speed: {original_metrics.get('inference_speed (samples/sec)', 'N/A'):.2f} samples/sec")
print(f" Memory: {original_metrics.get('gpu_memory_used (MB)', 'N/A'):.2f} MB")
✅ Original Model:
 F1: 0.9192
 Speed: 407.00 samples/sec
 Memory: 140.58 MB
device = "cuda" if torch.cuda.is_available() else "cpu"
for strategy in compression_strategies[1:]:
    strategy_name = strategy["name"]
    config_dict = strategy["config_fn"]()
    requirements = strategy["requires"]

    # Create basic SmashConfig
    smash_config = SmashConfig(
        batch_size=32,
        device=device
    )

    print(f"\n🚀 Applying Strategy: {strategy_name.upper().replace('_', ' ')}")

    # Clone the model to avoid modifying original
    model_copy = copy.deepcopy(model)

    # Add required components based on algorithm needs
    if "tokenizer" in requirements:
        smash_config.add_tokenizer(tokenizer)
        print(f" ✅ Added tokenizer")

    if "dataset" in requirements:
        cal_train_dataset, cal_test_dataset, cal_val_dataset = create_calibration_dataset()
        # Add calibration data as (train, val, test) splits
        smash_config.add_data(
            (cal_train_dataset, cal_val_dataset, cal_test_dataset),  # (train, val, test) - only train is needed for calibration
            collate_fn="text_generation_collate"  # collate function used here for the BERT inputs
        )
        print(f" ✅ Added calibration dataset ({len(cal_train_dataset)} samples)")

    # Configure algorithms using dictionary syntax as per documentation
    for algo_type, algo_name in config_dict.items():
        smash_config[algo_type] = algo_name
        print(f" Configured {algo_type}: {algo_name}")

    # Apply compression using the Pruna API
    compressed_pruna_model = smash(
        model=model_copy,
        smash_config=smash_config
    )

    # Extract the underlying PyTorch model from the PrunaModel wrapper
    if hasattr(compressed_pruna_model, 'model'):
        compressed_model = compressed_pruna_model.model
    elif hasattr(compressed_pruna_model, '_model'):
        compressed_model = compressed_pruna_model._model
    else:
        # If we can't extract, use the PrunaModel directly for inference
        compressed_model = compressed_pruna_model

    # Ensure model is on correct device and in eval mode
    compressed_model.to(device)
    compressed_model.eval()

    print(f"✅ Successfully applied {strategy_name}")

    # Measure performance metrics
    metrics = measure_inference_metrics(
        compressed_model,
        tokenized_dataset["test"],
        device=device,
        batch_size=32
    )

    temp_args = TrainingArguments(
        output_dir="./temp",
        per_device_eval_batch_size=64,
        report_to="none"
    )

    temp_trainer = Trainer(
        model=compressed_model,
        args=temp_args,
        eval_dataset=tokenized_dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    eval_metrics = temp_trainer.evaluate()

    # Remove duplicate metrics
    for metric in ['f1', 'precision', 'recall']:
        if f"{metric}_macro" in metrics and f"eval_{metric}" in eval_metrics:
            del metrics[f"{metric}_macro"]

    # Combine all metrics
    all_metrics = {**eval_metrics, **metrics}
    results_by_strategy[strategy_name] = all_metrics

    print(f" 📊 F1: {eval_metrics.get('eval_f1', 'N/A'):.4f}")
    print(f" 🚀 Speed: {metrics.get('inference_speed (samples/sec)', 'N/A'):.2f} samples/sec")
    print(f" 💾 Memory: {metrics.get('gpu_memory_used (MB)', 'N/A'):.2f} MB")
    print(f" 🌍 Carbon: {metrics.get('carbon_footprint (kg CO2eq)', 'N/A'):.6f} kg CO2eq")
🚀 Applying Strategy: HALF PRECISION Configured quantizer: half
INFO - Starting quantizer half... INFO - quantizer half was applied successfully.
✅ Successfully applied half_precision
0%| | 0/238 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[54], line 61
---> 61 metrics = measure_inference_metrics(
     62     compressed_model,
     63     tokenized_dataset["test"],
     64     device=device,
     65     batch_size=32
     66 )

Cell In[52], line 53, in measure_inference_metrics(model, dataset, device, batch_size)

File ~/.cache/pypoetry/virtualenvs/bse-nlp-V5dzOmtQ-py3.10/lib/python3.10/site-packages/pruna/algorithms/quantization/half.py:106, in HalfQuantizer._apply.<locals>.new_forward(*args, **kwargs)
    104 args = tuple(arg.half() if hasattr(arg, "half") else arg for arg in args)
    105 kwargs = {k: v.half() if hasattr(v, "half") else v for k, v in kwargs.items()}
--> 106 return original_forward(*args, **kwargs)

[... frames through BertForSequenceClassification.forward, BertModel.forward, BertEmbeddings.forward and torch.nn.functional.embedding ...]

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
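The failure comes from Pruna's half quantizer wrapper: it calls .half() on every tensor argument, so the integer input_ids arrive at the embedding layer as fp16 and the lookup rejects them. One possible workaround (a sketch only, not run in the original session) is to cast the model weights to FP16 manually while keeping the integer inputs as torch.long:

# Sketch of a manual FP16 baseline that avoids casting the integer inputs.
# This bypasses Pruna's half quantizer wrapper rather than fixing it.
fp16_model = copy.deepcopy(model).half().to(device).eval()

batch = tokenized_dataset["test"][:8]
inputs = {
    "input_ids": batch["input_ids"].to(device),            # stays torch.long
    "attention_mask": batch["attention_mask"].to(device),  # stays torch.long
}
with torch.no_grad():
    logits = fp16_model(**inputs).logits
print(logits.dtype)  # torch.float16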
# Check what's available in pruna package
import pruna
print("Available in pruna package:")
pruna_attrs = [attr for attr in dir(pruna) if not attr.startswith('_')]
for attr in pruna_attrs:
print(f" - {attr}")
# Check for common API patterns
print(f"\nChecking API structure:")
if hasattr(pruna, 'compress'):
print("✅ pruna.compress available")
if hasattr(pruna, 'optimize'):
print("✅ pruna.optimize available")
if hasattr(pruna, 'PruningConfig'):
print("✅ pruna.PruningConfig available")
if hasattr(pruna, 'prune'):
print("✅ pruna.prune available")
Available in pruna package:
 - PRUNA_ALGORITHMS
 - PrunaModel
 - SmashConfig
 - algorithms
 - config
 - data
 - engine
 - logging
 - smash
 - telemetry
 - version

Checking API structure:
# Check what algorithms are available
print("Available Pruna algorithms:")
print(pruna.PRUNA_ALGORITHMS)
# Check what's in the smash module
print(f"\nSmash module contents:")
smash_attrs = [attr for attr in dir(pruna.smash) if not attr.startswith('_')]
for attr in smash_attrs:
print(f" - {attr}")
Available Pruna algorithms:
{'factorizer': {'qkv_diffusers': <pruna.algorithms.factorizing.qkv_diffusers.QKVDiffusers object at 0x7e46349556f0>}, 'pruner': {'torch_structured': <pruna.algorithms.pruning.torch_structured.TorchStructuredPruner object at 0x7e4634957fd0>, 'torch_unstructured': <pruna.algorithms.pruning.torch_unstructured.TorchUnstructuredPruner object at 0x7e46343adbd0>}, 'quantizer': {'gptq': <pruna.algorithms.quantization.gptq_model.GPTQQuantizer object at 0x7e4627f44160>, 'half': <pruna.algorithms.quantization.half.HalfQuantizer object at 0x7e46343aceb0>, 'hqq': <pruna.algorithms.quantization.hqq.HQQQuantizer object at 0x7e4634124610>, 'hqq_diffusers': <pruna.algorithms.quantization.hqq_diffusers.HQQDiffusersQuantizer object at 0x7e4627f034c0>, 'awq': <pruna.algorithms.quantization.huggingface_awq.AWQQuantizer object at 0x7e46343983a0>, 'diffusers_int8': <pruna.algorithms.quantization.huggingface_diffusers_int8.DiffusersInt8Quantizer object at 0x7e4634398370>, 'llm_int8': <pruna.algorithms.quantization.huggingface_llm_int8.LLMInt8Quantizer object at 0x7e4634646ce0>, 'quanto': <pruna.algorithms.quantization.quanto.QuantoQuantizer object at 0x7e4626b14e50>, 'torch_dynamic': <pruna.algorithms.quantization.torch_dynamic.TorchDynamicQuantizer object at 0x7e4626be4a60>, 'torch_static': <pruna.algorithms.quantization.torch_static.TorchStaticQuantizer object at 0x7e46342239d0>, 'torchao': <pruna.algorithms.quantization.torchao.TorchaoQuantizer object at 0x7e4634583370>}, 'cacher': {'deepcache': <pruna.algorithms.caching.deepcache.DeepCacheCacher object at 0x7e46360faa70>, 'fastercache': <pruna.algorithms.caching.fastercache.FasterCacheCacher object at 0x7e46360d7a60>, 'fora': <pruna.algorithms.caching.fora.FORACacher object at 0x7e46349871f0>, 'pab': <pruna.algorithms.caching.pab.PABCacher object at 0x7e46363b8040>}, 'compiler': {'c_generate': <pruna.algorithms.compilation.c_translate.CGenerateCompiler object at 0x7e46361be740>, 'c_translate': <pruna.algorithms.compilation.c_translate.CTranslateCompiler object at 0x7e46361bd690>, 'c_whisper': <pruna.algorithms.compilation.c_translate.CWhisperCompiler object at 0x7e46361be500>, 'stable_fast': <pruna.algorithms.compilation.stable_fast.StableFastCompiler object at 0x7e463476ac20>, 'torch_compile': <pruna.algorithms.compilation.torch_compile.TorchCompileCompiler object at 0x7e46361bdf30>}, 'batcher': {'ifw': <pruna.algorithms.batching.ifw.IFWBatcher object at 0x7e46363a7790>, 'whisper_s2t': <pruna.algorithms.batching.ws2t.WS2TBatcher object at 0x7e463487d840>}}
Smash module contents:
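Since the raw registry dump above is hard to read, a small helper (a sketch, not part of the original run) can list just the algorithm names per group:

import pruna

# PRUNA_ALGORITHMS is a dict mapping each group (pruner, quantizer, ...) to its algorithms
for group, algos in pruna.PRUNA_ALGORITHMS.items():
    print(f"{group}: {sorted(algos.keys())}")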
# Check SmashConfig parameters
from pruna import SmashConfig
import inspect
print("SmashConfig constructor signature:")
print(inspect.signature(SmashConfig.__init__))
# Check SmashConfig documentation/attributes
print("\nSmashConfig attributes:")
config_attrs = [attr for attr in dir(SmashConfig) if not attr.startswith('_')]
for attr in config_attrs:
print(f" - {attr}")
SmashConfig constructor signature:
(self, max_batch_size: 'int | None' = None, batch_size: 'int' = 1, device: 'str | torch.device | None' = None, cache_dir_prefix: 'str' = '/home/ubuntu/.cache/pruna', configuration: 'Configuration | None' = None) -> 'None'

SmashConfig attributes:
 - add_data
 - add_processor
 - add_tokenizer
 - cleanup_cache_dir
 - flush_configuration
 - get_tokenizer_name
 - is_batch_size_locked
 - load_dict
 - load_from_json
 - lock_batch_size
 - reset_cache_dir
 - save_to_json
 - test_dataloader
 - train_dataloader
 - val_dataloader
# Check pruna.smash function signature
import inspect
print("pruna.smash signature:")
try:
    print(inspect.signature(pruna.smash))
except:
    print("Could not get signature, let's check attributes:")
    print([attr for attr in dir(pruna.smash) if not attr.startswith('_')])
pruna.smash signature:
(model: Any, smash_config: pruna.config.smash_config.SmashConfig, verbose: bool = False, experimental: bool = False) -> pruna.engine.pruna_model.PrunaModel