⚡ Reducing BERT with Pruna: Efficiency Meets Performance¶
In this session, we explore model compression and efficiency using the Pruna library. A GPU is recommended for this session.
Introduction¶
🌍 What is Pruna?¶
Pruna is an open-source Python library for neural network compression and acceleration. It enables you to:
- 🔧 Reduce model size by pruning unnecessary weights and neurons,
- ⚡ Accelerate inference by creating smaller, faster models,
- ♻️ Lower carbon footprint and memory usage, making NLP more sustainable.
Pruna supports various pruning strategies (like structured and unstructured pruning) and allows you to analyze trade-offs between performance and resource efficiency.
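As a minimal illustration of the idea behind unstructured pruning (this sketch uses plain PyTorch's pruning utility, not Pruna's own API), magnitude pruning simply zeroes out the smallest weights of a layer:

import torch
import torch.nn.utils.prune as prune

# Toy layer standing in for one transformer weight matrix
layer = torch.nn.Linear(768, 768)

# Zero out the 30% of weights with the smallest absolute value (L1 criterion)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fraction of weights that are now exactly zero (~0.30)
sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.2f}")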
🧪 What We'll Do in This Notebook¶
| Step | Description |
|---|---|
| 1. Fine-tune a BERT model | Train bert-base-uncased on the AG News dataset for news classification. |
| 2. Compress with Pruna | Apply Pruna's pruning techniques to the fine-tuned model. |
| 3. Evaluate Metrics | Measure inference speed, RAM usage, carbon footprint, and macro F1 / Precision / Recall. |
| 4. Compare Trade-offs | Identify how much we can compress the model while maintaining good accuracy. |
🎯 Learning Objectives¶
- Understand how pruning can reduce transformer models' size and compute.
- Measure resource-related impacts (like carbon footprint) of pruning.
- Learn to balance model performance and sustainability in NLP.
Let’s get started by loading the AG News dataset and preparing our fine-tuning pipeline!
from datasets import load_dataset
from collections import Counter
# Load AG News dataset
dataset = load_dataset("ag_news")
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(10000))
# Show train/test sizes
print("Train size:", len(dataset["train"]))
print("Test size:", len(dataset["test"]))
# Show label distribution
print("\nLabel distribution:")
label_names = dataset["train"].features["label"].names
label_counts = Counter(dataset["train"]["label"])
for i, label_name in enumerate(label_names):
    count = label_counts[i]
    print(f"{label_name}: {count}")
Train size: 10000
Test size: 7600

Label distribution:
World: 2530
Sports: 2528
Business: 2407
Sci/Tech: 2535
BERT Model on AG News¶
🏋️♂️ 1️⃣ Fine-Tune BERT on AG News¶
- We start by fine-tuning the bert-base-uncased model on the AG News dataset.
- This gives us a baseline model optimized for news topic classification.
🧪 2️⃣ Evaluate Core Performance Metrics¶
Once we have a fine-tuned model, we’ll evaluate the following traditional NLP metrics:
✅ F1 Score (macro) – Overall balance across classes
✅ Precision (macro) – How precise the predictions are
✅ Recall (macro) – How well the model captures all relevant samples
These help us ensure our model is performant before we start optimizing for size and speed.
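As a small illustration with made-up labels, "macro" averaging computes each metric per class and then takes the unweighted mean, so minority classes count as much as majority ones:

from sklearn.metrics import f1_score, precision_score, recall_score

# Toy example: 4 classes, a couple of mistakes
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]

print("F1 (macro):       ", f1_score(y_true, y_pred, average="macro"))
print("Precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("Recall (macro):   ", recall_score(y_true, y_pred, average="macro"))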
🌱 3️⃣ Analyze Environmental and Efficiency Metrics¶
Next, we’ll use Pruna to prune and compress the BERT model, making it faster and smaller. We’ll then evaluate:
- 💨 Inference Speed – How quickly the model predicts on new data.
- 💻 RAM Usage – Memory footprint during inference.
- 🌍 Carbon Footprint – CO2-equivalent emissions per request.
These metrics are essential for deploying sustainable NLP models in real-world applications.
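For the carbon footprint we rely on codecarbon later in this notebook; the basic pattern is simply to wrap a workload between start() and stop(). A minimal sketch (the workload here is just a placeholder loop):

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="demo", measure_power_secs=1)
tracker.start()
_ = sum(i * i for i in range(10_000_000))  # placeholder workload
emissions = tracker.stop()  # estimated kg CO2eq for the tracked span
print(f"Estimated emissions: {emissions:.6f} kg CO2eq")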
🚀 Goal¶
By the end of this notebook, you’ll learn how to:
- Build a high-quality news classifier with BERT,
- Compress it with Pruna for real-world deployment,
- And measure the full spectrum of trade-offs: accuracy, speed, carbon impact, and memory.
Let’s get started!
from transformers import AutoTokenizer
# Load tokenizer for BERT-base-uncased
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Tokenization function for the dataset
def tokenize_batch(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)
# Apply tokenization to train and test splits
tokenized_dataset = dataset.map(tokenize_batch, batched=True)
# Set format for PyTorch
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
# Quick check
print(tokenized_dataset["train"][0])
{'label': tensor(0), 'input_ids': tensor([ 101, 7269, 11498, 2135, 6924, 2011, 9326, 4559, 10134, 2031,
2716, 2116, 4865, 1998, 3655, 1999, 7269, 2000, 1037, 9190,
1010, 1996, 2154, 2044, 2324, 2111, 2351, 1999, 18217, 2012,
1037, 2576, 8320, 1012, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])}
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np
# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)
# Define compute_metrics for macro evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_score(labels, predictions, average='macro')
    recall = recall_score(labels, predictions, average='macro')
    f1 = f1_score(labels, predictions, average='macro')
    acc = accuracy_score(labels, predictions)
    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }
# Training setup
training_args = TrainingArguments(
    output_dir="./bert_agnews",
    eval_strategy="epoch",
    save_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=200,
    report_to="none",
    seed=42,
    gradient_accumulation_steps=2
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
# 🚀 Train
trainer.train()
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_7030/1089717668.py:43: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(
| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.323253 | 0.895921 | 0.895657 | 0.898335 | 0.895921 |
| 2 | No log | 0.286806 | 0.902368 | 0.901983 | 0.908402 | 0.902368 |
| 3 | 0.375100 | 0.249602 | 0.919474 | 0.919110 | 0.919976 | 0.919474 |
| 4 | 0.375100 | 0.255818 | 0.918553 | 0.918654 | 0.918782 | 0.918553 |
| 5 | 0.375100 | 0.257930 | 0.922237 | 0.922311 | 0.922678 | 0.922237 |
| 6 | 0.126100 | 0.279408 | 0.920526 | 0.920448 | 0.920480 | 0.920526 |
| 7 | 0.126100 | 0.312654 | 0.915132 | 0.914818 | 0.915857 | 0.915132 |
| 8 | 0.056300 | 0.316371 | 0.918947 | 0.918810 | 0.918995 | 0.918947 |
| 9 | 0.056300 | 0.329148 | 0.919342 | 0.919221 | 0.919192 | 0.919342 |
TrainOutput(global_step=780, training_loss=0.1495015780131022, metrics={'train_runtime': 870.928, 'train_samples_per_second': 114.82, 'train_steps_per_second': 0.896, 'total_flos': 6501064694611968.0, 'train_loss': 0.1495015780131022, 'epoch': 9.878980891719745})
import time
import psutil
import os
import gc
import logging
from codecarbon import EmissionsTracker
from tqdm import tqdm
import torch
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
logging.getLogger("codecarbon").disabled = True
# Improved helper function to measure inference speed and memory usage
def measure_inference_metrics(model, dataset, device="cpu", batch_size=32):
    model.to(device)
    model.eval()

    # Clear memory and get baseline measurements
    gc.collect()
    if torch.cuda.is_available() and device != "cpu":
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats(device)
        gpu_memory_baseline = torch.cuda.memory_allocated(device) / 1e6  # MB

    process = psutil.Process(os.getpid())
    cpu_memory_baseline = process.memory_info().rss / 1e6  # MB

    # Start tracking
    tracker = EmissionsTracker(project_name="bert_agnews", measure_power_secs=1)
    tracker.start()
    start_time = time.time()
    total_samples = len(dataset)

    # Collect all predictions and true labels for macro metrics
    all_predictions = []
    all_labels = []

    # Track peak memory during inference
    peak_cpu_memory = cpu_memory_baseline

    # Evaluate in batches
    for i in tqdm(range(0, total_samples, batch_size)):
        batch = dataset[i: i + batch_size]
        inputs = {
            "input_ids": batch["input_ids"].to(device, dtype=torch.long),
            "attention_mask": batch["attention_mask"].to(device, dtype=torch.long)
        }
        labels = batch["label"].to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # Handle different output formats from compressed models
        if hasattr(outputs, 'logits'):
            logits = outputs.logits
        elif isinstance(outputs, torch.Tensor):
            logits = outputs
        elif isinstance(outputs, (tuple, list)) and len(outputs) > 0:
            logits = outputs[0]
        else:
            raise ValueError(f"Unexpected output format: {type(outputs)}")

        preds = torch.argmax(logits, dim=-1)

        # Collect predictions and labels for metric calculation
        all_predictions.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

        # Track peak CPU memory during inference
        current_cpu_memory = process.memory_info().rss / 1e6
        peak_cpu_memory = max(peak_cpu_memory, current_cpu_memory)

    end_time = time.time()
    emissions: float = tracker.stop()

    # Calculate macro metrics
    f1_macro = f1_score(all_labels, all_predictions, average='macro')
    precision_macro = precision_score(all_labels, all_predictions, average='macro')
    recall_macro = recall_score(all_labels, all_predictions, average='macro')

    # Final memory measurements
    cpu_memory_final = process.memory_info().rss / 1e6  # MB
    cpu_memory_used = peak_cpu_memory - cpu_memory_baseline

    metrics = {
        "inference_speed (samples/sec)": total_samples / (end_time - start_time),
        "cpu_memory_used (MB)": cpu_memory_used,
        "cpu_memory_peak (MB)": peak_cpu_memory,
        "carbon_footprint (kg CO2eq)": emissions,
        "f1_macro": f1_macro,
        "precision_macro": precision_macro,
        "recall_macro": recall_macro
    }

    # Add GPU memory metrics if using CUDA
    if torch.cuda.is_available() and device != "cpu":
        gpu_memory_peak = torch.cuda.max_memory_allocated(device) / 1e6  # MB
        gpu_memory_current = torch.cuda.memory_allocated(device) / 1e6  # MB
        gpu_memory_used = gpu_memory_peak - gpu_memory_baseline
        metrics.update({
            "gpu_memory_used (MB)": gpu_memory_used,
            "gpu_memory_peak (MB)": gpu_memory_peak,
            "gpu_memory_current (MB)": gpu_memory_current
        })

    return metrics
# Evaluate on test set with improved memory tracking
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
print("📊 Evaluating Original Model...")
original_metrics = measure_inference_metrics(
model,
tokenized_dataset["test"],
device=device,
batch_size=32
)
original_eval = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
# Remove duplicate metrics from inference
for metric in ['f1', 'precision', 'recall']:
if f"{metric}_macro" in original_metrics and f"eval_{metric}" in original_eval:
del original_metrics[f"{metric}_macro"]
all_metrics = {**original_eval, **original_metrics}
# Display all metrics with better formatting
print("\n" + "="*50)
print("BERT MODEL PERFORMANCE METRICS")
print("="*50)
print("\n📊 Classification Metrics (Macro):")
for k, v in all_metrics.items():
    if any(metric in k.lower() for metric in ['f1', 'precision', 'recall']):
        print(f" {k}: {v:.4f}" if isinstance(v, float) else f" {k}: {v}")
print("\n⚡ Performance Metrics:")
for k, v in all_metrics.items():
    if 'speed' in k.lower():
        print(f" {k}: {v:.2f}")
print("\n💾 Memory Usage:")
for k, v in all_metrics.items():
    if 'memory' in k.lower():
        print(f" {k}: {v:.2f} MB")
print("\n🌍 Environmental Impact:")
for k, v in all_metrics.items():
    if 'carbon' in k.lower():
        print(f" {k}: {v:.6f}")
Using device: cuda
📊 Evaluating Original Model...
100%|██████████| 238/238 [00:18<00:00, 12.86it/s]
==================================================
BERT MODEL PERFORMANCE METRICS
==================================================

📊 Classification Metrics (Macro):
 eval_f1: 0.9192
 eval_precision: 0.9192
 eval_recall: 0.9193

⚡ Performance Metrics:
 inference_speed (samples/sec): 410.47

💾 Memory Usage:
 cpu_memory_used (MB): 0.00 MB
 cpu_memory_peak (MB): 2494.89 MB
 gpu_memory_used (MB): 140.58 MB
 gpu_memory_peak (MB): 2160.91 MB
 gpu_memory_current (MB): 2020.37 MB

🌍 Environmental Impact:
 carbon_footprint (kg CO2eq): 0.004168
📊 Performance and Efficiency Metrics for BERT on AG News¶
After fine-tuning bert-base-uncased on the AG News dataset, here’s a summary of the model’s performance and resource usage:
📈 Classification Performance (Macro)¶
- F1 Score: 0.9192
- Precision: 0.9192
- Recall: 0.9193
✅ This shows the BERT model performs extremely well on news topic classification!
⚡ Inference Performance¶
- Inference Speed: ~410 samples/second
- CPU Memory Peak: ~2495 MB
- GPU Memory Peak: ~2161 MB
- GPU Memory Used: ~140 MB (allocated during active inference)
These are reasonable for a BERT-base model — showing high throughput and moderate GPU usage.
🌍 Environmental Impact¶
The carbon footprint per individual inference request can be calculated as:
$$\text{CO2eq per request} = \frac{\text{Total Emissions}}{\text{Total Test Samples}}$$
With our results:
- Total emissions: 0.004168 kg CO2eq
- Test samples: 7,600
$$\text{CO2eq per request} = \frac{0.004168}{7600} \approx \boxed{0.000000548 \text{ kg CO2eq/request}}$$
✈️ Real-World Comparison: Flights and Mass Usage¶
- One-way flight Madrid → NYC: ~480 kg CO2eq per passenger
- Inference requests equivalent to 1 flight:
$$\frac{480 \text{ kg CO2eq}}{0.000000548 \text{ kg CO2eq/request}} \approx \boxed{876,000,000 \text{ requests}}$$
📊 That's roughly 876 million inference requests to equal one transatlantic flight!
👥 1 Million Users, 10 Requests/Day¶
Daily Usage Calculation¶
Assumptions:
- 👥 Users: 1,000,000
- 🔄 Requests per user per day: 10
Daily calculations:
- Total daily requests: $1,000,000 \times 10 = 10,000,000 \text{ requests/day}$
- Daily CO2 emissions: $10,000,000 \times 0.000000548 \approx \boxed{5.48 \text{ kg CO2eq/day}}$
Annual Environmental Impact¶
- Daily flight equivalent: $\frac{5.48}{480} \approx 0.0114 \text{ flights/day}$
- Annual flight equivalent: $0.0114 \times 365 \approx \boxed{4 \text{ flights/year}}$
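A quick sanity check of the arithmetic above, starting from the measured values earlier in this notebook (small differences from the rounded figures in the text are expected):

# Values taken from the measurements above
total_emissions_kg = 0.004168   # kg CO2eq for the full test set
test_samples = 7600
flight_kg = 480                 # one-way Madrid -> NYC, per passenger

per_request = total_emissions_kg / test_samples
requests_per_flight = flight_kg / per_request

daily_requests = 1_000_000 * 10
daily_kg = daily_requests * per_request
annual_flights = daily_kg / flight_kg * 365

print(f"CO2eq per request:        {per_request:.9f} kg")
print(f"Requests per flight:      {requests_per_flight:,.0f}")
print(f"Daily emissions:          {daily_kg:.2f} kg")
print(f"Annual flight equivalent: {annual_flights:.1f}")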
💡 Takeaway¶
| Metric | Value | Interpretation |
|---|---|---|
| 🔬 Per Request | 0.000000548 kg CO2eq | Tiny individual impact |
| 🌐 At Web Scale | 5.48 kg CO2eq/day | Significant cumulative impact |
| ✈️ Annual Equivalent | ~4 transatlantic flights | Meaningful environmental cost |
🔋 Even a single NLP model can produce measurable CO2eq over time, especially at web-scale.
🌍 This reinforces the importance of using pruning and efficient architectures like those we’ll explore with Pruna in the next steps!
Let’s move on to see how we can reduce this carbon footprint while maintaining performance.
🔧 Pruna Compression Strategies¶
Pruna provides several modular compression techniques that can be combined or used individually:
| Strategy | What it does |
|---|---|
| Batcher | Reduces redundant computations across batched inputs. |
| Pruner | Removes unnecessary weights and neurons, shrinking the model. |
| Quantizer | Reduces the precision of weights and activations to use less memory. |
| Cacher | Optimizes repeated computations by caching intermediate results. |
| Recoverer | Tries to recover performance after aggressive pruning or quantization. |
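As a preview of the API we use below, strategies are selected on a SmashConfig and applied with smash(). The sketch combines a pruner and a quantizer on a copy of our fine-tuned model; it assumes a CUDA device, and not every algorithm combination is compatible with every model:

import copy
from pruna import SmashConfig, smash

smash_config = SmashConfig(batch_size=32, device="cuda")
smash_config["pruner"] = "torch_unstructured"  # drop low-magnitude weights
smash_config["quantizer"] = "half"             # store weights in FP16

# Work on a copy so the original fine-tuned model stays untouched
compressed = smash(model=copy.deepcopy(model), smash_config=smash_config)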
🎯 Our Goal¶
We’ll apply each of these strategies individually to our fine-tuned BERT model on AG News and compare:
✅ Classification Metrics (F1, Precision, Recall)
✅ Inference Speed
✅ RAM/GPU Usage
✅ Carbon Footprint
This will help us quantify trade-offs:
- 🟢 How much speedup and memory saving do we get?
- 🔴 How much (if any) accuracy loss?
- 🌍 How much carbon reduction?
Let’s dive in!
import copy
import torch
from pruna import SmashConfig, smash
# Create a calibration dataset from our tokenized data
def create_calibration_dataset():
"""Create a small calibration dataset for algorithms that require it"""
# Use a subset of training data for calibration
cal_size = min(1000, len(dataset["train"]))
cal_train_dataset = dataset["train"].shuffle(seed=42).select(range(cal_size))
cal_test_dataset = dataset["test"].shuffle(seed=42).select(range(cal_size))
cal_val_dataset = dataset["test"].shuffle(seed=43).select(range(cal_size, cal_size*2))
return cal_train_dataset, cal_test_dataset, cal_val_dataset
# Define compression strategies with requirements
compression_strategies = [
    {
        "name": "unstructured_pruning",
        "config_fn": lambda: {
            "pruner": "torch_unstructured"
        },
        "requires": [],
        "description": "Unstructured magnitude pruning (BERT compatible)"
    },
    {
        "name": "half_precision",
        "config_fn": lambda: {
            "quantizer": "half"
        },
        "requires": [],
        "description": "Half precision (FP16) quantization"
    },
    {
        "name": "dynamic_quantization",
        "config_fn": lambda: {
            "quantizer": "torch_dynamic"
        },
        "requires": [],
        "description": "Dynamic quantization (runtime)"
    },
    {
        "name": "llm_int8_quantization",
        "config_fn": lambda: {
            "quantizer": "llm_int8"
        },
        "requires": [],
        "description": "LLM-specific INT8 quantization"
    }
]
results_by_strategy = {}
original_all_metrics = {**original_eval, **original_metrics}
results_by_strategy["original"] = original_all_metrics
print(f"✅ Original Model:")
print(f" F1: {original_eval.get('eval_f1', 'N/A'):.4f}")
print(f" Speed: {original_metrics.get('inference_speed (samples/sec)', 'N/A'):.2f} samples/sec")
print(f" Memory: {original_metrics.get('gpu_memory_used (MB)', 'N/A'):.2f} MB")
✅ Original Model:
 F1: 0.9192
 Speed: 407.00 samples/sec
 Memory: 140.58 MB
device = "cuda" if torch.cuda.is_available() else "cpu"
for strategy in compression_strategies[1:]:
    strategy_name = strategy["name"]
    config_dict = strategy["config_fn"]()
    requirements = strategy["requires"]

    # Create basic SmashConfig
    smash_config = SmashConfig(
        batch_size=32,
        device=device
    )

    print(f"\n🚀 Applying Strategy: {strategy_name.upper().replace('_', ' ')}")

    # Clone the model to avoid modifying original
    model_copy = copy.deepcopy(model)

    # Add required components based on algorithm needs
    if "tokenizer" in requirements:
        smash_config.add_tokenizer(tokenizer)
        print(f" ✅ Added tokenizer")

    if "dataset" in requirements:
        cal_train_dataset, cal_test_dataset, cal_val_dataset = create_calibration_dataset()
        # Add calibration data as (train, val, test) splits
        smash_config.add_data(
            (cal_train_dataset, cal_val_dataset, cal_test_dataset),  # (train, val, test) - only train is needed for calibration
            collate_fn="text_generation_collate"  # collate function used here for the BERT inputs
        )
        print(f" ✅ Added calibration dataset ({len(cal_train_dataset)} samples)")

    # Configure algorithms using dictionary syntax as per documentation
    for algo_type, algo_name in config_dict.items():
        smash_config[algo_type] = algo_name
        print(f" Configured {algo_type}: {algo_name}")

    # Apply compression using the Pruna API
    compressed_pruna_model = smash(
        model=model_copy,
        smash_config=smash_config
    )

    # Extract the underlying PyTorch model from the PrunaModel wrapper
    if hasattr(compressed_pruna_model, 'model'):
        compressed_model = compressed_pruna_model.model
    elif hasattr(compressed_pruna_model, '_model'):
        compressed_model = compressed_pruna_model._model
    else:
        # If we can't extract, use the PrunaModel directly for inference
        compressed_model = compressed_pruna_model

    # Ensure model is on correct device and in eval mode
    compressed_model.to(device)
    compressed_model.eval()

    print(f"✅ Successfully applied {strategy_name}")

    # Measure performance metrics
    metrics = measure_inference_metrics(
        compressed_model,
        tokenized_dataset["test"],
        device=device,
        batch_size=32
    )

    temp_args = TrainingArguments(
        output_dir="./temp",
        per_device_eval_batch_size=64,
        report_to="none"
    )

    temp_trainer = Trainer(
        model=compressed_model,
        args=temp_args,
        eval_dataset=tokenized_dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    eval_metrics = temp_trainer.evaluate()

    # Remove duplicate metrics
    for metric in ['f1', 'precision', 'recall']:
        if f"{metric}_macro" in metrics and f"eval_{metric}" in eval_metrics:
            del metrics[f"{metric}_macro"]

    # Combine all metrics
    all_metrics = {**eval_metrics, **metrics}
    results_by_strategy[strategy_name] = all_metrics

    print(f" 📊 F1: {eval_metrics.get('eval_f1', 'N/A'):.4f}")
    print(f" 🚀 Speed: {metrics.get('inference_speed (samples/sec)', 'N/A'):.2f} samples/sec")
    print(f" 💾 Memory: {metrics.get('gpu_memory_used (MB)', 'N/A'):.2f} MB")
    print(f" 🌍 Carbon: {metrics.get('carbon_footprint (kg CO2eq)', 'N/A'):.6f} kg CO2eq")
🚀 Applying Strategy: HALF PRECISION Configured quantizer: half
INFO - Starting quantizer half... INFO - quantizer half was applied successfully.
✅ Successfully applied half_precision
0%| | 0/238 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[54], line 61
---> 61 metrics = measure_inference_metrics(
     62     compressed_model,
     63     tokenized_dataset["test"],
     64     device=device,
     65     batch_size=32
     66 )

Cell In[52], line 53, in measure_inference_metrics(model, dataset, device, batch_size)

File ~/.cache/pypoetry/virtualenvs/bse-nlp-V5dzOmtQ-py3.10/lib/python3.10/site-packages/pruna/algorithms/quantization/half.py:106, in HalfQuantizer._apply.<locals>.new_forward(*args, **kwargs)
    104 args = tuple(arg.half() if hasattr(arg, "half") else arg for arg in args)
    105 kwargs = {k: v.half() if hasattr(v, "half") else v for k, v in kwargs.items()}
--> 106 return original_forward(*args, **kwargs)

[... frames through BertForSequenceClassification.forward, BertModel.forward, BertEmbeddings.forward and torch.nn.functional.embedding ...]

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
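The failure comes from Pruna's half quantizer wrapper: it calls .half() on every tensor argument, so the integer input_ids arrive at the embedding layer as fp16 and the lookup rejects them. One possible workaround (a sketch only, not run in the original session) is to cast the model weights to FP16 manually while keeping the integer inputs as torch.long:

# Sketch of a manual FP16 baseline that avoids casting the integer inputs.
# This bypasses Pruna's half quantizer wrapper rather than fixing it.
fp16_model = copy.deepcopy(model).half().to(device).eval()

batch = tokenized_dataset["test"][:8]
inputs = {
    "input_ids": batch["input_ids"].to(device),            # stays torch.long
    "attention_mask": batch["attention_mask"].to(device),  # stays torch.long
}
with torch.no_grad():
    logits = fp16_model(**inputs).logits
print(logits.dtype)  # torch.float16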
# Check what's available in pruna package
import pruna
print("Available in pruna package:")
pruna_attrs = [attr for attr in dir(pruna) if not attr.startswith('_')]
for attr in pruna_attrs:
print(f" - {attr}")
# Check for common API patterns
print(f"\nChecking API structure:")
if hasattr(pruna, 'compress'):
print("✅ pruna.compress available")
if hasattr(pruna, 'optimize'):
print("✅ pruna.optimize available")
if hasattr(pruna, 'PruningConfig'):
print("✅ pruna.PruningConfig available")
if hasattr(pruna, 'prune'):
print("✅ pruna.prune available")
Available in pruna package:
 - PRUNA_ALGORITHMS
 - PrunaModel
 - SmashConfig
 - algorithms
 - config
 - data
 - engine
 - logging
 - smash
 - telemetry
 - version

Checking API structure:
# Check what algorithms are available
print("Available Pruna algorithms:")
print(pruna.PRUNA_ALGORITHMS)
# Check what's in the smash module
print(f"\nSmash module contents:")
smash_attrs = [attr for attr in dir(pruna.smash) if not attr.startswith('_')]
for attr in smash_attrs:
print(f" - {attr}")
Available Pruna algorithms:
{'factorizer': {'qkv_diffusers': <pruna.algorithms.factorizing.qkv_diffusers.QKVDiffusers object at 0x7e46349556f0>}, 'pruner': {'torch_structured': <pruna.algorithms.pruning.torch_structured.TorchStructuredPruner object at 0x7e4634957fd0>, 'torch_unstructured': <pruna.algorithms.pruning.torch_unstructured.TorchUnstructuredPruner object at 0x7e46343adbd0>}, 'quantizer': {'gptq': <pruna.algorithms.quantization.gptq_model.GPTQQuantizer object at 0x7e4627f44160>, 'half': <pruna.algorithms.quantization.half.HalfQuantizer object at 0x7e46343aceb0>, 'hqq': <pruna.algorithms.quantization.hqq.HQQQuantizer object at 0x7e4634124610>, 'hqq_diffusers': <pruna.algorithms.quantization.hqq_diffusers.HQQDiffusersQuantizer object at 0x7e4627f034c0>, 'awq': <pruna.algorithms.quantization.huggingface_awq.AWQQuantizer object at 0x7e46343983a0>, 'diffusers_int8': <pruna.algorithms.quantization.huggingface_diffusers_int8.DiffusersInt8Quantizer object at 0x7e4634398370>, 'llm_int8': <pruna.algorithms.quantization.huggingface_llm_int8.LLMInt8Quantizer object at 0x7e4634646ce0>, 'quanto': <pruna.algorithms.quantization.quanto.QuantoQuantizer object at 0x7e4626b14e50>, 'torch_dynamic': <pruna.algorithms.quantization.torch_dynamic.TorchDynamicQuantizer object at 0x7e4626be4a60>, 'torch_static': <pruna.algorithms.quantization.torch_static.TorchStaticQuantizer object at 0x7e46342239d0>, 'torchao': <pruna.algorithms.quantization.torchao.TorchaoQuantizer object at 0x7e4634583370>}, 'cacher': {'deepcache': <pruna.algorithms.caching.deepcache.DeepCacheCacher object at 0x7e46360faa70>, 'fastercache': <pruna.algorithms.caching.fastercache.FasterCacheCacher object at 0x7e46360d7a60>, 'fora': <pruna.algorithms.caching.fora.FORACacher object at 0x7e46349871f0>, 'pab': <pruna.algorithms.caching.pab.PABCacher object at 0x7e46363b8040>}, 'compiler': {'c_generate': <pruna.algorithms.compilation.c_translate.CGenerateCompiler object at 0x7e46361be740>, 'c_translate': <pruna.algorithms.compilation.c_translate.CTranslateCompiler object at 0x7e46361bd690>, 'c_whisper': <pruna.algorithms.compilation.c_translate.CWhisperCompiler object at 0x7e46361be500>, 'stable_fast': <pruna.algorithms.compilation.stable_fast.StableFastCompiler object at 0x7e463476ac20>, 'torch_compile': <pruna.algorithms.compilation.torch_compile.TorchCompileCompiler object at 0x7e46361bdf30>}, 'batcher': {'ifw': <pruna.algorithms.batching.ifw.IFWBatcher object at 0x7e46363a7790>, 'whisper_s2t': <pruna.algorithms.batching.ws2t.WS2TBatcher object at 0x7e463487d840>}}
Smash module contents:
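Since the raw registry dump above is hard to read, a small helper (a sketch, not part of the original run) can list just the algorithm names per group:

import pruna

# PRUNA_ALGORITHMS is a dict mapping each group (pruner, quantizer, ...) to its algorithms
for group, algos in pruna.PRUNA_ALGORITHMS.items():
    print(f"{group}: {sorted(algos.keys())}")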
# Check SmashConfig parameters
from pruna import SmashConfig
import inspect
print("SmashConfig constructor signature:")
print(inspect.signature(SmashConfig.__init__))
# Check SmashConfig documentation/attributes
print("\nSmashConfig attributes:")
config_attrs = [attr for attr in dir(SmashConfig) if not attr.startswith('_')]
for attr in config_attrs:
print(f" - {attr}")
SmashConfig constructor signature:
(self, max_batch_size: 'int | None' = None, batch_size: 'int' = 1, device: 'str | torch.device | None' = None, cache_dir_prefix: 'str' = '/home/ubuntu/.cache/pruna', configuration: 'Configuration | None' = None) -> 'None'

SmashConfig attributes:
 - add_data
 - add_processor
 - add_tokenizer
 - cleanup_cache_dir
 - flush_configuration
 - get_tokenizer_name
 - is_batch_size_locked
 - load_dict
 - load_from_json
 - lock_batch_size
 - reset_cache_dir
 - save_to_json
 - test_dataloader
 - train_dataloader
 - val_dataloader
# Check pruna.smash function signature
import inspect
print("pruna.smash signature:")
try:
    print(inspect.signature(pruna.smash))
except:
    print("Could not get signature, let's check attributes:")
    print([attr for attr in dir(pruna.smash) if not attr.startswith('_')])
pruna.smash signature:
(model: Any, smash_config: pruna.config.smash_config.SmashConfig, verbose: bool = False, experimental: bool = False) -> pruna.engine.pruna_model.PrunaModel