BERT (Hugging Face) vs. LSTM vs. TF-IDF + Logistic Regression for Text Classification¶
In this notebook, we compare three different approaches to text classification:
- Logistic Regression on top of TF-IDF features
- LSTM (using Keras) trained end-to-end
- BERT (using Hugging Face) trained end-to-end
We'll explore:
- The architectures for each approach
- Their respective training processes
- Visualization of results and metrics
- Potential improvements for the BERT model
1. Environment Setup & Imports¶
We'll install any dependencies (if needed) and import the necessary libraries.
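If anything is missing from your environment, an install along these lines should cover it (the package list is inferred from the imports below, so adjust it to your setup):
!pip install -q numpy pandas matplotlib seaborn scikit-learn tensorflow transformers datasets evaluate torch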
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1" # This is needed for the LSTM model on GPU
os.environ["KERAS_BACKEND"] = "tensorflow" # Use TensorFlow backend
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
DataCollatorWithPadding, TrainingArguments, Trainer)
import evaluate, torch, os, random, json
np.random.seed(42)
2. Data Loading & Preprocessing¶
We'll use the IMDB movie review dataset from the 🤗 Hub. Each example has two columns:
- text: the review text
- label: the target label (0 = negative, 1 = positive)
The dataset already comes with train and test splits, which we convert to pandas DataFrames for the TF-IDF and LSTM models. For validation, the LSTM later uses a validation_split and BERT uses an evaluation set during training.
# Load the IMDB dataset from the Hugging Face Hub
dataset = load_dataset('imdb')
print(dataset)
# Convert to pandas DataFrame for easier manipulation
train_df = dataset['train'].shuffle(seed=42).to_pandas()
test_df = dataset['test'].shuffle(seed=42).to_pandas()
# Clean the memory
del dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
3. TF-IDF + Logistic Regression¶
We'll:
- Vectorize our text using TF-IDF.
- Train a Logistic Regression model.
- Evaluate on the test set.
- Visualize results.
# 3.1 Vectorize text with TF-IDF
tfidf_vec = TfidfVectorizer(stop_words='english', min_df=5, max_features=20000)
X_train_tfidf = tfidf_vec.fit_transform(train_df['text'])
y_train = train_df['label'].values
X_test_tfidf = tfidf_vec.transform(test_df['text'])
y_test = test_df['label'].values
print("X_train shape:", X_train_tfidf.shape, ", y_train shape:", y_train.shape)
X_train shape: (25000, 20000) , y_train shape: (25000,)
# 3.2 Train a Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train_tfidf, y_train)
# Evaluate on test
y_pred_logreg = logreg.predict(X_test_tfidf)
print("LogisticRegression Test Classification Report:")
print(classification_report(y_test, y_pred_logreg))
cm_logreg = confusion_matrix(y_test, y_pred_logreg)
print("Confusion Matrix:", cm_logreg)
LogisticRegression Test Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.88      0.88     12500
           1       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

Confusion Matrix: [[10985  1515]
 [ 1479 11021]]
Visualization of Confusion Matrix¶
plt.figure(figsize=(4,4))
sns.heatmap(cm_logreg, annot=True, cmap='Blues', fmt='d', cbar=False)
plt.title("LogReg TF-IDF - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
4. LSTM Text Classification (Keras)¶
4.1 Prepare Token Sequences¶
We reuse the TF-IDF vocabulary to build a Keras tokenizer, then convert each review into a padded sequence of token IDs.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#Create the tokenizer with the same vocabulary as the TF-IDF vectorizer
custom_vocab = tfidf_vec.vocabulary_
vocab_size = len(custom_vocab) + 1 # +1 for the OOV token
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
# Manually assign the word index (note: these indices start at 0, which collides with the padding value 0 used later by pad_sequences; shifting every index by 1 is the usual convention)
tokenizer.word_index = {word: i for i, word in enumerate(custom_vocab)}
tokenizer.word_index[tokenizer.oov_token] = len(custom_vocab)
print(len(tokenizer.word_index))
## Set the maximum sequence length to 128
max_len = 128
# Let's create a function to convert the text to a sequence of token IDs and pad them to the same length: 128
def text_to_seq(df_col, max_len=128):
seqs = tokenizer.texts_to_sequences(df_col)
# pad
seqs_padded = pad_sequences(seqs, maxlen=max_len, padding='post', truncating='post')
return seqs_padded
X_train_seq = text_to_seq(train_df['text'].map(lambda x: x.lower()), max_len=max_len)
X_test_seq = text_to_seq(test_df['text'].map(lambda x: x.lower()), max_len=max_len)
y_train_lstm = train_df['label'].values
y_test_lstm = test_df['label'].values
X_train_seq.shape, y_train_lstm.shape
20001
((25000, 128), (25000,))
4.2 Build the LSTM Model¶
Now that we have our sequences ready and padded, we can build an LSTM-based neural network using Keras.
We’ll use a simple architecture:
- An Embedding layer to learn word representations.
- A single LSTM layer (with L2 kernel regularization, returning the full sequence), whose outputs are flattened and passed through Dropout.
- A Dense layer with a sigmoid activation to output binary predictions.
The optimizer is Adam with a learning rate of 0.001.
import keras
from keras import layers, regularizers
embedding_dim = 32
inputs = keras.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
# LSTM with regularizer
x = layers.LSTM(32,
return_sequences=True,
kernel_regularizer=regularizers.L2(0.01))(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.15)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_1 (InputLayer)        [(None, 128)]             0
 embedding (Embedding)       (None, 128, 32)           640032
 lstm (LSTM)                 (None, 128, 32)           8320
 flatten (Flatten)           (None, 4096)              0
 dropout (Dropout)           (None, 4096)              0
 dense (Dense)               (None, 1)                 4097
=================================================================
Total params: 652449 (2.49 MB)
Trainable params: 652449 (2.49 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
4.3 Train the LSTM Model¶
Now we train the model using our preprocessed sequences and binary labels.
We use:
- validation_split=0.2: hold out 20% of the training data for validation.
- epochs=20: iterate up to 20 times over the dataset (early stopping usually ends training sooner).
- batch_size=128: a fairly large batch size to speed up training.
Note that LSTMs are more computationally expensive than logistic regression, so training can take longer, especially with small batches.
To control overfitting we use the EarlyStopping callback, which stops training if the validation loss does not improve for 3 epochs.
from tensorflow.keras.metrics import Recall, Precision
optimizer = keras.optimizers.Adam(learning_rate=0.001)
# Then compile with the correct metric objects
model.compile(
optimizer=optimizer,
loss=keras.losses.BinaryCrossentropy(from_logits=False),
metrics=['accuracy', Recall(), Precision()]
)
from tensorflow.keras.callbacks import EarlyStopping
history = model.fit(
X_train_seq, y_train_lstm,
validation_split=0.2,
epochs=20,
batch_size=128,
verbose=1,
callbacks=[EarlyStopping(monitor='val_loss', patience=3)]
)
Epoch 1/20
157/157 [==============================] - 10s 39ms/step - loss: 0.7218 - accuracy: 0.6925 - recall: 0.6582 - precision: 0.5813 - val_loss: 0.4788 - val_accuracy: 0.8064 - val_recall: 0.6960 - val_precision: 0.7081
Epoch 2/20
157/157 [==============================] - 6s 36ms/step - loss: 0.2989 - accuracy: 0.8913 - recall: 0.7491 - precision: 0.7688 - val_loss: 0.3881 - val_accuracy: 0.8410 - val_recall: 0.7881 - val_precision: 0.8005
Epoch 3/20
157/157 [==============================] - 6s 36ms/step - loss: 0.1876 - accuracy: 0.9392 - recall: 0.8165 - precision: 0.8247 - val_loss: 0.4466 - val_accuracy: 0.8336 - val_recall: 0.8351 - val_precision: 0.8405
Epoch 4/20
157/157 [==============================] - 6s 36ms/step - loss: 0.1205 - accuracy: 0.9647 - recall: 0.8514 - precision: 0.8543 - val_loss: 0.6309 - val_accuracy: 0.8064 - val_recall: 0.8590 - val_precision: 0.8662
Epoch 5/20
157/157 [==============================] - 6s 36ms/step - loss: 0.0881 - accuracy: 0.9752 - recall: 0.8662 - precision: 0.8768 - val_loss: 0.6237 - val_accuracy: 0.8214 - val_recall: 0.8742 - val_precision: 0.8835
4.4 Training Curves¶
Let's visualize the training and validation accuracy/loss.
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.legend()
plt.title('LSTM Loss')
plt.subplot(1,2,2)
plt.plot(history.history['accuracy'], label='Train Acc')
plt.plot(history.history['val_accuracy'], label='Val Acc')
plt.legend()
plt.title('LSTM Accuracy')
plt.show()
4.5 Evaluate on Test Data¶
y_pred_lstm_prob = model.predict(X_test_seq)
y_pred_lstm = (y_pred_lstm_prob>0.5).astype(int)
print("LSTM Classification Report:")
print(classification_report(y_test_lstm, y_pred_lstm))
782/782 [==============================] - 5s 6ms/step
LSTM Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.78      0.78     12500
           1       0.78      0.78      0.78     12500

    accuracy                           0.78     25000
   macro avg       0.78      0.78      0.78     25000
weighted avg       0.78      0.78      0.78     25000
5 : BERT (Text Classification)¶
In this section we'll fine-tune BERT (Base-uncased) – a bidirectional transformer that already knows an enormous amount about language from pre-training on Wikipedia + Books.
🤔 Why Move From LSTM to BERT?¶
|  | LSTM + Embeddings | BERT (Transformer) |
|---|---|---|
| Context Window | Limited, sequential | Global (self-attention) |
| Pre-training | Usually random embeddings | 110 M parameters pre-trained on ~3 B words |
| Handles Long Ranges? | Struggles | Much better (all-to-all attention) |
| Training Data Needed | More (from scratch) | Less (only fine-tune) |
| Compute / Memory | Lower | Higher |
🤖 What is BERT?¶
BERT stands for Bidirectional Encoder Representations from Transformers. It was introduced by Google in 2018 and changed the game by allowing models to understand context in both directions — left-to-right and right-to-left — at once.
This means:
- It understands that in "bank account" and "river bank", the word "bank" means different things (a short sketch after this list shows this in numbers).
- It can be pre-trained on massive text data and then fine-tuned on small, specific tasks like sentiment analysis, spam detection, or question answering.
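Here is a minimal, self-contained sketch of that idea (not part of the original notebook; tok_demo, enc_demo, and bank_vector are throwaway names for this illustration): we embed the word "bank" in two different sentences with a plain BERT encoder and compare the two contextual vectors.
# Hypothetical illustration: contextual embeddings of "bank" in two sentences
from transformers import AutoTokenizer, AutoModel
import torch

tok_demo = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_demo = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Hidden state of the first "bank" token in the sentence
    inputs = tok_demo(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = enc_demo(**inputs).last_hidden_state[0]      # (seq_len, 768)
    tokens = tok_demo.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("I deposited cash into my bank account.")
v2 = bank_vector("We had a picnic on the river bank.")
print(torch.cosine_similarity(v1, v2, dim=0).item())          # same word, noticeably different vectors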
🧩 What Are We Doing here?¶
We’ll take a pre-trained BERT model, add a small classification head on top, and fine-tune it to solve a binary classification problem (e.g. spam vs. ham, or positive vs. negative).
You don’t need to train BERT from scratch. Why?
💡 Because it was already trained on massive corpora like Wikipedia and BooksCorpus — that’s months of compute time we’re saving!
🔍 Where Do Pretrained Models Come From?¶
We use the 🤗 Hugging Face Hub — it’s the “GitHub for machine learning models”. There you’ll find thousands of pre-trained models shared by the community, ready to be fine-tuned for your task.
You can explore and download models like:
- bert-base-uncased (classic BERT)
- distilbert-base-uncased (smaller, faster)
- roberta-base, albert, tinyBERT, ModernBERT, etc.
Just replace the model name in your code — it’s that easy!
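For instance, a hedged sketch of what that swap looks like, using distilbert-base-uncased from the list above (alt_ckpt, alt_tok, and alt_model are names invented for this illustration):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

alt_ckpt = "distilbert-base-uncased"                   # smaller and faster than bert-base-uncased
alt_tok = AutoTokenizer.from_pretrained(alt_ckpt)
alt_model = AutoModelForSequenceClassification.from_pretrained(alt_ckpt, num_labels=2)
print(alt_model.config.model_type)                     # "distilbert"; the rest of the fine-tuning code stays the same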
⚡ Why You Should Use a GPU¶
Training large transformer models like BERT is computationally expensive. Without a GPU, training might take hours or even days — especially with large datasets.
Using a GPU (e.g. on Google Colab, Kaggle, or your university’s cluster) makes training:
- ⏱️ Much faster
- 🔁 More suitable for experimentation
- 🔬 Better for deep model introspection and visualization
You can check GPU availability with:
import torch
print(torch.cuda.is_available())
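A common follow-up (a small sketch, not from the original notebook) is to pick the device once; note that the 🤗 Trainer used later places the model on the GPU for you:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)          # manual PyTorch code would then call model.to(device)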
🧰 Setup & Imports¶
Before diving into BERT fine-tuning, we start by loading the necessary libraries and defining key training parameters:
- 🤗 Hugging Face Transformers: For model and tokenizer.
- 🤗 Datasets: To handle data loading and formatting.
- Trainer API: For managing the training loop and evaluation.
- EarlyStoppingCallback: To stop training once performance stops improving.
- evaluate: For downstream metrics (like accuracy, F1...).
- Seed initialization: To ensure reproducibility.
🔧 Key Configuration Parameters¶
- model_ckpt: we'll use "bert-base-uncased" as our pre-trained model. You can replace it with smaller models like "microsoft/MiniLM-L6..." for efficiency.
- num_labels: we're doing binary classification (2 labels).
- max_length: input sequences are padded or truncated to 128 tokens.
- batch_size: controls how many samples are processed in one training step.
- seed: sets the random seed for reproducibility.
We also define a set_seed function to ensure our experiments are deterministic across runs.
from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments, Trainer,
EarlyStoppingCallback)
import evaluate, numpy as np, matplotlib.pyplot as plt
from pathlib import Path
import torch, random, os
model_ckpt = "bert-base-uncased" # ✔️ swap for "microsoft/MiniLM-L6..." if you need something lighter
num_labels = 2 # binary classification
max_length = 128 # truncate / pad length
batch_size = 64
seed = 42
def set_seed(seed):
random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
set_seed(seed)
5.1 Pre-Processing (Tokenizer)¶
🧪 Tokenize the Text (BERT Can’t Read Raw Text!)¶
BERT doesn't read raw words; it reads tokens. So we need to:
- Convert text into BERT-readable format using a tokenizer.
- Truncate or pad each sentence to a fixed length (max_length=128).
Before we batch-process the whole dataset, let’s look at what happens to one sentence during BERT tokenization.
We’ll use the BERT tokenizer (bert-base-uncased
) to show the key steps:
✏️ Example: A Single Sentence¶
Let's say we have this short review: "The movie was surprisingly good!" (the same toy sentence reused in Step 3 below).
🔢 Step 1: Tokenization into Subwords¶
BERT uses WordPiece tokenization — it breaks rare or unknown words into subword units:
Some words (like "surprisingly") may get split into subwords such as ['surprising', '##ly'].
The ## prefix means: "this subword continues a previous token."
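A quick self-contained check of that behaviour (demo_tok is a throwaway name; the notebook loads its own tokenizer as tok in the next cell):
from transformers import AutoTokenizer

demo_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(demo_tok.tokenize("The movie was surprisingly good!"))   # common words usually stay whole
print(demo_tok.tokenize("risible"))                            # a rarer word: split into ['ri', '##sible'], as in the review below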
# Load your dataset -----------------------------------------------------------
raw = load_dataset("imdb", split={"train":"train", "test":"test"})
dataset = DatasetDict(train=raw["train"], test=raw["test"])
# Tokeniser -------------------------------------------------------------------
tok = AutoTokenizer.from_pretrained(model_ckpt)
text = dataset['train'][1]['text']
print(text)
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies.
text = dataset['train'][1]['text']
print(text)
tokens = tok.tokenize(text)
print(tokens)
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies. ['[CLS]', '"', 'i', 'am', 'curious', ':', 'yellow', '"', 'is', 'a', 'ri', '##sible', 'and', 'pre', '##ten', '##tious', 'steaming', 'pile', '.', 'it', 'doesn', "'", 't', 'matter', 'what', 'one', "'", 's', 'political', 'views', 'are', 'because', 'this', 'film', 'can', 'hardly', 'be', 'taken', 'seriously', 'on', 'any', 'level', '.', 'as', 'for', 'the', 'claim', 'that', 'frontal', 'male', 'nu', '##dity', 'is', 'an', 'automatic', 'nc', '-', '17', ',', 'that', 'isn', "'", 't', 'true', '.', 'i', "'", 've', 'seen', 'r', '-', 'rated', 'films', 'with', 'male', 'nu', '##dity', '.', 'granted', ',', 'they', 'only', 'offer', 'some', 'fleeting', 'views', ',', 'but', 'where', 'are', 'the', 'r', '-', 'rated', 'films', 'with', 'gaping', 'vu', '##lva', '##s', 'and', 'flap', '##ping', 'lab', '##ia', '?', 'nowhere', ',', 'because', 'they', 'don', "'", 't', 'exist', '.', 'the', 'same', 'goes', 'for', 'those', 'crap', '##py', 'cable', 'shows', ':', 'sc', '##hl', '##ong', '##s', 'swinging', 'in', 'the', 'breeze', 'but', 'not', 'a', 'clit', '##oris', 'in', 'sight', '.', 'and', 'those', 'pre', '##ten', '##tious', 'indie', 'movies', 'like', 'the', 'brown', 'bunny', ',', 'in', 'which', 'we', "'", 're', 'treated', 'to', 'the', 'site', 'of', 'vincent', 'gallo', "'", 's', 'throbbing', 'johnson', ',', 'but', 'not', 'a', 'trace', 'of', 'pink', 'visible', 'on', 'chloe', 'se', '##vi', '##gny', '.', 'before', 'crying', '(', 'or', 'implying', ')', '"', 'double', '-', 'standard', '"', 'in', 'matters', 'of', 'nu', '##dity', ',', 'the', 'mentally', 'ob', '##tus', '##e', 'should', 'take', 'into', 'account', 'one', 'una', '##vo', '##ida', '##bly', 'obvious', 'anatomical', 'difference', 'between', 'men', 'and', 'women', ':', 'there', 'are', 'no', 'gen', '##ital', '##s', 'on', 'display', 'when', 'actresses', 'appears', 'nude', ',', 'and', 'the', 'same', 'cannot', 'be', 'said', 'for', 'a', 'man', '.', 'in', 'fact', ',', 'you', 'generally', 'won', "'", 't', 'see', 'female', 'gen', '##ital', '##s', 'in', 'an', 'american', 'film', 'in', 'anything', 'short', 'of', 'porn', 'or', 'explicit', 'erotic', '##a', '.', 'this', 'alleged', 'double', '-', 'standard', 'is', 'less', 'a', 'double', 'standard', 'than', 'an', 'admitted', '##ly', 'de', '##pressing', 'ability', 'to', 
'come', 'to', 'terms', 'culturally', 'with', 'the', 'insides', 'of', 'women', "'", 's', 'bodies', '.', '[SEP]']
🔁 Step 2: Add Special Tokens¶
BERT needs special tokens:
- [CLS] at the start: stands for "classification" and is used as the summary vector.
- [SEP] at the end: separates sequences (useful for sentence pairs).
print(text)
tokens = tok.tokenize(text)
print(tokens)
tokens = tok.tokenize(text, add_special_tokens=True)
print(tokens)
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies. ['"', 'i', 'am', 'curious', ':', 'yellow', '"', 'is', 'a', 'ri', '##sible', 'and', 'pre', '##ten', '##tious', 'steaming', 'pile', '.', 'it', 'doesn', "'", 't', 'matter', 'what', 'one', "'", 's', 'political', 'views', 'are', 'because', 'this', 'film', 'can', 'hardly', 'be', 'taken', 'seriously', 'on', 'any', 'level', '.', 'as', 'for', 'the', 'claim', 'that', 'frontal', 'male', 'nu', '##dity', 'is', 'an', 'automatic', 'nc', '-', '17', ',', 'that', 'isn', "'", 't', 'true', '.', 'i', "'", 've', 'seen', 'r', '-', 'rated', 'films', 'with', 'male', 'nu', '##dity', '.', 'granted', ',', 'they', 'only', 'offer', 'some', 'fleeting', 'views', ',', 'but', 'where', 'are', 'the', 'r', '-', 'rated', 'films', 'with', 'gaping', 'vu', '##lva', '##s', 'and', 'flap', '##ping', 'lab', '##ia', '?', 'nowhere', ',', 'because', 'they', 'don', "'", 't', 'exist', '.', 'the', 'same', 'goes', 'for', 'those', 'crap', '##py', 'cable', 'shows', ':', 'sc', '##hl', '##ong', '##s', 'swinging', 'in', 'the', 'breeze', 'but', 'not', 'a', 'clit', '##oris', 'in', 'sight', '.', 'and', 'those', 'pre', '##ten', '##tious', 'indie', 'movies', 'like', 'the', 'brown', 'bunny', ',', 'in', 'which', 'we', "'", 're', 'treated', 'to', 'the', 'site', 'of', 'vincent', 'gallo', "'", 's', 'throbbing', 'johnson', ',', 'but', 'not', 'a', 'trace', 'of', 'pink', 'visible', 'on', 'chloe', 'se', '##vi', '##gny', '.', 'before', 'crying', '(', 'or', 'implying', ')', '"', 'double', '-', 'standard', '"', 'in', 'matters', 'of', 'nu', '##dity', ',', 'the', 'mentally', 'ob', '##tus', '##e', 'should', 'take', 'into', 'account', 'one', 'una', '##vo', '##ida', '##bly', 'obvious', 'anatomical', 'difference', 'between', 'men', 'and', 'women', ':', 'there', 'are', 'no', 'gen', '##ital', '##s', 'on', 'display', 'when', 'actresses', 'appears', 'nude', ',', 'and', 'the', 'same', 'cannot', 'be', 'said', 'for', 'a', 'man', '.', 'in', 'fact', ',', 'you', 'generally', 'won', "'", 't', 'see', 'female', 'gen', '##ital', '##s', 'in', 'an', 'american', 'film', 'in', 'anything', 'short', 'of', 'porn', 'or', 'explicit', 'erotic', '##a', '.', 'this', 'alleged', 'double', '-', 'standard', 'is', 'less', 'a', 'double', 'standard', 'than', 'an', 'admitted', '##ly', 'de', '##pressing', 'ability', 'to', 'come', 
'to', 'terms', 'culturally', 'with', 'the', 'insides', 'of', 'women', "'", 's', 'bodies', '.'] ['[CLS]', '"', 'i', 'am', 'curious', ':', 'yellow', '"', 'is', 'a', 'ri', '##sible', 'and', 'pre', '##ten', '##tious', 'steaming', 'pile', '.', 'it', 'doesn', "'", 't', 'matter', 'what', 'one', "'", 's', 'political', 'views', 'are', 'because', 'this', 'film', 'can', 'hardly', 'be', 'taken', 'seriously', 'on', 'any', 'level', '.', 'as', 'for', 'the', 'claim', 'that', 'frontal', 'male', 'nu', '##dity', 'is', 'an', 'automatic', 'nc', '-', '17', ',', 'that', 'isn', "'", 't', 'true', '.', 'i', "'", 've', 'seen', 'r', '-', 'rated', 'films', 'with', 'male', 'nu', '##dity', '.', 'granted', ',', 'they', 'only', 'offer', 'some', 'fleeting', 'views', ',', 'but', 'where', 'are', 'the', 'r', '-', 'rated', 'films', 'with', 'gaping', 'vu', '##lva', '##s', 'and', 'flap', '##ping', 'lab', '##ia', '?', 'nowhere', ',', 'because', 'they', 'don', "'", 't', 'exist', '.', 'the', 'same', 'goes', 'for', 'those', 'crap', '##py', 'cable', 'shows', ':', 'sc', '##hl', '##ong', '##s', 'swinging', 'in', 'the', 'breeze', 'but', 'not', 'a', 'clit', '##oris', 'in', 'sight', '.', 'and', 'those', 'pre', '##ten', '##tious', 'indie', 'movies', 'like', 'the', 'brown', 'bunny', ',', 'in', 'which', 'we', "'", 're', 'treated', 'to', 'the', 'site', 'of', 'vincent', 'gallo', "'", 's', 'throbbing', 'johnson', ',', 'but', 'not', 'a', 'trace', 'of', 'pink', 'visible', 'on', 'chloe', 'se', '##vi', '##gny', '.', 'before', 'crying', '(', 'or', 'implying', ')', '"', 'double', '-', 'standard', '"', 'in', 'matters', 'of', 'nu', '##dity', ',', 'the', 'mentally', 'ob', '##tus', '##e', 'should', 'take', 'into', 'account', 'one', 'una', '##vo', '##ida', '##bly', 'obvious', 'anatomical', 'difference', 'between', 'men', 'and', 'women', ':', 'there', 'are', 'no', 'gen', '##ital', '##s', 'on', 'display', 'when', 'actresses', 'appears', 'nude', ',', 'and', 'the', 'same', 'cannot', 'be', 'said', 'for', 'a', 'man', '.', 'in', 'fact', ',', 'you', 'generally', 'won', "'", 't', 'see', 'female', 'gen', '##ital', '##s', 'in', 'an', 'american', 'film', 'in', 'anything', 'short', 'of', 'porn', 'or', 'explicit', 'erotic', '##a', '.', 'this', 'alleged', 'double', '-', 'standard', 'is', 'less', 'a', 'double', 'standard', 'than', 'an', 'admitted', '##ly', 'de', '##pressing', 'ability', 'to', 'come', 'to', 'terms', 'culturally', 'with', 'the', 'insides', 'of', 'women', "'", 's', 'bodies', '.', '[SEP]']
🔢 Step 3: Convert Tokens to IDs¶
After breaking the text into tokens (or subword units), BERT still can’t work directly with text. It needs numeric IDs that correspond to each token’s position in its internal vocabulary.
These IDs are just unique integers representing each token in the model’s 30,000+ word vocabulary.
🧪 Example:¶
Let’s say our tokenized review looks like this:
['[CLS]', 'the', 'movie', 'was', 'surprisingly', 'good', '!', '[SEP]']
When we call:
token_ids = tok.convert_tokens_to_ids(tokens)
We might get something like:
[101, 1996, 3185, 2001, 6239, 2204, 999, 102]
Each number is the index of the token in the BERT vocabulary. For instance:
- '[CLS]' → 101
- 'the' → 1996
- 'movie' → 3185
These token IDs are what the BERT model actually sees — they’re passed as input to its embedding layer, which looks up a vector for each ID and starts building meaning from there.
💡 Important: This process ensures that text can be converted into tensors and used in neural networks efficiently.
print(text)
tokens = tok.tokenize(text)
print("length of tokens: ", len(tokens))
print(tokens)
tokens = tok.tokenize(text, add_special_tokens=True)
print("length of tokens: ", len(tokens))
print(tokens)
token_ids = tok.convert_tokens_to_ids(tokens)
print("length of token_ids: ", len(token_ids))
print(token_ids)
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies. length of tokens: 302 ['"', 'i', 'am', 'curious', ':', 'yellow', '"', 'is', 'a', 'ri', '##sible', 'and', 'pre', '##ten', '##tious', 'steaming', 'pile', '.', 'it', 'doesn', "'", 't', 'matter', 'what', 'one', "'", 's', 'political', 'views', 'are', 'because', 'this', 'film', 'can', 'hardly', 'be', 'taken', 'seriously', 'on', 'any', 'level', '.', 'as', 'for', 'the', 'claim', 'that', 'frontal', 'male', 'nu', '##dity', 'is', 'an', 'automatic', 'nc', '-', '17', ',', 'that', 'isn', "'", 't', 'true', '.', 'i', "'", 've', 'seen', 'r', '-', 'rated', 'films', 'with', 'male', 'nu', '##dity', '.', 'granted', ',', 'they', 'only', 'offer', 'some', 'fleeting', 'views', ',', 'but', 'where', 'are', 'the', 'r', '-', 'rated', 'films', 'with', 'gaping', 'vu', '##lva', '##s', 'and', 'flap', '##ping', 'lab', '##ia', '?', 'nowhere', ',', 'because', 'they', 'don', "'", 't', 'exist', '.', 'the', 'same', 'goes', 'for', 'those', 'crap', '##py', 'cable', 'shows', ':', 'sc', '##hl', '##ong', '##s', 'swinging', 'in', 'the', 'breeze', 'but', 'not', 'a', 'clit', '##oris', 'in', 'sight', '.', 'and', 'those', 'pre', '##ten', '##tious', 'indie', 'movies', 'like', 'the', 'brown', 'bunny', ',', 'in', 'which', 'we', "'", 're', 'treated', 'to', 'the', 'site', 'of', 'vincent', 'gallo', "'", 's', 'throbbing', 'johnson', ',', 'but', 'not', 'a', 'trace', 'of', 'pink', 'visible', 'on', 'chloe', 'se', '##vi', '##gny', '.', 'before', 'crying', '(', 'or', 'implying', ')', '"', 'double', '-', 'standard', '"', 'in', 'matters', 'of', 'nu', '##dity', ',', 'the', 'mentally', 'ob', '##tus', '##e', 'should', 'take', 'into', 'account', 'one', 'una', '##vo', '##ida', '##bly', 'obvious', 'anatomical', 'difference', 'between', 'men', 'and', 'women', ':', 'there', 'are', 'no', 'gen', '##ital', '##s', 'on', 'display', 'when', 'actresses', 'appears', 'nude', ',', 'and', 'the', 'same', 'cannot', 'be', 'said', 'for', 'a', 'man', '.', 'in', 'fact', ',', 'you', 'generally', 'won', "'", 't', 'see', 'female', 'gen', '##ital', '##s', 'in', 'an', 'american', 'film', 'in', 'anything', 'short', 'of', 'porn', 'or', 'explicit', 'erotic', '##a', '.', 'this', 'alleged', 'double', '-', 'standard', 'is', 'less', 'a', 'double', 'standard', 'than', 'an', 'admitted', '##ly', 'de', '##pressing', 
'ability', 'to', 'come', 'to', 'terms', 'culturally', 'with', 'the', 'insides', 'of', 'women', "'", 's', 'bodies', '.'] length of tokens: 304 ['[CLS]', '"', 'i', 'am', 'curious', ':', 'yellow', '"', 'is', 'a', 'ri', '##sible', 'and', 'pre', '##ten', '##tious', 'steaming', 'pile', '.', 'it', 'doesn', "'", 't', 'matter', 'what', 'one', "'", 's', 'political', 'views', 'are', 'because', 'this', 'film', 'can', 'hardly', 'be', 'taken', 'seriously', 'on', 'any', 'level', '.', 'as', 'for', 'the', 'claim', 'that', 'frontal', 'male', 'nu', '##dity', 'is', 'an', 'automatic', 'nc', '-', '17', ',', 'that', 'isn', "'", 't', 'true', '.', 'i', "'", 've', 'seen', 'r', '-', 'rated', 'films', 'with', 'male', 'nu', '##dity', '.', 'granted', ',', 'they', 'only', 'offer', 'some', 'fleeting', 'views', ',', 'but', 'where', 'are', 'the', 'r', '-', 'rated', 'films', 'with', 'gaping', 'vu', '##lva', '##s', 'and', 'flap', '##ping', 'lab', '##ia', '?', 'nowhere', ',', 'because', 'they', 'don', "'", 't', 'exist', '.', 'the', 'same', 'goes', 'for', 'those', 'crap', '##py', 'cable', 'shows', ':', 'sc', '##hl', '##ong', '##s', 'swinging', 'in', 'the', 'breeze', 'but', 'not', 'a', 'clit', '##oris', 'in', 'sight', '.', 'and', 'those', 'pre', '##ten', '##tious', 'indie', 'movies', 'like', 'the', 'brown', 'bunny', ',', 'in', 'which', 'we', "'", 're', 'treated', 'to', 'the', 'site', 'of', 'vincent', 'gallo', "'", 's', 'throbbing', 'johnson', ',', 'but', 'not', 'a', 'trace', 'of', 'pink', 'visible', 'on', 'chloe', 'se', '##vi', '##gny', '.', 'before', 'crying', '(', 'or', 'implying', ')', '"', 'double', '-', 'standard', '"', 'in', 'matters', 'of', 'nu', '##dity', ',', 'the', 'mentally', 'ob', '##tus', '##e', 'should', 'take', 'into', 'account', 'one', 'una', '##vo', '##ida', '##bly', 'obvious', 'anatomical', 'difference', 'between', 'men', 'and', 'women', ':', 'there', 'are', 'no', 'gen', '##ital', '##s', 'on', 'display', 'when', 'actresses', 'appears', 'nude', ',', 'and', 'the', 'same', 'cannot', 'be', 'said', 'for', 'a', 'man', '.', 'in', 'fact', ',', 'you', 'generally', 'won', "'", 't', 'see', 'female', 'gen', '##ital', '##s', 'in', 'an', 'american', 'film', 'in', 'anything', 'short', 'of', 'porn', 'or', 'explicit', 'erotic', '##a', '.', 'this', 'alleged', 'double', '-', 'standard', 'is', 'less', 'a', 'double', 'standard', 'than', 'an', 'admitted', '##ly', 'de', '##pressing', 'ability', 'to', 'come', 'to', 'terms', 'culturally', 'with', 'the', 'insides', 'of', 'women', "'", 's', 'bodies', '.', '[SEP]'] length of token_ids: 304 [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 
1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102]
🧱 Step 4: Padding and Truncation¶
Before we feed data into a BERT model, every sentence needs to have the same number of tokens. Since natural text varies in length, we apply:
- Truncation: if a sentence is too long, we cut it down to max_length tokens.
  → In this notebook, we use max_length = 128.
- Padding: if a sentence is too short, we extend it with special [PAD] tokens until it reaches max_length.
When we call the tokenizer with:
tok(text, padding="max_length", truncation=True, max_length=128)
we get back several useful things:
- input_ids: the tokenized sentence as integers (with padding/truncation).
- attention_mask: tells the model which tokens are real (1) and which are padding (0).
- token_type_ids: useful in tasks involving two sentences (e.g., next sentence prediction); all 0s here since we have only one sentence.
These ensure the inputs are well-formed and aligned into uniform matrices that the model can process efficiently.
# Tokenize with padding and truncation
tokens = tok(text, padding="max_length", truncation=True, max_length=max_length)
# Display input IDs (tokenized & padded)
print("🧩 Token IDs:", tokens["input_ids"])
print("📏 Length:", len(tokens["input_ids"]))
print("Attention Mask:", tokens["attention_mask"])
print("Token Type IDs:", tokens["token_type_ids"])
🧩 Token IDs: [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 102] 📏 Length: 128 Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Once you're confident this makes sense, we'll apply it to the whole dataset using the map() function. That is what the tokenize() function below does for every review. We then:
- Replace the "text" column with the tokenized input.
- Rename the "label" column to "labels" to match Hugging Face's expected format.
- Convert everything to PyTorch tensors (set_format("torch")), so the model receives PyTorch-friendly inputs.
Now the dataset is ready for BERT to digest.
def tokenize(batch):
return tok(batch["text"], padding="max_length", truncation=True, max_length=max_length)
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
dataset = dataset.rename_column("label", "labels")
dataset.set_format("torch")
dataset['train'][1]
{'labels': tensor(0), 'input_ids': tensor([ 101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
5.2 Build the BERT Model¶
We now load a pretrained BERT model and adapt it for our classification task.
AutoModelForSequenceClassification.from_pretrained(...)
This gives us a model that:
✅ Starts with pretrained BERT (in our case: bert-base-uncased
)
✅ Adds a brand-new classification head on top (a dense layer that outputs class logits)
✅ Is ready to be fine-tuned on our task (sentiment classification)
🧠 What’s Inside?¶
The lower layers of BERT:
- An embedding layer (token, position, and segment embeddings)
- 12 Transformer encoder blocks (attention + feedforward layers)
- These come with pretrained weights learned from huge corpora (like Wikipedia + BooksCorpus)
The upper layer we just added:
- A fresh, untrained classification head (linear layer)
- Maps the
[CLS]
token’s output intonum_labels
predictions (→ In our case: 2, for binary sentiment classification)
🔧 Why Use AutoModelForSequenceClassification
?¶
Hugging Face's transformers library automatically:
- Chooses the correct architecture for your model checkpoint (bert, roberta, distilbert, etc.)
- Adds a task-specific head (here, classification)
- Handles token types, attention masks, and outputs correctly
✅ We're now ready to train a BERT model that's already fluent in English — we just need to teach it sentiment-specific nuances!
model = AutoModelForSequenceClassification.from_pretrained(
model_ckpt,
num_labels = num_labels,
problem_type = "single_label_classification"
)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
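That warning is expected: the encoder is pretrained, but the classification head is brand new. If you're curious, a quick optional inspection of that head (assuming the model object loaded above):
print(model.classifier)                                # Linear(in_features=768, out_features=2, bias=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params/1e6:.1f}M parameters in total")      # roughly 110M, matching the table earlier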
5.3 Training (set up Trainer + Early Stopping)¶
🎯 Evaluation Metrics¶
To evaluate our model's performance, we define a compute_metrics()
function.
This function returns:
- Accuracy: How many predictions were correct?
- Precision: Of the predicted positives, how many were correct?
- Recall: Of the actual positives, how many did we catch?
These are standard metrics for binary classification, and we use the 🤗 evaluate
library to compute them.
metric_accuracy = evaluate.load("accuracy")
metric_precision = evaluate.load("precision")
metric_recall = evaluate.load("recall")
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
acc = metric_accuracy.compute(predictions=preds, references=labels)
prec = metric_precision.compute(predictions=preds, references=labels, average="binary")
rec = metric_recall.compute(predictions=preds, references=labels, average="binary")
# merge into a single dict that Trainer can log
return {
"accuracy": acc["accuracy"],
"precision": prec["precision"],
"recall": rec["recall"],
}
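If you also want F1 logged during training (the comparison section at the end otherwise derives it from precision and recall), one possible extension is sketched below; compute_metrics_with_f1 is a hypothetical name, and it assumes the 🤗 evaluate "f1" metric loads like the others:
metric_f1 = evaluate.load("f1")

def compute_metrics_with_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy":  metric_accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": metric_precision.compute(predictions=preds, references=labels, average="binary")["precision"],
        "recall":    metric_recall.compute(predictions=preds, references=labels, average="binary")["recall"],
        "f1":        metric_f1.compute(predictions=preds, references=labels, average="binary")["f1"],
    }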
⚙️ Configure TrainingArguments¶
We now define the training configuration using Hugging Face's TrainingArguments. TrainingArguments defines how your model is trained and evaluated.
Here’s a breakdown of key arguments and why we chose them:
- output_dir="./bert_cls" ➤ Where all checkpoints, logs, and config files are saved.
- eval_strategy="epoch" ➤ Evaluate the model after every full pass through the training set.
- save_strategy="epoch" ➤ Save a model checkpoint after each epoch (at most 2 kept on disk, see below).
- logging_strategy="steps" ➤ Log progress every logging_steps=50 training steps.
- learning_rate=2e-5 ➤ A standard learning rate for BERT fine-tuning. 🔁 You can adjust this (e.g., 5e-5, 3e-5); it's often the first thing to tune.
- per_device_train_batch_size=batch_size ➤ Batch size per GPU. 🔁 You can lower it if memory is tight (e.g., 32, 16).
- num_train_epochs=10 ➤ Max number of epochs (we'll stop early if there is no improvement, see the next cell).
- weight_decay=0.01 ➤ Regularization to avoid overfitting. Helps generalization.
- load_best_model_at_end=True ➤ After training, load the checkpoint with the lowest eval loss.
- metric_for_best_model="eval_loss" ➤ Tells the trainer which metric to use to select the "best" model.
- save_total_limit=2 ➤ Keep at most two checkpoints on disk to save space.
- fp16=True ➤ Use mixed precision training (if the GPU supports it); speeds things up and uses less memory.
- gradient_accumulation_steps=2 ➤ Simulates a larger batch size by updating weights every 2 steps.
🛠️ Tuning Tips:
- For small datasets: lower learning_rate and use fewer epochs (e.g., 3–5).
- For large datasets: increase num_train_epochs or use eval_strategy="steps".
args = TrainingArguments(
output_dir = "./bert_cls",
eval_strategy = "epoch",
save_strategy = "epoch",
logging_strategy = "steps",
logging_steps = 50,
learning_rate = 2e-5,
per_device_train_batch_size = batch_size,
per_device_eval_batch_size = batch_size,
num_train_epochs = 10,
weight_decay = 0.01,
load_best_model_at_end = True,
metric_for_best_model = "eval_loss",
save_total_limit = 2,
seed = seed,
report_to = "none", # Disable wandb
fp16 = True, # Enable mixed precision training
gradient_accumulation_steps = 2
)
🏋️ Set Up the Trainer¶
We now initialize the 🤗 Trainer
, which handles training, evaluation, and logging for us. The Trainer
class from Hugging Face wraps everything you need for training into one easy-to-use object.
It handles:
- Running training loops
- Evaluating your model
- Saving logs and checkpoints
- Applying callbacks like early stopping
Key components:
- model: our BERT model for sequence classification.
- args: the TrainingArguments we just defined.
- train_dataset: the IMDB training set, already tokenized and formatted.
- eval_dataset: the tokenized test set, shuffled; here it doubles as the validation set for early stopping (a stricter setup would hold out a separate split).
- compute_metrics: our custom function for accuracy, precision, and recall.
- EarlyStoppingCallback(early_stopping_patience=3):
  ➤ If the evaluation loss doesn't improve for 3 epochs, training stops early.
  ➤ Helps avoid overfitting and saves time.
🔁 You can reuse this Trainer setup for many fine-tuning tasks — just swap the model and dataset!
trainer = Trainer(
model = model,
args = args,
train_dataset = dataset["train"],
eval_dataset = dataset["test"].shuffle(seed=seed), # full test split, also used for early stopping
compute_metrics = compute_metrics,
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
5.4 Train the Model 🚀¶
Just one line of code!
train_output = trainer.train()
Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall |
---|---|---|---|---|---|
1 | 0.318100 | 0.288745 | 0.877800 | 0.850776 | 0.916320 |
2 | 0.227900 | 0.273267 | 0.887760 | 0.898209 | 0.874640 |
3 | 0.159500 | 0.303185 | 0.889200 | 0.878128 | 0.903840 |
4 | 0.093700 | 0.365608 | 0.882120 | 0.907238 | 0.851280 |
5 | 0.064300 | 0.425973 | 0.886440 | 0.869728 | 0.909040 |
5.5 Training Curves 📈¶
Let’s plot loss & accuracy logged in trainer.state.log_history
.
We’ll visualise train vs eval so you can spot over-fitting at a glance.
# Collect logs ---------------------------------------------------------------
logs = trainer.state.log_history
train_loss, eval_loss, eval_acc, steps_eval, steps_train = [], [], [], [], []
for e in logs:
if 'loss' in e and e.get("epoch") is not None:
train_loss.append(e['loss'])
steps_train.append(e['epoch'])
if 'eval_loss' in e:
eval_loss.append(e['eval_loss'])
eval_acc.append(e['eval_accuracy'])
steps_eval.append(e['epoch'])
# Plot -----------------------------------------------------------------------
plt.figure(figsize=(6,4))
plt.plot(steps_train, train_loss, label="train-loss")
plt.plot(steps_eval, eval_loss, label="eval-loss")
plt.xlabel("Epoch"); plt.ylabel("Loss"); plt.legend(); plt.title("BERT Loss Curves")
plt.show()
plt.figure(figsize=(6,4))
plt.plot(steps_eval, eval_acc, label="eval-accuracy")
plt.xlabel("Epoch"); plt.ylabel("Accuracy"); plt.legend(); plt.title("BERT Accuracy Curve")
plt.show()
5.6 Final Evaluation (on the Test Set)¶
We now evaluate the best checkpoint on the full test split. Note that in this notebook the test split also served as the evaluation set for early stopping, so these numbers match the best epoch above; a stricter protocol would carve a separate validation set out of the training data.
test_metrics = trainer.evaluate(dataset["test"])
print(test_metrics)
{'eval_loss': 0.27326661348342896, 'eval_accuracy': 0.88776, 'eval_precision': 0.8982090042720999, 'eval_recall': 0.87464, 'eval_runtime': 17.5668, 'eval_samples_per_second': 1423.138, 'eval_steps_per_second': 22.258, 'epoch': 5.0}
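Beyond aggregate metrics, it's worth sanity-checking the model on a fresh sentence. A hedged sketch (the review below is made up; trainer.model holds the best checkpoint because load_best_model_at_end=True):
sample = "A beautifully shot film with a clever script; I enjoyed every minute."
inputs = tok(sample, return_tensors="pt", truncation=True, max_length=max_length)
inputs = {k: v.to(trainer.model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = trainer.model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
print({"negative": round(probs[0].item(), 3), "positive": round(probs[1].item(), 3)})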
5.7 Compare with other models¶
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Calculate metrics for LogReg (TF-IDF)
logreg_accuracy = accuracy_score(y_test, y_pred_logreg)
logreg_precision = precision_score(y_test, y_pred_logreg, average='weighted')
logreg_recall = recall_score(y_test, y_pred_logreg, average='weighted')
logreg_f1 = f1_score(y_test, y_pred_logreg, average='weighted')
# Calculate metrics for LSTM
lstm_accuracy = accuracy_score(y_test_lstm, y_pred_lstm)
lstm_precision = precision_score(y_test_lstm, y_pred_lstm, average='weighted')
lstm_recall = recall_score(y_test_lstm, y_pred_lstm, average='weighted')
lstm_f1 = f1_score(y_test_lstm, y_pred_lstm, average='weighted')
# Extract BERT metrics from test_metrics (assuming this structure)
# If your test_metrics has a different structure, adjust accordingly
bert_accuracy = test_metrics.get('eval_accuracy', 0)
bert_precision = test_metrics.get('eval_precision', 0)
bert_recall = test_metrics.get('eval_recall', 0)
bert_f1 = test_metrics.get('eval_f1', 2 * bert_recall * bert_precision / (bert_recall + bert_precision))
# Create a DataFrame for easy plotting
models = ['TF-IDF+LogReg', 'LSTM', 'BERT']
metrics_df = pd.DataFrame({
'Accuracy': [logreg_accuracy, lstm_accuracy, bert_accuracy],
'Precision': [logreg_precision, lstm_precision, bert_precision],
'Recall': [logreg_recall, lstm_recall, bert_recall],
'F1-Score': [logreg_f1, lstm_f1, bert_f1]
}, index=models)
fig, axes = plt.subplots(1, 4, figsize=(16, 6), sharey=True)
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
# Colors for each model
colors = ['#3498db', '#2ecc71', '#e74c3c'] # Blue, Green, Red
# Plot each metric in a separate subplot
for i, metric in enumerate(metrics):
ax = axes[i]
bars = ax.bar(models, metrics_df[metric], color=colors, alpha=0.8)
# Add value labels on top of bars
for bar in bars:
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
f'{height:.3f}', ha='center', va='bottom', fontsize=9)
# Set title and customize
ax.set_title(metric, fontsize=14)
ax.set_ylim(0, 1.05)
ax.grid(axis='y', alpha=0.3)
# Only show y-axis label on the first subplot
if i == 0:
ax.set_ylabel('Score', fontsize=12)
# Rotate x-axis labels to avoid overlap
plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
# Add a common legend
fig.legend(
[plt.Rectangle((0,0),1,1, color=color) for color in colors],
models,
loc='upper center',
bbox_to_anchor=(0.5, 0.05),
ncol=3,
fontsize=12
)
plt.suptitle('Performance Comparison of Text Classification Models', fontsize=16)
plt.tight_layout(rect=[0, 0.1, 1, 0.95]) # Adjust layout to accommodate the legend
plt.show()
📊 Final Comparison: TF-IDF + Logistic Regression vs. LSTM vs. BERT¶
We've now compared three models for sentiment classification:
TF-IDF + Logistic Regression
- Strong baseline.
- Surprisingly competitive.
- Fastest to train, interpretable, but relies on sparse word frequency features.
LSTM (Recurrent Neural Network)
- Captures sequential patterns in text.
- Slower to train, less accurate.
- Struggles with long dependencies compared to attention-based models.
BERT (Transformer-based model)
- Best performance out-of-the-box.
- Learns deep contextual embeddings from pre-trained Transformers.
- No custom tuning yet, still outperforms other methods.
🧠 Why is BERT better?¶
- It leverages pre-trained knowledge from vast corpora (Wikipedia, BooksCorpus).
- Even without adjusting any hyperparameters, it achieves 88.8% accuracy, beating TF-IDF+LogReg and LSTM.
- It benefits from contextual understanding of words, not just frequency or sequence.
🚀 Next Steps to Improve BERT¶
- Swap the backbone: Try roberta-base, deberta-v3-small, or even ModernBERT.
- Domain adaptation: Resume pretraining or fine-tuning on movie reviews specifically.
- Error analysis: Look at false positives/negatives to uncover patterns and improve preprocessing.
- Architecture tweaking: Add dense layers, dropout, or residuals above BERT to improve generalization.
🧾 Model Comparison Summary¶
Model | Accuracy | Precision | Recall | F1-Score | Train Time | Notes |
---|---|---|---|---|---|---|
TF-IDF + LogReg | 0.880 | 0.880 | 0.880 | 0.880 | ~few seconds | Fastest, no deep learning needed |
LSTM | 0.779 | 0.779 | 0.779 | 0.779 | ~few minutes | Worse results, slower training |
BERT (base uncased) | 0.888 | 0.898 | 0.875 | 0.886 | ~few minutes - needs GPU | Best performance, no tuning applied |
🔎 Final Word¶
To go further, we should perform error analysis (a starter sketch follows at the very end):
- Which inputs were misclassified?
- Are they noisy? Ambiguous? Long or rare?
This will guide whether we need:
- Better preprocessing?
- A stronger model?
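A starter sketch for that error analysis (assuming trainer and dataset from Section 5 are still in memory; the raw text is reloaded because the "text" column was dropped during tokenization):
pred_out = trainer.predict(dataset["test"])            # logits + labels for the tokenized test split
preds = np.argmax(pred_out.predictions, axis=-1)
labels = pred_out.label_ids
wrong = np.where(preds != labels)[0]
print(f"{len(wrong)} misclassified reviews out of {len(labels)}")

raw_test = load_dataset("imdb", split="test")          # assumed to follow the same order as dataset["test"] built above
for i in wrong[:3]:                                     # peek at a few failures
    print(f"\n--- true={labels[i]} pred={preds[i]} ---")
    print(raw_test[int(i)]["text"][:300], "...")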