Advanced Methods in Natural Language Processing - Session 8
Table of Contents
- Part 0: Metrics Functions to Consider
- Part 1: Fine-tuning BERT models
  - 1.1. Loading and Exploring Data
  - 1.2. Loading a BERT from HF hub
  - 1.3. Training BERT
  - 1.4. Learning Curve - Model Evaluation
- Part 2: Few-Shot Learning
  - 2.1. Loading SetFit
  - 2.2. Training SetFit with 32 examples
  - 2.3. Try to augment the data with prompts
  - 2.4. Model Evaluation
- Part 3: Biases in BERT models - Winogender schemas
  - 3.1. Gender Biases
  - 3.2. Understanding of the agent
Part 0: Metrics Functions to Consider
Before diving into model building and training, it's crucial to establish the metrics we'll use to evaluate our models. In this part, we define and discuss the metric functions commonly used in NLP tasks, particularly for text classification:
- Accuracy: the proportion of correct predictions among all cases examined. It's a straightforward metric but can be misleading when the classes are imbalanced.
- Precision and Recall: precision is the proportion of positive identifications that were actually correct, while recall is the proportion of actual positives that were identified correctly. These metrics are especially important when dealing with imbalanced datasets.
- F1 Score: the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall). It's a good way to show that a classifier has a good balance between precision and recall.
- Confusion Matrix: a table that describes the performance of a classification model on test data for which the true values are known. It allows the performance of an algorithm to be visualized class by class.
- ROC and AUC: the receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier. The area under the curve (AUC) is a measure of separability between the classes.
We will implement these metric functions using scikit-learn, and they will be used to assess and compare the performance of our different models throughout this exercise.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
class Metrics:
    def __init__(self):
        self.results = {}

    def run(self, y_true, y_pred, method_name, average='macro'):
        # Calculate metrics
        accuracy = accuracy_score(y_true, y_pred)
        precision = precision_score(y_true, y_pred, average=average)
        recall = recall_score(y_true, y_pred, average=average)
        f1 = f1_score(y_true, y_pred, average=average)
        # Store results
        self.results[method_name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1,
        }

    def plot(self):
        # Create subplots
        fig, axs = plt.subplots(2, 2, figsize=(15, 10))
        # Plot each metric
        for i, metric in enumerate(['accuracy', 'precision', 'recall', 'f1']):
            ax = axs[i // 2, i % 2]
            values = [res[metric] * 100 for res in self.results.values()]
            ax.bar(self.results.keys(), values)
            ax.set_title(metric)
            ax.set_ylim(0, 100)
            # Add values on the bars
            for j, v in enumerate(values):
                ax.text(j, v + 0.02, f"{v:.2f}", ha='center', va='bottom')
        plt.tight_layout()
        plt.show()
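The class above covers the four scalar metrics. The confusion matrix and ROC/AUC discussed earlier are not part of the class, but scikit-learn provides them directly; here is a minimal sketch on hypothetical toy labels (the demo arrays are purely illustrative, not from our dataset):
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
# Hypothetical binary labels and scores, just to illustrate the API
y_true_demo = [0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 1]
y_score_demo = [0.2, 0.9, 0.4, 0.1, 0.8, 0.7]  # predicted probability of class 1
cm = confusion_matrix(y_true_demo, y_pred_demo)
ConfusionMatrixDisplay(cm).plot()
plt.show()
print("ROC AUC:", roc_auc_score(y_true_demo, y_score_demo))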
Part 1: Fine-tuning BERT models
In this part, we will create a baseline model for text classification with BERT. This involves:
1. Loading and Exploring Data
We will load the AG News corpus and explore its structure and class distribution before modeling.
from datasets import load_dataset
# Load the 'ag_news' dataset
dataset = load_dataset("ag_news")
# Explore the structure of the dataset
print(dataset)
Let's create stratified samples for the training and validation sets, ensuring that each class is represented in proportion to its frequency. Everything will run faster on a sample, and we can iterate on the validation set before working on the test set.
from sklearn.model_selection import train_test_split
data = dataset['train']['text']
labels = dataset['train']['label']
test_data = dataset['test']['text']
test_labels = dataset['test']['label']
# Stratified split to create a smaller training and validation set
train_data, valid_data, train_labels, valid_labels = train_test_split(
data, labels, stratify=labels, test_size=0.2, random_state=42
)
# Further split to get 10k and 2k samples respectively
train_data, _, train_labels, _ = train_test_split(
train_data, train_labels, stratify=train_labels, train_size=10000, random_state=42
)
valid_data, _, valid_labels, _ = train_test_split(
valid_data, valid_labels, stratify=valid_labels, train_size=2000, random_state=42
)
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import defaultdict
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
labels = {0: 'World', 1: 'Sports',
2: 'Business', 3: 'Sci/Tech'}
# Prepare data for wordclouds
label_data = defaultdict(str)
for text, label in zip(train_data, train_labels):
    label_data[label] += text
# Generate and plot wordclouds for each label
fig, axs = plt.subplots(2, 2, figsize=(10, 6)) # Create 2x2 subplots
axs = axs.flatten() # Flatten the axis array
for ax, (label, text) in zip(axs, label_data.items()):
    wordcloud = WordCloud(stopwords=stop_words, background_color='white').generate(text)
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.set_title('WordCloud for Label {}'.format(labels.get(label)))
    ax.axis('off')
plt.tight_layout()
plt.show()
from collections import Counter
import matplotlib.pyplot as plt
# Count the frequency of each label
label_counts = Counter(train_labels)
# Data to plot
_labels = [labels.get(lab) for lab in label_counts.keys()]
sizes = label_counts.values()
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
# Plotting the pie chart
plt.pie(sizes, labels=_labels, colors=colors, autopct='%1.1f%%', startangle=140)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Proportion of Each Label')
plt.show()
Let's establish a baseline with TF-IDF + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Create a pipeline with TF-IDF and Logistic Regression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2),
                              min_df=5,
                              stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear')),
])
# Fit the pipeline on the training data
pipeline.fit(train_data, train_labels)
valid_preds = pipeline.predict(valid_data)
metrics_val = Metrics()
metrics_val.run(valid_labels, valid_preds, "basic TF-IDF")
metrics_val.plot()
2. Import BERT components
The first step is to load the tokenizer and the pre-trained model from the Hugging Face Hub.
from transformers import AutoTokenizer, TFAutoModel
checkpoint = "distilbert-base-uncased"  # let's go faster!
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModel.from_pretrained(checkpoint)
batch_size=64
max_length=64
rate = 0.5
num_labels = len(np.unique(valid_labels))
tokenizer.batch_encode_plus(['I am a BSE student'], add_special_tokens=True, max_length=max_length,
padding='max_length', return_attention_mask=True,
return_token_type_ids=True, truncation=True,
return_tensors="np")
tokenizer.batch_encode_plus(['I am a [MASK] student'], add_special_tokens=True, max_length=max_length,
padding='max_length', return_attention_mask=True,
return_token_type_ids=True, truncation=True,
return_tensors="np")
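To see the padding at work, we can inspect the attention mask directly: 1s mark real tokens and 0s mark padding. A quick check (the enc_demo variable is only for this illustration):
enc_demo = tokenizer.batch_encode_plus(['I am a BSE student'], add_special_tokens=True,
                                       max_length=max_length, padding='max_length',
                                       return_attention_mask=True, truncation=True,
                                       return_tensors="np")
print(enc_demo['input_ids'][0][:10])       # token ids, right-padded with 0s
print(enc_demo['attention_mask'][0][:10])  # 1 = real token, 0 = padding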
3. Implement the layer on top of BERT
We need to add one layer on top of the pre-trained BERT model to leverage its knowledge. We will create one model where the BERT weights are frozen and another where the whole network is trained.
BERT needs two types of inputs:
- input_ids: the ids of the tokens once the sentence is tokenized
- input_masks_in: indicates whether the model should consider each token or not
Indeed, as we pad every sentence to a length of 64 tokens, shorter sentences contain fewer real tokens, so the mask holds 0s to tell the model to ignore the padding positions.
The architecture is:
- Inputs ids and Inputs Masks
- Embedding layer --> the BERT model to process the inputs
- A layer on top of the embedding layer to convert the final [CLS] token representation into class probabilities.
import tensorflow as tf
## Input
input_ids_in = tf.keras.layers.Input(shape=(max_length,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(max_length,), name='masked_token', dtype='int32')
# Embedding layers
# we need only the first token representation nothing else !
embedding_layer = model(input_ids_in, attention_mask=input_masks_in)[0][:,0,:]
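The cell above only extracts the [CLS] representation; we still need the classification head and the Model object so that bert_model exists when we compile it below. A minimal sketch mirroring the frozen variant built later in this notebook, reusing rate and num_labels defined above (the head structure is the assumption here):
# Let's add some dropout to reduce overfitting
output_layer = tf.keras.layers.Dropout(rate)(embedding_layer)
# One dense layer to turn the [CLS] representation into class probabilities
output = tf.keras.layers.Dense(num_labels, activation='softmax')(output_layer)
bert_model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs=output)
bert_model.summary()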
Let's generate some batches. To feed the model without rebuilding arrays by hand, we use a generator that shuffles the data and yields batches indefinitely.
from sklearn.utils import shuffle
def get_batches(X_train, y_train, tokenizer, batch_size, max_length):
    """
    Objective: Create a generator that yields batches of tokenized text and corresponding labels.
    The data is shuffled and looped through indefinitely.
    Inputs:
    - X_train (np.array): Array of text data (features).
    - y_train (np.array): Array of labels.
    - tokenizer (DistilBertTokenizer): Tokenizer for text data.
    - batch_size (int): Size of each batch.
    - max_length (int): Maximum length of tokenized sequences.
    Outputs:
    - Generator yielding batches of (inputs, targets).
    """
    # Pre-tokenize the entire dataset
    inputs = tokenizer.batch_encode_plus(list(X_train), add_special_tokens=True, max_length=max_length,
                                         padding='max_length', return_attention_mask=True,
                                         return_token_type_ids=True, truncation=True,
                                         return_tensors="np")
    input_ids = np.asarray(inputs['input_ids'], dtype='int32')
    attention_masks = np.asarray(inputs['attention_mask'], dtype='int32')
    # Shuffle and yield batches indefinitely
    while True:
        X_train, y_train, input_ids, attention_masks = shuffle(X_train, y_train, input_ids, attention_masks, random_state=11)
        for i in range(0, len(X_train), batch_size):
            yield [input_ids[i:i + batch_size], attention_masks[i:i + batch_size]], y_train[i:i + batch_size]
We need to one-hot encode y_train so that the targets match the four-class softmax output of our architecture.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
y_train = enc.fit_transform(np.array(train_labels).reshape(-1, 1))
from tensorflow.keras.optimizers import Adam
X_train = np.array(train_data)
steps_per_epoch = int(len(X_train) / batch_size)
batches = get_batches(X_train, y_train, tokenizer, batch_size, max_length)
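Before training, a quick sanity check of the generator; peeking one batch is harmless since it loops indefinitely (batch_inputs and batch_targets are illustrative names):
batch_inputs, batch_targets = next(batches)
print(batch_inputs[0].shape, batch_inputs[1].shape, batch_targets.shape)
# expected: (64, 64) (64, 64) (64, 4) given batch_size=64, max_length=64, 4 classes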
# Compile the model as for the RNN
bert_model.compile(
    optimizer=Adam(learning_rate=3e-5),  # assumption: a typical BERT fine-tuning learning rate
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
# Fit the model
bert_model.fit(batches, steps_per_epoch=steps_per_epoch, epochs=2)  # assumption: 2 epochs
Let's now take a look at the results.
inputs = tokenizer.batch_encode_plus(list(valid_data), add_special_tokens=True, max_length=max_length,
                                     padding='max_length', return_attention_mask=True,
                                     return_token_type_ids=True, truncation=True,
                                     return_tensors="np")
inputs_valid = [np.asarray(inputs['input_ids'], dtype='int32'),
                np.asarray(inputs['attention_mask'], dtype='int32')]
valid_preds = bert_model.predict(inputs_valid)
valid_preds = np.argmax(valid_preds, axis=1)  # from probabilities to class ids
metrics_val.run(valid_labels, valid_preds, "distilBERT + softmax")
metrics_val.plot()
Let's now freeze the BERT weights to see the difference and gauge how much knowledge the pre-trained representations already carry.
import tensorflow as tf
## Input
# Reload fresh pre-trained weights: the previous instance was fine-tuned above
model = TFAutoModel.from_pretrained(checkpoint)
input_ids_in = tf.keras.layers.Input(shape=(max_length,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(max_length,), name='masked_token', dtype='int32')
# Embedding layers
# we need only the first token representation, nothing else!
embedding_layer = model(input_ids_in, attention_mask=input_masks_in)[0][:, 0, :]
# Let's add some dropout to reduce overfitting
output_layer = tf.keras.layers.Dropout(rate)(embedding_layer)
# One dense layer to process the last layer
output = tf.keras.layers.Dense(num_labels, activation='softmax')(output_layer)
bert_model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = output)
bert_model.layers[2].trainable = False
bert_model.summary()
batches = get_batches(X_train, y_train, tokenizer, batch_size, max_length)
# Compile the model with a gradient-descent optimizer and metrics
bert_model.compile(
    optimizer=Adam(learning_rate=1e-3),  # assumption: a larger rate, since only the head is trained
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
# Fit the model
bert_model.fit(batches, steps_per_epoch=steps_per_epoch, epochs=2)  # assumption: 2 epochs
valid_preds = bert_model.predict(inputs_valid)
valid_preds = np.argmax(valid_preds, axis=1)
metrics_val.run(valid_labels, valid_preds, "distilBERT frozen + Softmax")
metrics_val.plot()
Part 2: Few-Shot Learning
In this part, we'll explore few-shot learning with SetFit. We will train the model with a small number of examples and try to augment the dataset with prompts!
Loading SetFit
Let's begin by setting up the environment.
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset
Training SetFit with 32 examples
from datasets import Dataset
# Convert data into Dataset object from datasets
train_dict = {'text': train_data, 'label': train_labels}
train_dataset = Dataset.from_dict(train_dict)
dataset_dict = {'train': train_dataset}
# let's sample 32 examples at first to see results (8 per class x 4 classes)
train_dataset = sample_dataset(
    dataset_dict['train'], label_column="label", num_samples=8
)
train_dataset
# Model to load (assumption: a common sentence-transformers backbone for SetFit)
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2"
)
# Arguments / hyperparameters to train (assumption: small defaults)
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,
)
# Trainer class to train afterwards
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # the 32-example sample (the augmented set comes later)
    metric="accuracy",
)
# Train and evaluate
trainer.train()
valid_preds = model.predict(valid_data)
metrics_val.run(valid_labels, valid_preds, "SetFit 32 examples")
metrics_val.plot()
Try to augment the data with prompts
labels = {0: 'World', 1: 'Sports',
2: 'Business', 3: 'Sci/Tech'}
labels_to_id = {value:key for key, value in labels.items()}
from transformers import pipeline
# Load zero-shot classification pipeline (assumption: the standard MNLI checkpoint)
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
# Define the sequence to classify (a hypothetical example)
texts = ['Wall Street rallied after the quarterly earnings reports.']
# Define the candidate labels
candidate_labels = list(labels.values())
# Perform zero-shot classification
results = classifier(texts, candidate_labels)
# select texts (assumption: pseudo-label a slice of the unlabeled training pool)
texts = train_data[:1000]
# apply classifier
results = classifier(texts, candidate_labels)
new_texts = []
new_labels = []
th = 0.6  # keep only pseudo-labels whose top score is confident enough
for text, result in zip(texts, results):
    if max(result.get('scores')) < th:
        continue
    new_texts.append(text)
    new_labels.append(labels_to_id.get(result.get('labels')[0]))
np.unique(new_labels, return_counts=True)
train_data_augmented = train_dataset['text'] + new_texts
train_labels_augmented = train_dataset['label'] + new_labels
train_dict_augmented = {'text': train_data_augmented, 'label': train_labels_augmented}
train_dataset_augmented = Dataset.from_dict(train_dict_augmented)
train_dataset_augmented
# Model to load (assumption: same backbone as above)
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2"
)
# Arguments / hyperparameters to train (assumption: same small defaults)
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,
)
# Trainer class to train afterwards
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset_augmented,
    metric="accuracy",
)
# Train and evaluate
trainer.train()
Model Evaluation
# Make some predictions
valid_preds = model.predict(valid_data)
metrics_val.run(valid_labels, valid_preds, "SetFit augmented")
metrics_val.plot()
Part 3: Winogender Schemas
In this part, we will explore biases in BERT models. We will begin with the Winograd schemas adapted to gender: the Winogender schemas, a dataset from Rudinger et al. (2018).
From Wikipedia:
The Winograd schema challenge (WSC) is a test of machine intelligence proposed by Hector Levesque, a computer scientist at the University of Toronto. Designed to be an improvement on the Turing test, it is a multiple-choice test that employs questions of a very specific structure: they are instances of what are called Winograd schemas. Questions of this form may be tailored to require knowledge and commonsense reasoning in a variety of domains.
A Winograd schema challenge question consists of three parts:
- A sentence or brief discourse that contains the following:
- Two noun phrases of the same semantic class (male, female, inanimate or group of objects or people),
- An ambiguous pronoun that may refer to either of the above noun phrases, and
- A special word and alternate word, such that if the special word is replaced with the alternate word, the natural resolution of the pronoun changes.
- A question asking the identity of the ambiguous pronoun, and
- Two answer choices corresponding to the noun phrases in question.
A machine is given the problem in a standardized form that includes the answer choices, making it a binary decision problem. The classic example: "The city councilmen refused the demonstrators a permit because they feared violence." Replacing "feared" with "advocated" flips the natural referent of "they" from the councilmen to the demonstrators.
import pandas as pd
url = 'https://raw.githubusercontent.com/rudinger/winogender-schemas/master/data/templates.tsv'
df = pd.read_csv(url, sep='\t')
df.loc[:, 'whole_sentence'] = df.apply(
    lambda row: row['sentence'].replace('$OCCUPATION', row['occupation(0)'])
                               .replace('$PARTICIPANT', row['other-participant(1)']),
    axis=1)
df.head()
Gender bias identification
In this part, we will look at the ungendered occupation/participant templates and see how the models distribute gendered pronouns.
We will simply use the fill-mask pipeline from Hugging Face, which is really efficient, and run it on several different models in order to see the differences between them.
BERT-large
import re
checkpoint="bert-large-uncased"
classifier = pipeline('fill-mask', model=checkpoint)
regex = r'\$(\w+)'
df.loc[:, 'pronoun'] = df.loc[:, 'whole_sentence'].apply(lambda x: re.findall(regex, x)[0])
words_to_replace = [r'\$' + x for x in df.loc[:, 'pronoun'].unique()]
regex = r'(?:{})'.format('|'.join(words_to_replace))
df.loc[:, 'sentence_mask'] = df.loc[:, 'whole_sentence'].apply(lambda x: re.sub(regex, '[MASK]', x))
df.head()
pronouns = {'ACC_PRONOUN': ['her', 'him'],
            'NOM_PRONOUN': ['she', 'he'],
            'POSS_PRONOUN': ['her', 'his']}
res = {}
for key, value in pronouns.items():
    res[key] = []
    texts = list(df.loc[df.loc[:, 'pronoun'] == key, 'sentence_mask'].values)
    res[key] += classifier(texts, targets=value, top_k=2)
probas = {}
for key, value in pronouns.items():
    # Normalize over the two candidates: P(female pronoun) / (P(female) + P(male))
    probas[key] = [x[0]['score'] / (x[0]['score'] + x[1]['score'])
                   if x[0]['token_str'] == value[0]
                   else x[1]['score'] / (x[0]['score'] + x[1]['score'])
                   for x in res[key]]
for key, value in pronouns.items():
    df.loc[df.loc[:, 'pronoun'] == key, 'BERT-large-female'] = probas[key]
BERT
checkpoint="bert-base-uncased"
classifier = pipeline('fill-mask', model=checkpoint)
res = {}
probas = {}
for key, value in pronouns.items():
    res[key] = []
    texts = list(df.loc[df.loc[:, 'pronoun'] == key, 'sentence_mask'].values)
    res[key] += classifier(texts, targets=value, top_k=2)
    probas[key] = [x[0]['score'] / (x[0]['score'] + x[1]['score'])
                   if x[0]['token_str'] == value[0]
                   else x[1]['score'] / (x[0]['score'] + x[1]['score'])
                   for x in res[key]]
    df.loc[df.loc[:, 'pronoun'] == key, 'BERT-female'] = probas[key]
distilBERT
checkpoint="distilbert-base-uncased"
classifier = pipeline('fill-mask', model=checkpoint)
res = {}
probas = {}
for key, value in pronouns.items():
    res[key] = []
    texts = list(df.loc[df.loc[:, 'pronoun'] == key, 'sentence_mask'].values)
    res[key] += classifier(texts, targets=value, top_k=2)
    probas[key] = [x[0]['score'] / (x[0]['score'] + x[1]['score'])
                   if x[0]['token_str'] == value[0]
                   else x[1]['score'] / (x[0]['score'] + x[1]['score'])
                   for x in res[key]]
    df.loc[df.loc[:, 'pronoun'] == key, 'distilBERT-female'] = probas[key]
Visualize
probas = {}
for occ, part, answer, l, b, d in df.loc[:, ['occupation(0)', 'other-participant(1)', 'answer',
                                             'BERT-large-female', 'BERT-female', 'distilBERT-female']].values:
    # 'answer' indicates whether the pronoun refers to the occupation (0) or the participant (1)
    p = occ if answer == 0 else part
    if p in probas:
        probas[p].append([l, b, d])
    else:
        probas[p] = [[l, b, d]]
# Average each model's female-pronoun probability per referent
for key, value in probas.items():
    probas[key] = np.mean(value, axis=0)
probas = sorted(probas.items(), key=lambda x: x[1][0], reverse=True)
fig, ax = plt.subplots(figsize=(30, 8))
ax.scatter(np.arange(len(probas)), [x[1][0] for x in probas], label='BERT-large')
ax.scatter(np.arange(len(probas)), [x[1][1] for x in probas], label='BERT')
ax.scatter(np.arange(len(probas)), [x[1][2] for x in probas], label='distilBERT')
ax.hlines(0.5, 0, len(probas), colors='g')
ax.set_xticks(np.arange(len(probas)))
ax.set_xticklabels([x[0] for x in probas], rotation=90)
ax.set_xlabel("occupations")
ax.set_ylabel("probability of a female pronoun")
ax.legend()
plt.show()