📝 Prompt Engineering for Toxic Comment Classification¶
In this notebook, we explore advanced prompt engineering techniques for text classification using LiteLLM (with OpenAI backends). We’ll work on the toxic comments dataset, building prompt-based classifiers that don’t require model fine-tuning.
📚 Dataset¶
- Toxic Comments Dataset
This dataset consists of user-generated comments labeled as either toxic or non-toxic.
In this session, we'll sample:
- 300 examples for demonstration (150 positive, 150 negative),
- 100 examples as a dev set to help us tune the prompts.
🎯 Prompt Engineering Techniques¶
We’ll experiment with several prompt engineering strategies:
Technique | Description |
---|---|
1️⃣ Prompt Only | Use a single, well-crafted prompt to classify the text. |
2️⃣ Few-Shot Learning | Add a few labeled examples in the prompt to guide the model. |
3️⃣ Chain-of-Thought | Encourage step-by-step reasoning in the prompt for better classification. |
4️⃣ Automatic Prompt Engineering | Iteratively refine prompts based on output failures, inspired by the APE framework (Zhou et al., 2022). |
🧪 Evaluation Metrics¶
For each technique, we’ll measure:
- ✅ Accuracy (dev set),
- ✅ Qualitative insights (example outputs),
- ✅ Improvements through prompt iteration.
🔍 Learning Objectives¶
- Understand how to craft and evaluate different prompt-based approaches.
- Learn how few-shot examples and chain-of-thought reasoning can boost classification performance.
- Explore automatic prompt optimization to maximize accuracy without fine-tuning.
Let’s get started by loading and sampling the dataset!
from datasets import load_dataset
import pandas as pd
import numpy as np
# Load dataset
dataset = load_dataset("AiresPucrs/toxic-comments", split="train")
# Convert to pandas
df = dataset.to_pandas()
# Show dataset size and column names
print("Dataset size:", len(df))
print("Columns:", df.columns)
# Explore label distribution
print(df["toxic"].value_counts())
# Sample 150 toxic (toxic=1) and 150 non-toxic (toxic=0) for main demo set
df_toxic = df[df["toxic"] == 1].sample(150, random_state=42)
df_non_toxic = df[df["toxic"] == 0].sample(150, random_state=42)
df_demo = pd.concat([df_toxic, df_non_toxic]).sample(frac=1, random_state=42).reset_index(drop=True)
# Sample 100 examples as dev set (balanced)
df_toxic_dev = df[df["toxic"] == 1].drop(df_toxic.index).sample(50, random_state=42)
df_non_toxic_dev = df[df["toxic"] == 0].drop(df_non_toxic.index).sample(50, random_state=42)
df_dev = pd.concat([df_toxic_dev, df_non_toxic_dev]).sample(frac=1, random_state=42).reset_index(drop=True)
# Show basic stats
print("\nDemo set label distribution:\n", df_demo["toxic"].value_counts())
print("Dev set label distribution:\n", df_dev["toxic"].value_counts())
# Quick sample preview
df_demo.sample(5)
Dataset size: 70157
Columns: Index(['comment_text', 'toxic'], dtype='object')
toxic
0    35080
1    35077
Name: count, dtype: int64

Demo set label distribution:
 toxic
0    150
1    150
Name: count, dtype: int64
Dev set label distribution:
 toxic
0    50
1    50
Name: count, dtype: int64
 | comment_text | toxic |
---|---|---|
92 | hello please review discussion page talk shotg... | 1 |
15 | come sucka punch fucking face family son bitch... | 0 |
294 | course could try recreate scratch good sourcin... | 1 |
153 | another sockpuppet zay zay loose time editing ... | 0 |
17 | mean sprited dumb asses hope get guys name not... | 0 |
🤖 Introduction to LiteLLM¶
For this session, we’re using LiteLLM, a powerful unified API for calling multiple LLM (Large Language Model) providers — including OpenAI, Mistral AI, Anthropic, Cohere, and more.
🚀 Why LiteLLM?¶
Traditionally, if you want to:
✅ Use different providers (OpenAI, Mistral AI, etc.),
✅ Test and compare them,
✅ Switch between model versions easily,
…you would need to write different API calls for each one — which can be time-consuming and error-prone.
🎯 How LiteLLM Solves This¶
With LiteLLM:
- 🪄 One unified interface for calling many providers, e.g. `response = completion(model="gpt-4o-mini", messages=messages)`.
- 🧩 Seamless switching: you can change `"gpt-4o-mini"` to `"claude-3-opus"` or `"azure-gpt-4"` with no other code changes (see the minimal sketch after this list).
- 🏃 Built-in rate limiting, retries, and error handling.
- 🔬 Transparent logging and tracing for reproducibility.
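To make this concrete, here is a minimal sketch of a single LiteLLM call, assuming an `OPENAI_API_KEY` is set in the environment; the message content is illustrative only.

```python
import os
from litellm import completion

# Assumes the provider key is already exported, e.g. OPENAI_API_KEY for OpenAI models.
assert "OPENAI_API_KEY" in os.environ, "Set your OpenAI API key first"

messages = [
    {"role": "system", "content": "You are an expert in detecting toxic content."},
    {"role": "user", "content": "Is the comment 'have a nice day' toxic? Answer yes or no."},
]

# The same call shape works for other providers by swapping the model string
# (e.g. a Mistral AI or Anthropic model), provided the matching API key is configured.
response = completion(model="gpt-4o-mini", messages=messages, temperature=0.2)
print(response["choices"][0]["message"]["content"])
```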
🌍 Why This Matters for Prompt Engineering¶
Prompt engineering is provider-agnostic — the techniques we’ll explore (like few-shot learning or chain-of-thought) work across LLMs.
Using LiteLLM lets us:
- ✅ Focus on prompt crafting, not on API differences.
- ✅ Quickly compare LLM behavior and performance.
- ✅ Prototype faster — perfect for rapid iteration!
🤖 LLMClient – Wrapping LLM-based Toxicity Classification¶
We'll use the `LLMClient` class to interface with LiteLLM and evaluate different prompt engineering strategies. Let's break down what this class does:
🧩 Key Components¶
✅ Initialization (`__init__`)
- Sets the prompt template to control the LLM's instructions,
- Chooses the model (here, `gpt-4o-mini` by default) and temperature (for creativity vs. reliability).
✅ `.predict(comments)`
- For each comment in the input list:
  - Fills in the prompt with the actual comment text,
  - Calls the LLM (via LiteLLM) to generate a classification,
  - Parses the JSON output to extract whether the comment is considered toxic (`1`) or non-toxic (`0`).
✅ `.parse_answer(answer)`
- Uses a regex to extract the JSON code block (the part between the `json` code fences) from the LLM's output,
- This ensures we only read the structured part of the answer, even if the LLM adds extra commentary or explanations.
✅ `.metric(y_true, y_pred)`
- Uses `sklearn` to calculate precision, recall, and F1-score (macro),
- Prints the results as a clean table for easy analysis.
✅ `.analyze_error(y_true, y_pred)`
- Builds a confusion matrix to see where the LLM is making errors,
- Lists false positives (non-toxic predicted as toxic) and false negatives (toxic predicted as non-toxic).
🛑 Why Constrain the Output to JSON?¶
LLMs often produce verbose natural language answers, which makes it hard to parse and use programmatically.
By explicitly instructing the LLM to output a valid JSON object (e.g., `{ "toxic": true }`), we ensure:
- 🔍 Easier parsing (no ambiguity in answers),
- ✅ Consistency across runs,
- ⚡ Automation-friendly: critical when running many predictions!
This practice is especially important when using LLMs for automated pipelines or batch processing.
💡 Let’s now apply this class to different prompt engineering techniques to see how well each works for toxic comment detection!
from litellm import completion
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
from jinja2 import Template
import json
class LLMClient:
    def __init__(self, prompt_template, model_name="gpt-4o-mini", temperature=0.2):
        self.model_name = model_name
        self.temperature = temperature
        self.prompt_template = Template(prompt_template)

    def parse_answer(self, answer):
        # Extract the fenced ```json ... ``` block and parse it into a dict
        pattern = r"```json(.*?)```"
        matches = re.findall(pattern, answer, re.DOTALL)
        parsed = json.loads(matches[0])
        return parsed

    def predict(self, comments):
        predictions = []
        for comment in tqdm(comments):
            # Build prompt
            prompt = self.prompt_template.render(comment=comment)
            messages = [
                {"role": "system", "content": "You are an expert in detecting toxic content."},
                {"role": "user", "content": prompt}
            ]
            # Call LiteLLM
            response = completion(
                model=self.model_name,
                messages=messages,
                temperature=self.temperature
            )
            answer = response["choices"][0]["message"]["content"].lower().strip()
            try:
                parsed = self.parse_answer(answer)
                is_toxic = parsed.get("toxic")
            except Exception:
                # Fall back to non-toxic when no valid JSON block can be parsed
                is_toxic = False
            predictions.append(1 if is_toxic else 0)
        return predictions

    def metric(self, y_true, y_pred):
        report = classification_report(y_true, y_pred, digits=4, output_dict=True)
        df = pd.DataFrame(report).transpose()
        print("\n📊 Classification Report:\n", df[["precision", "recall", "f1-score"]])
        return df

    def analyze_error(self, y_true, y_pred):
        cm = confusion_matrix(y_true, y_pred)
        df_cm = pd.DataFrame(cm, index=["Actual Non-toxic", "Actual Toxic"], columns=["Pred Non-toxic", "Pred Toxic"])
        print("\n🔍 Confusion Matrix:\n", df_cm)
        # False Positives / False Negatives
        false_positives = np.where((y_true == 0) & (y_pred == 1))[0]
        false_negatives = np.where((y_true == 1) & (y_pred == 0))[0]
        print(f"\n⚠️ False Positives: {false_positives.shape[0]}")
        print(f"⚠️ False Negatives: {false_negatives.shape[0]}")
        return false_positives, false_negatives
prompt_template = """
You are a toxicity classifier.
Your task is to determine if the following comment is toxic or non-toxic. Return only a JSON object with the following format:
```json
{
"toxic": True/False
}
```
Comment:
{{comment}}
Answer:
"""
# Create instance
simple_prompt = LLMClient(prompt_template=prompt_template,
model_name="gpt-4o-mini",
temperature=0.2)
texts = df_demo["comment_text"].tolist()
labels = df_demo["toxic"].tolist()
predictions = simple_prompt.predict(texts)
# Metrics
# Note: the dataset's "toxic" column appears to be inverted relative to its name
# (see the sample preview above), hence the 1 - labels flip before scoring.
simple_prompt.metric(1 - np.array(labels), np.array(predictions))
# Error analysis
simple_prompt.analyze_error(1 - np.array(labels), np.array(predictions))
Initial Prompt-Only Performance¶
Our Prompt-Only baseline for toxic comment classification using LiteLLM achieves:
- Macro F1-score: ~0.92
- Balanced precision and recall across both classes.
✅ The model performs well overall, but there are still:
- 14 false positives (non-toxic labeled as toxic),
- 9 false negatives (toxic labeled as non-toxic).
🎯 Next Step: Boosting with Few-Shot Examples¶
To further improve classification, especially for difficult or ambiguous cases, we’ll:
- Use these false positives and false negatives from the dev set as few-shot examples,
- Test if adding 1 or 2 examples per label helps the model clarify edge cases and boost performance.
Let’s dive in and see how few-shot learning can help us refine the classifier!
dev_texts = df_dev["comment_text"].tolist()
dev_labels = df_dev["toxic"].tolist()
dev_predictions = simple_prompt.predict(dev_texts)
false_positives, false_negatives = simple_prompt.analyze_error(1 - np.array(dev_labels), np.array(dev_predictions))
🔍 Confusion Matrix:
                   Pred Non-toxic  Pred Toxic
Actual Non-toxic               43           7
Actual Toxic                    6          44

⚠️ False Positives: 7
⚠️ False Negatives: 6
print("False Positives:")
for fp in false_positives:
print(dev_texts[fp])
print("-"*100)
print("False Negatives:")
for fn in false_negatives:
print(dev_texts[fn])
print("-"*100)
False Positives: stop harassing complain user keegscee started attacking nowhere way around easy verify leave alone complain get blocked ---------------------------------------------------------------------------------------------------- nambiar looks obsessed associate certain terms professions ezhava community many ignorant people editing wikipedia articles doubt one among consider looking like dalit bad fair skin superior sick mind churn epithets want know looks need know first order tipu sultan gave nairs capturing malabar order orginal form seen mysore museum recently found one sitesyou read must proceed opposite manner dwell quietly pay dues like good subjects since practice one women associate ten men leave mothers sisters unconstrained obscene practices hence born adultery shameless connections heasts filed hereby require forsake sinful practices live like rest mankind disobedient commands made repeated vows honour whole islam march chief persons seat government make introspection tell good today tapping prostitution dalit looks european iranian looks suggest get basic knowledge history archeology anthropology etc think editing encyclopedia meant immature minds martial race british invention martial races intellectually inferior people would fight someone war get killed somebody one reasons pakistan invaded india times many martial tribes pakistan assumed defeat bigger army indian one pakistani equal hindus proven wrong noted nairs never challenged supremacy brahmins kashtriyas varma always obedient servants fulfilled every wish masters anil kumar ---------------------------------------------------------------------------------------------------- perhaps external link human_rights_and_equal_opportunity_commission page writings order demonstrate great example sort comment promotes racial religious intollerance strong feelings either way regarding external links ---------------------------------------------------------------------------------------------------- request member board trustees ask remove false misleading libellous information contained wikipedia remind continuing hold information libel illegal refer specifically webpage additionally talk pages user pages address blanked preserve integrity service failure would constitute smear campaign wikipedia whilst initiated persons longhair wikipedia hence sued result damages causes strongly recommend delete lies falsehoods ---------------------------------------------------------------------------------------------------- alpha rascals ---------------------------------------------------------------------------------------------------- wikipedia freedom speech reliable sources bunch jack asses wanna feel important imposing rules everyone wikipedia reliable gossip mag care guy says chinese democracy recorde source mistaken album recorded tossed slash left anyone likes band would know obviously controlfreak ---------------------------------------------------------------------------------------------------- know christopher columbus born november likes smell eat poop spanish people taught poop mini vans ---------------------------------------------------------------------------------------------------- False Negatives: nah far ass sun shine ---------------------------------------------------------------------------------------------------- really strange much edits alike looks like sockpuppets ---------------------------------------------------------------------------------------------------- thirteen days life old fart 
---------------------------------------------------------------------------------------------------- hypocrite liar merely thinking release sheltered wikipedia policy bound cocoon say floor ---------------------------------------------------------------------------------------------------- balance section people care think monkees abomination like fred cocklskew frampington said gives like think would good yeahhh ---------------------------------------------------------------------------------------------------- blocked marking link rot got blocked daring mark articles link rot problems blocked tagging problem could rest link rot specialists possibly come defence thanks much ----------------------------------------------------------------------------------------------------
🔍 Few-Shot Learning with Edge Case Examples¶
To further refine our LLM-based toxicity classification, we’ll use few-shot learning:
✅ We'll build two few-shot prompts, one with 1 pair and one with 2 pairs of real examples from the dev set (drawn from the false positives and false negatives identified earlier).
✅ Each example includes the comment and its true label (toxic or non-toxic).
✅ We’ll test if providing these examples helps the LLM generalize better to edge cases and reduce errors.
🧩 Our Few-Shot Prompt Format¶
Here's the template for the few-shot prompts:

You are an expert classifier for toxic content.

Here are some examples:

Comment: "{comment_1}"
Answer:
```json
{
  "toxic": true
}
```

Comment: "{comment_2}"
Answer:
```json
{
  "toxic": false
}
```

...

Comment: "{new_comment}"
Answer:
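Before hard-coding the examples, here is a hedged sketch of how such a few-shot block could be assembled programmatically from hand-picked dev-set error indices. The `build_few_shot_block` helper and the chosen indices are illustrative, not part of the original pipeline.

```python
import json

def build_few_shot_block(examples):
    """Render (comment, is_toxic) pairs in the same format as the template above."""
    parts = []
    for comment, is_toxic in examples:
        parts.append(
            f"Comment: {comment}\n"
            f"Answer:\n```json\n{json.dumps({'toxic': bool(is_toxic)})}\n```\n"
        )
    return "\n".join(parts)

# Illustrative: pick one false negative (toxic) and one false positive (non-toxic)
# from the dev-set error analysis printed earlier.
chosen = [
    (dev_texts[false_negatives[2]], True),   # e.g. "thirteen days life old fart"
    (dev_texts[false_positives[0]], False),  # e.g. "stop harassing complain user ..."
]
print(build_few_shot_block(chosen))
```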
prompt_template = """
You are a toxicity classifier.
Your task is to determine if the following comment is toxic or non-toxic. Return only a JSON object with the following format:
```json
{
"toxic": True/False
}
```
Here are some examples:
Comment: thirteen days life old fart
Answer:
```json
{
"toxic": true
}
```
Comment: stop harassing complain user keegscee started attacking nowhere way around easy verify leave alone complain get blocked
Answer:
```json
{
"toxic": false
}
```
Comment:
{{comment}}
Answer:
"""
texts = df_demo["comment_text"].tolist()
labels = df_demo["toxic"].tolist()
fsl_1_prompt = LLMClient(prompt_template=prompt_template,
model_name="gpt-4o-mini",
temperature=0.2)
fsl_1_predictions = fsl_1_prompt.predict(texts)
100%|██████████| 300/300 [04:55<00:00, 1.01it/s]
prompt_template = """
You are a toxicity classifier.
Your task is to determine if the following comment is toxic or non-toxic. Return only a JSON object with the following format:
```json
{
"toxic": True/False
}
```
Here are some examples:
Comment: thirteen days life old fart
Answer:
```json
{
"toxic": true
}
```
Comment: stop harassing complain user keegscee started attacking nowhere way around easy verify leave alone complain get blocked
Answer:
```json
{
"toxic": false
}
```
Comment: perhaps external link human_rights_and_equal_opportunity_commission page writings order demonstrate great example sort comment promotes racial religious intollerance strong feelings either way regarding external links
Answer:
```json
{
"toxic": false
}
```
Comment: hypocrite liar merely thinking release sheltered wikipedia policy bound cocoon say floor
Answer:
```json
{
"toxic": true
}
```
Comment:
{{comment}}
Answer:
"""
# Create instance
fsl_2_prompt = LLMClient(prompt_template=prompt_template,
model_name="gpt-4o-mini",
temperature=0.2)
fsl_2_predictions = fsl_2_prompt.predict(texts)
100%|██████████| 300/300 [04:58<00:00, 1.00it/s]
# Metrics
fsl_1_prompt.metric(1 - np.array(labels), np.array(fsl_1_predictions))
fsl_2_prompt.metric(1 - np.array(labels), np.array(fsl_2_predictions))
# Error analysis
fsl_1_prompt.analyze_error(1 - np.array(labels), np.array(fsl_1_predictions))
fsl_2_prompt.analyze_error(1 - np.array(labels), np.array(fsl_2_predictions))
📊 Classification Report (1 pair):
               precision    recall  f1-score
0               0.946309  0.940000  0.943144
1               0.940397  0.946667  0.943522
accuracy        0.943333  0.943333  0.943333
macro avg       0.943353  0.943333  0.943333
weighted avg    0.943353  0.943333  0.943333

📊 Classification Report (2 pairs):
               precision    recall  f1-score
0               0.959184  0.940000  0.949495
1               0.941176  0.960000  0.950495
accuracy        0.950000  0.950000  0.950000
macro avg       0.950180  0.950000  0.949995
weighted avg    0.950180  0.950000  0.949995

🔍 Confusion Matrix (1 pair):
                   Pred Non-toxic  Pred Toxic
Actual Non-toxic              141           9
Actual Toxic                    8         142

⚠️ False Positives: 9
⚠️ False Negatives: 8

🔍 Confusion Matrix (2 pairs):
                   Pred Non-toxic  Pred Toxic
Actual Non-toxic              141           9
Actual Toxic                    6         144

⚠️ False Positives: 9
⚠️ False Negatives: 6
(array([ 45, 84, 86, 87, 88, 120, 170, 198, 278]), array([ 12, 63, 144, 153, 193, 288]))
🔍 Impact of Adding Few-Shot Examples¶
By incorporating false positive and false negative examples as few-shot demonstrations in the prompt, we observe clear improvements in the model’s performance:
📈 Summary of Results¶
Few-Shot Examples | Macro F1 | False Positives | False Negatives |
---|---|---|---|
Prompt-Only | 0.9200 | 14 | 9 |
1 pair (2 examples) | 0.9433 | 9 | 8 |
2 pairs (4 examples) | 0.9500 | 9 | 6 |
🟢 Key Observations¶
✅ F1-score improved from ~0.92 (prompt-only) → 0.9433 (1 pair) → 0.95 (2 pairs).
✅ False negatives decreased from 9 → 8 → 6, indicating better coverage of toxic examples.
✅ False positives dropped from 14 to 9 with the first pair of examples and then held steady, showing more robust classification of non-toxic comments.
💡 Takeaway¶
Adding a few well-chosen examples helps the model:
- Resolve ambiguities in edge cases,
- Better understand nuances of toxic vs. non-toxic language,
- And achieve better macro-average performance — critical for balanced datasets!
An open question is when to stop adding examples, since each one lengthens the prompt. Now let's add chain-of-thought reasoning to the mix and see if it helps.
🧠 Chain-of-Thought (CoT) Reasoning for Toxicity Classification¶
💡 What is Chain-of-Thought (CoT)?¶
Chain-of-Thought prompting is a powerful strategy that guides the LLM to explicitly reason through intermediate steps, rather than jumping straight to an answer.
Instead of giving the model a single classification instruction, we prompt it to explain its reasoning process step by step. This can:
✅ Help the model unpack subtle cues in language (like sarcasm, implied aggression),
✅ Reduce overconfidence in borderline cases,
✅ Improve robustness and consistency.
🔬 Applying CoT to Toxic Comment Classification¶
In the CoT version, we’ll:
- Provide a short reasoning explanation for each example in the few-shot prompt,
- Instruct the LLM to also explain its decision when analyzing new comments,
- Still end the output with a JSON object for easy parsing!
This helps the LLM learn and apply a structured thought process to classify nuanced comments.
Let’s set it up!
prompt_template_cot = """
You are a toxicity classifier.
Your task is to determine if the following comment is toxic or non-toxic.
First, provide a short reasoning about the comment's language and tone.
Finally, return only a JSON object with the following format:
```json
{
"toxic": true/false
}
```
Here are some examples:
Comment: thirteen days life old fart
Reasoning: This comment uses derogatory language, indicating disrespectful and toxic behavior.
Answer:
```json
{
"toxic": true
}
```
Comment: stop harassing complain user keegscee started attacking nowhere way around easy verify leave alone complain get blocked
Reasoning: This comment is a complaint about user interactions, but it does not contain toxic or aggressive language.
Answer:
```json
{
"toxic": false
}
```
Comment: perhaps external link human_rights_and_equal_opportunity_commission page writings order demonstrate great example sort comment promotes racial religious intollerance strong feelings either way regarding external links
Reasoning: This comment references external links but does not show direct toxicity.
Answer:
```json
{
"toxic": false
}
```
Comment: hypocrite liar merely thinking release sheltered wikipedia policy bound cocoon say floor
Reasoning: This comment calls someone a hypocrite and liar, using negative and toxic language.
Answer:
```json
{
"toxic": true
}
```
Comment: {{comment}}
Reasoning:
"""
client_cot = LLMClient(prompt_template=prompt_template_cot, model_name="gpt-4o-mini", temperature=0.2)
# Evaluate on the demo set (same 300 examples as before)
preds_cot = client_cot.predict(texts)
# Show metrics
client_cot.metric(1 - np.array(labels), np.array(preds_cot))
# Analyze errors
client_cot.analyze_error(1 - np.array(labels), np.array(preds_cot))
100%|██████████| 300/300 [07:30<00:00, 1.50s/it]
📊 Classification Report:
               precision    recall  f1-score
0               0.979452  0.953333  0.966216
1               0.954545  0.980000  0.967105
accuracy        0.966667  0.966667  0.966667
macro avg       0.966999  0.966667  0.966661
weighted avg    0.966999  0.966667  0.966661

🔍 Confusion Matrix:
                   Pred Non-toxic  Pred Toxic
Actual Non-toxic              143           7
Actual Toxic                    3         147

⚠️ False Positives: 7
⚠️ False Negatives: 3
(array([ 45, 84, 86, 120, 170, 195, 257]), array([ 49, 63, 144]))
Our Chain-of-Thought (CoT) prompt, combining few-shot examples + explicit reasoning, has further boosted the model’s performance!
📈 Performance Summary¶
Technique | Macro F1 | False Positives | False Negatives |
---|---|---|---|
Prompt-Only | ~0.92 | 14 | 9 |
Few-Shot (1 pair) | ~0.94 | 9 | 8 |
Few-Shot (2 pairs) | ~0.95 | 9 | 6 |
Chain-of-Thought (2 pairs) | 0.967 | 7 | 3 |
🔍 Key Observations¶
✅ Macro F1-score improved significantly — from ~0.92 (prompt-only) → 0.967 (CoT),
✅ False negatives reduced dramatically: from 9 → 3,
✅ False positives also decreased: from 14 → 7.
This shows that the explicit reasoning in CoT helps the model better:
- Understand nuances of borderline or ambiguous comments,
- Distinguish subtle forms of toxicity,
- Achieve a more balanced classification across both classes.
🟡 Trade-off: Inference Time¶
While CoT offers strong performance gains, it does come with a trade-off:
⚠️ Longer Inference Time
- CoT requires the model to generate a step-by-step reasoning process before outputting the JSON answer.
- In real-world applications, this may impact latency for high-volume deployments.
📌 Next Steps & Robustness Check¶
To validate these gains and ensure they’re not just from random LLM variability:
- 🧪 Compare FP/FN differences in detail to ensure they’re truly “hard cases” now correctly handled.
- 🔁 Repeat experiments multiple times to measure standard deviation and confirm these improvements are robust (a minimal sketch follows below).
This would give us more confidence in deploying this approach for real-world toxic content moderation!
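As a sketch of that robustness check, assuming the `client_cot` instance and the demo `texts`/`labels` defined above (and that repeated paid API calls are acceptable), we could re-run the classifier a few times and look at the spread of the macro F1:

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative robustness check: repeat the CoT run and measure F1 variability.
n_runs = 3  # kept small on purpose; each run issues 300 LLM calls
scores = []
for _ in range(n_runs):
    preds = client_cot.predict(texts)
    scores.append(f1_score(1 - np.array(labels), np.array(preds), average="macro"))

print(f"Macro F1 over {n_runs} runs: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
```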
🤖 Beyond Manual Crafting: Automatic Prompt Engineering (APE)¶
So far, we’ve manually:
✅ Crafted prompt-only and few-shot examples,
✅ Added chain-of-thought reasoning for more nuanced predictions,
✅ Observed clear improvements in both macro F1 and error reduction.
🟡 The Challenge¶
🧠 Manual prompt engineering can be time-consuming and subjective — it depends on human intuition and small-scale error analysis.
💡 But what if we could automate this process?
🌍 Enter APE – Automatic Prompt Engineering¶
The APE framework (Zhou et al., 2022) proposes the following loop (a minimal sketch follows the list):
1️⃣ Starting with an initial prompt,
2️⃣ Generating candidate prompts through variations and mutations,
3️⃣ Using LLM self-evaluation (or performance on a dev set) to identify the best-performing prompt,
4️⃣ Iteratively refining the prompt to optimize performance.
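Here is a hedged, minimal sketch of such a loop built on the `LLMClient` class above. The mutation instruction, the candidate count, and the use of dev-set macro F1 as the selection signal are illustrative choices, not the exact procedure from the paper.

```python
import numpy as np
from litellm import completion
from sklearn.metrics import f1_score

def propose_variants(base_prompt, n_variants=3, model_name="gpt-4o-mini"):
    """Ask the LLM to rewrite the instruction part of a prompt (illustrative mutation step)."""
    variants = []
    for _ in range(n_variants):
        response = completion(
            model=model_name,
            temperature=1.0,  # high temperature to diversify candidates
            messages=[{
                "role": "user",
                "content": (
                    "Rewrite the following classification prompt to make it clearer and "
                    "more precise. Keep the {{comment}} placeholder and the JSON output "
                    f"format unchanged. Return only the rewritten prompt.\n\n{base_prompt}"
                ),
            }],
        )
        variants.append(response["choices"][0]["message"]["content"])
    return variants

def score_prompt(candidate_prompt, comments, true_labels):
    """Evaluate a candidate prompt on the dev set with macro F1 (selection step)."""
    client = LLMClient(prompt_template=candidate_prompt, model_name="gpt-4o-mini", temperature=0.2)
    preds = client.predict(comments)
    return f1_score(true_labels, np.array(preds), average="macro")

# One illustrative refinement round on the dev set (labels flipped as in the rest of the notebook).
y_dev = 1 - np.array(dev_labels)
candidates = [prompt_template_cot] + propose_variants(prompt_template_cot)
scores = [score_prompt(p, dev_texts, y_dev) for p in candidates]
best_prompt = candidates[int(np.argmax(scores))]
print("Best dev macro F1:", max(scores))
```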
🔬 Benefits of APE¶
✅ Discover better prompts than what humans can manually guess,
✅ Continuously adapt prompts to new data, edge cases, or deployment needs,
✅ Combine robustness testing (standard deviation across runs) with prompt improvement.
📊 The Bigger Picture¶
To truly validate our best prompt:
- We should compare it to other LLM-based classifiers (like Claude, Mistral, or open-source models).
- And even benchmark against fine-tuned transformer models trained specifically on toxic comments.
This gives us the full picture (a minimal cross-provider sketch follows the table):
Approach | What it’s good at |
---|---|
🟩 Prompt Engineering (LiteLLM) | Quick deployment, no training, flexible |
🟦 Fine-tuned models | Fast inference, consistent performance |
🟧 APE | Best of both worlds, adaptive prompts |
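To round this off, here is a hedged sketch of how LiteLLM makes the cross-provider comparison above straightforward. The non-OpenAI model identifiers are illustrative and require the corresponding provider API keys and LiteLLM support in your installed version.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative model identifiers; check the LiteLLM docs for the exact strings
# supported by your installed version and configure the matching API keys.
candidate_models = [
    "gpt-4o-mini",                   # used throughout this notebook
    "mistral/mistral-small-latest",  # assumption: Mistral AI key configured
    "claude-3-haiku-20240307",       # assumption: Anthropic key configured
]

y_dev = 1 - np.array(dev_labels)  # same label flip as in the rest of the notebook
for model_name in candidate_models:
    client = LLMClient(prompt_template=prompt_template_cot, model_name=model_name, temperature=0.2)
    preds = client.predict(dev_texts)
    print(model_name, "macro F1:", round(f1_score(y_dev, np.array(preds), average="macro"), 4))
```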