Baseline with Regexes and spaCy for Spam Detection¶

In this notebook, we will:

Load a spam detection dataset from Hugging Face.
Split our data into train, dev, and test sets, and explain why we need all three.
Create a regex-based baseline pipeline:
- Build naive patterns from the train set.
- Evaluate on test set.
- Check results on dev set to find false positives/negatives.
- Update regex rules.
- Final metrics on the test set.
Build a spaCy pipeline for spam detection:
- Use token and phrase matchers.
- Repeat the same steps (train -> dev -> refine -> test).
Compare results between the improved regex approach and spaCy approach.

Setup and Imports¶

We’ll need:

datasets: To load the spam dataset.
scikit-learn: For splitting the dataset and computing metrics.
re (built-in): For regex-based matching.
spaCy: For token and phrase matchers.

Make sure to look at this link to install all the dependencies.

In [1]:

Copied!





# If you're in a local environment, uncomment the lines below:
# !poetry run python -m spacy download en_core_web_sm

import re
import spacy
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000  # in case we have large texts
# If you're in a local environment, uncomment the lines below:
# !poetry run python -m spacy download en_core_web_sm

import re
import spacy
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000  # in case we have large texts

1. Load the Dataset¶

We'll use NotShrirang/email-spam-filter. It's a dataset with email text labeled as spam or not spam.

In [2]:

Copied!

dataset = load_dataset("NotShrirang/email-spam-filter")
dataset
dataset = load_dataset("NotShrirang/email-spam-filter")
dataset

README.md:   0%|          | 0.00/113 [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/5.40M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5171 [00:00<?, ? examples/s]

Out[2]:

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'label', 'text', 'label_num'],
        num_rows: 5171
    })
})

We expect the dataset to have a train split by default, which we’ll further split into train, dev, and a final test. Alternatively, we can keep the existing train as a larger pool and create dev/test from it. Some datasets also come with separate test sets. We'll check what's available after loading.

In [3]:

Copied!

# We'll see the columns: we expect something like {'text': ..., 'spam': ...}.
dataset["train"].features
# We'll see the columns: we expect something like {'text': ..., 'spam': ...}.
dataset["train"].features

Out[3]:

{'Unnamed: 0': Value(dtype='int64', id=None),
 'label': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'label_num': Value(dtype='int64', id=None)}

2. Create Train/Dev/Test Splits¶

Why do we need a dev set in addition to a train/test set?

Train set: used to fit our model (or in this case, develop our regex/spaCy patterns).
Dev (validation) set: used to tweak or refine patterns, hyperparameters, etc., without touching the final test. This prevents overfitting on the test set.
Test set: final unbiased evaluation.

If we only had train/test, we might continually adjust our method to do better on the test set, inadvertently tuning to that test distribution. The dev set helps keep the test set "truly" unseen.

In [4]:

Copied!

df_data = dataset["train"].to_pandas()
df_data.head()
df_data = dataset["train"].to_pandas()
df_data.head()

Out[4]:

	Unnamed: 0	label	text	label_num
0	605	ham	Subject: enron methanol ; meter # : 988291\nth...	0
1	2349	ham	Subject: hpl nom for january 9 , 2001\n( see a...	0
2	3624	ham	Subject: neon retreat\nho ho ho , we ' re arou...	0
3	4685	spam	Subject: photoshop , windows , office . cheap ...	1
4	2030	ham	Subject: re : indian springs\nthis deal is to ...	0

In [5]:

Copied!





# We'll do a 60/20/20 split from the single 'train' dataset.
df_train, df_temp = train_test_split(df_data, test_size=0.4, stratify=df_data["label"], random_state=42)
df_dev, df_test = train_test_split(df_temp, test_size=0.5, stratify=df_temp["label"], random_state=42)

print("Train size:", len(df_train))
print("Dev size:  ", len(df_dev))
print("Test size: ", len(df_test))
# We'll do a 60/20/20 split from the single 'train' dataset.
df_train, df_temp = train_test_split(df_data, test_size=0.4, stratify=df_data["label"], random_state=42)
df_dev, df_test = train_test_split(df_temp, test_size=0.5, stratify=df_temp["label"], random_state=42)

print("Train size:", len(df_train))
print("Dev size:  ", len(df_dev))
print("Test size: ", len(df_test))

Train size: 3102
Dev size:   1034
Test size:  1035

Now we have 3 separate splits. We'll define some helper functions for evaluation.

In [6]:

Copied!





def compute_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, pos_label=1)
    rec = recall_score(y_true, y_pred, pos_label=1)
    f1 = f1_score(y_true, y_pred, pos_label=1)
    return {
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1
    }

def print_metrics(metrics_dict, prefix=""):
    print(f"{prefix} Accuracy:  {metrics_dict['accuracy']*100:.2f}%")
    print(f"{prefix} Precision: {metrics_dict['precision']*100:.2f}%")
    print(f"{prefix} Recall:    {metrics_dict['recall']*100:.2f}%")
    print(f"{prefix} F1-score:  {metrics_dict['f1']*100:.2f}%\n")
def compute_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, pos_label=1)
    rec = recall_score(y_true, y_pred, pos_label=1)
    f1 = f1_score(y_true, y_pred, pos_label=1)
    return {
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1
    }

def print_metrics(metrics_dict, prefix=""):
    print(f"{prefix} Accuracy:  {metrics_dict['accuracy']*100:.2f}%")
    print(f"{prefix} Precision: {metrics_dict['precision']*100:.2f}%")
    print(f"{prefix} Recall:    {metrics_dict['recall']*100:.2f}%")
    print(f"{prefix} F1-score:  {metrics_dict['f1']*100:.2f}%\n")

3. Regex-Based Baseline¶

3a. Create a first naive pipeline¶

We’ll look at the train set to find some potential spam indicators. Typically, spam might have words like free, win, urgent, congratulations, etc. This is just a guess. In a real scenario, you’d examine the train data more carefully.

In [7]:

Copied!





# Let's gather some 'spammy' tokens from the train set by naive frequency analysis.
# We'll do a quick check of most common words in spam vs. not spam.

import collections

spam_texts = df_train[df_train["label"] == "spam"]["text"].values
ham_texts = df_train[df_train["label"] == "ham"]["text"].values

def tokenize(text):
    return re.findall(r"\w+", text.lower())

spam_words = []
for txt in spam_texts:
    spam_words.extend(tokenize(txt))

spam_counter = collections.Counter(spam_words)
spam_most_common = spam_counter.most_common(20)
spam_most_common
# Let's gather some 'spammy' tokens from the train set by naive frequency analysis.
# We'll do a quick check of most common words in spam vs. not spam.

import collections

spam_texts = df_train[df_train["label"] == "spam"]["text"].values
ham_texts = df_train[df_train["label"] == "ham"]["text"].values

def tokenize(text):
    return re.findall(r"\w+", text.lower())

spam_words = []
for txt in spam_texts:
    spam_words.extend(tokenize(txt))

spam_counter = collections.Counter(spam_words)
spam_most_common = spam_counter.most_common(20)
spam_most_common

Out[7]:

[('the', 4778),
 ('to', 3356),
 ('and', 3123),
 ('of', 2967),
 ('a', 2402),
 ('in', 2041),
 ('you', 1744),
 ('for', 1659),
 ('this', 1519),
 ('is', 1476),
 ('your', 1246),
 ('subject', 1000),
 ('with', 939),
 ('3', 918),
 ('that', 874),
 ('or', 869),
 ('on', 850),
 ('s', 848),
 ('be', 842),
 ('as', 766)]

We clearly see a lot of common words in the spam emails. "The", "of", ... stop words in English. Let's get rid of them. I imagine there are a lot of numbers and punctuation as well. Let's get rid of them too.

In [8]:

Copied!





from spacy.lang.en.stop_words import STOP_WORDS
import string

punctuation = string.punctuation
numbers = string.digits

stop_words = set(STOP_WORDS)

spam_words = []
for txt in spam_texts:
    for word in tokenize(txt):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            spam_words.append(word)

spam_counter = collections.Counter(spam_words)
spam_most_common = spam_counter.most_common(20)
spam_most_common
from spacy.lang.en.stop_words import STOP_WORDS
import string

punctuation = string.punctuation
numbers = string.digits

stop_words = set(STOP_WORDS)

spam_words = []
for txt in spam_texts:
    for word in tokenize(txt):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            spam_words.append(word)

spam_counter = collections.Counter(spam_words)
spam_most_common = spam_counter.most_common(20)
spam_most_common

Out[8]:

[('subject', 1000),
 ('company', 506),
 ('http', 460),
 ('information', 361),
 ('statements', 312),
 ('price', 310),
 ('email', 277),
 ('pills', 258),
 ('time', 241),
 ('font', 214),
 ('free', 211),
 ('message', 194),
 ('investment', 194),
 ('stock', 187),
 ('money', 184),
 ('business', 184),
 ('securities', 179),
 ('report', 176),
 ('2004', 174),
 ('contact', 172)]

We'll pick a few frequent tokens as naive spam triggers. (In reality, you'd do more thorough exploration or use a more advanced approach—but let's keep it simple for demonstration.)

In [9]:

Copied!





# Let's define a basic regex pattern that flags emails containing typical spammy words.
spam_keywords = ["free", "http", "www", "money", 
                 "win", "winner", "congratulations", 
                 "urgent", "claim", "prize", "click",
                 "price"]
pattern = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)

def regex_spam_classifier(text):
    if pattern.search(text):
        return 1  # spam
    return 0     # not spam
# Let's define a basic regex pattern that flags emails containing typical spammy words.
spam_keywords = ["free", "http", "www", "money", 
                 "win", "winner", "congratulations", 
                 "urgent", "claim", "prize", "click",
                 "price"]
pattern = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)

def regex_spam_classifier(text):
    if pattern.search(text):
        return 1  # spam
    return 0     # not spam

3b. Get metrics on the test set¶

Even though we said we’d refine on dev, let’s see how it does out-of-the-box on the test set. (Sometimes it’s informative to check a naive baseline right away.)

In [10]:

Copied!

y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier(txt) for txt in df_test["text"].values]

test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")
y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier(txt) for txt in df_test["text"].values]

test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")

Regex Baseline (Test)  Accuracy:  68.79%
Regex Baseline (Test)  Precision: 47.42%
Regex Baseline (Test)  Recall:    70.33%
Regex Baseline (Test)  F1-score:  56.64%

Okay, not so bad, we get 70% of the spam emails, but we also have a lot of false positives almost 50% of our predictions are false positives !!

3c. Check dev set, find false positives & negatives¶

Let’s see how many spam messages were missed (false negatives) and how many ham messages were flagged as spam (false positives) on the dev set.

In [11]:

Copied!





y_dev_true = df_dev["label_num"].values
texts_dev = df_dev["text"].values

y_dev_pred = [regex_spam_classifier(txt) for txt in texts_dev]
dev_metrics = compute_metrics(y_dev_true, y_dev_pred)
print_metrics(dev_metrics, prefix="Regex Baseline (Dev) ")

# Let's identify the false positives and negatives.
fp_indices = []  # predicted spam but actually ham
fn_indices = []  # predicted ham but actually spam

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
    if gold == 0 and pred == 1:
        fp_indices.append(i)
    elif gold == 1 and pred == 0:
        fn_indices.append(i)

print("False Positives:", len(fp_indices), "examples")
print("False Negatives:", len(fn_indices), "examples")
y_dev_true = df_dev["label_num"].values
texts_dev = df_dev["text"].values

y_dev_pred = [regex_spam_classifier(txt) for txt in texts_dev]
dev_metrics = compute_metrics(y_dev_true, y_dev_pred)
print_metrics(dev_metrics, prefix="Regex Baseline (Dev) ")

# Let's identify the false positives and negatives.
fp_indices = []  # predicted spam but actually ham
fn_indices = []  # predicted ham but actually spam

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
    if gold == 0 and pred == 1:
        fp_indices.append(i)
    elif gold == 1 and pred == 0:
        fn_indices.append(i)

print("False Positives:", len(fp_indices), "examples")
print("False Negatives:", len(fn_indices), "examples")

Regex Baseline (Dev)  Accuracy:  67.41%
Regex Baseline (Dev)  Precision: 45.77%
Regex Baseline (Dev)  Recall:    66.67%
Regex Baseline (Dev)  F1-score:  54.27%

False Positives: 237 examples
False Negatives: 100 examples

First thing is that the metrics are quite similar from the test set. Which means that both sets may be similar. Therefore if we find a way to improve on the dev set, we should see an improvement on the test set.

We clearly see that we have a lot of false positives, also a significant number of false negatives. Therefore first, we may want to cover more cases and then create some other rules to reduce the number of false positives.

3d. Analyze FN to improve regex¶

Let's first take a look at the false negatives to see if we can improve the regex.

In [12]:

Copied!





print("\n--- Some False Negatives ---\n")
for idx in fn_indices[:20]:
    print("DEV INDEX:", idx)
    print(texts_dev[idx][:300], "...")
    print("---")
print("\n--- Some False Negatives ---\n")
for idx in fn_indices[:20]:
    print("DEV INDEX:", idx)
    print(texts_dev[idx][:300], "...")
    print("---")

--- Some False Negatives ---

DEV INDEX: 17
Subject: fw : old aged mmomy wants a date
hey , man ! : )
dovizhdane
 ...
---
DEV INDEX: 22
Subject: prescription medication delivered overnight . p . n . termin , valium , + xanax + available . ki 80 hzhrb 5 if
we believe ordering medication should be as simple as ordering anything else on the internet : private , secure , and easy .
always available : \ xana : x : # vlagr @ , | vialium ^ ...
---
DEV INDEX: 30
Subject: 6 et vi - codin le 6 ally baronial fy dmabi
hey there ,
ofore
phacy
specials on :
viin , van - ax , vi - are
tariff
pleaove
me
taunt
accompaniment
yjhanl pactwmtnfbiiw pl ym romjr
jco jbxdlnvtwszthg
njrjduhen d yfwvg lrn ...
---
DEV INDEX: 44
Subject: story - my daughter isn ' t in pain anymore
newsweek medical : are you in pain ?
comparison finalists
no more
crave persuasivehave penis worldbodyguard
lackey coupeglutamine escape morphinefisherman
cryptanalytic stokecellar algonquin bewitchcatnip
complicate alkalinedalton kafkaesque gigab ...
---
DEV INDEX: 57
Subject: whats the word . order your prescr - iption ' s here . . adams gettysburg
 . . . . pertinaciousd ' . . . .
rawmnv ...
---
DEV INDEX: 68
Subject: how to earn thousands writing google adwords part - time kara
googlecash gives you all the tools you need to turn the search engine google . com into
an autopilot cash generating machine !
what ' s your dream lifestyle ?
phosphor disco ghoulish eardrum airplane geriatric approximant drop co ...
---
DEV INDEX: 72
Subject: microsoft update warning - january 7 th
minnesota , which can clinch a wild - card
playoff spot with a loss by either carolina or st . louis this weekend , appeared on
its way to retaking the lead . but a holding penalty on birk - - the vikings were
flagged nine times for 78 yards - - wiped ...
---
DEV INDEX: 77
Subject: young pussies
tonya could feel the glow of the hundreds of candles on her bare skin .
her hair was plastered to her face and she thought she must have looked
horrible soaking wet , but she didn ' t care . gabriel thought she was beautiful
and that was all she needed to know . tonya slid tow ...
---
DEV INDEX: 90
Subject: the only smart way to control spam
hey , i have a special _ offer for you . . .
better than all other spam filters -
only delivers the email you want !
this is the ultimate solution that is guaranteed to stop all spam
without
losing any of your important email ! this system protects you 100 ...
---
DEV INDEX: 108
Subject: can jim come over and watch ?
up to 80 %
savings on
xanax ,
valium ,
codeine ,
viagra
and moretry us out here
for email
removal ,
go here .
jewelry elsinore chairperson ameslan decorticate badge foam cutler zinc shopkeep cylinder oracle alcove steppe inefficacy skeleton quartic wasp compagn ...
---
DEV INDEX: 114
Subject: greatly improve your stamina
i ' ve been using your product for 4 months now . i ' ve increased my
length from 2
to nearly 6 . your product has saved my sex life . - matt , fl
my girlfriend loves the results , but she doesn ' t know what i do . she
thinks
it ' s natural - thomas , ca
pleasu ...
---
DEV INDEX: 116
Subject: cheap soft viagra
viagra soft tabs : perfect feeling of being men again .
starts working within just 15 minutes .
soft tabs :
info site
you take a candy and get hard rock erection .
this is not miracle . this is just soft tabs .
remove your email ...
---
DEV INDEX: 131
Subject: penls enlarg 3 ment pllls
enlarge your penls nowcllck h 3 re !
no more ...
---
DEV INDEX: 161
Subject: antigen downstairs dance
still no luck enlarging it ?
our 2 products will work for you !
1 . # 1 supplement available ! - works !
for vprx ciilck here
and
2 . * new * enhancement oil - get hard in 60 seconds ! amazing !
like no other oil you ' ve seen .
for vprx oil ciilck here
the 2 produc ...
---
DEV INDEX: 167
Subject: new details id : 21195
get your u n ive
rsi t y d i
plom acal 1 this number : 206
- 424 - 1596 ( anytime )
there are no required tests , class e s , books , or
interviews !
get a b a chelors , masters , m ba , and d
o ctorate ( phd ) d i
ploma ! receive the benefits and admiration
that come ...
---
DEV INDEX: 188
Subject: welcome to toronto pharmac euticals , the net ' s most secure source for presc ription medicines made in the usa . . coincide confrere
you can finally get * real * pain medic ation that works .
we receive your orders and one of our 24 x 7 onboard us physicians will approve of your order ( 9 ...
---
DEV INDEX: 202
Subject: inexplicable crying spells , sadness and / or irritability
- - - - 22037566626923367
hi varou ,
setting small , achievable goals will help will take you farther than you can imagine over time . it will help you reach your final destination : a happier , low - anxiety life .
we offer some of ...
---
DEV INDEX: 205
Subject: quick , easy vicodin w lij tirb xhzcixu
we are the only source for vicodin online !
- very easy ordering
- no prior prescription needed
- quick delivery
- inexpensive
 ...
---
DEV INDEX: 213
Subject: super cheap rates on best sexual health drug !
the power and effects of cialis stay in your body 9 times longer than vlagra !
save up to 60 % - order generic cialis today !
now they ' re chewable . like soft candy !
excalibu snuffybeautifu roy taffy daddy birdturbo abby cookies volley prope ...
---
DEV INDEX: 216
Subject: lower lipids and lower risk for heart disease langley
some hills are never seenthe universe is
expanding
album : good stufftitle : bad influence call it
bad big town holds me backbig town skinns my mind
sorry to have troubled you ; but it couldn ' t
be helped
bolan kerlaugir xmo 3 reginlejf ...
---

Okay looks interesting, maybe let's look for words that appear in the false negatives but not in the false positives.

In [13]:

Copied!





# Let's look for words that appear a lot in the false negatives but not so much in the false positives.
# Let's use collections to count the words in the false negatives and false positives.
# We'll get rid of stop words, punctuation and numbers.

fn_words = []
for idx in fn_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            fn_words.append(word)

fp_words = []
for idx in fp_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            fp_words.append(word)


fn_counter = collections.Counter(fn_words)
fp_counter = collections.Counter(fp_words)

# Let's create a ratio of occurences in the false negatives over the false positives.
fn_ratio = {word: fn_counter.get(word, 0) / (fp_counter.get(word, 0) + fn_counter.get(word, 0)) 
            for word in fn_counter if fp_counter.get(word, 0) + fn_counter.get(word, 0) > 4}

#Let's sort the words by the ratio.
fn_ratio = sorted(fn_ratio.items(), key=lambda x: x[1], reverse=True)

#Let's print the words that appear a lot in the false negatives but not so much in the false positives.
for word, ratio in fn_ratio[:50]:
    print(word, ratio)
# Let's look for words that appear a lot in the false negatives but not so much in the false positives.
# Let's use collections to count the words in the false negatives and false positives.
# We'll get rid of stop words, punctuation and numbers.

fn_words = []
for idx in fn_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            fn_words.append(word)

fp_words = []
for idx in fp_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            fp_words.append(word)


fn_counter = collections.Counter(fn_words)
fp_counter = collections.Counter(fp_words)

# Let's create a ratio of occurences in the false negatives over the false positives.
fn_ratio = {word: fn_counter.get(word, 0) / (fp_counter.get(word, 0) + fn_counter.get(word, 0)) 
            for word in fn_counter if fp_counter.get(word, 0) + fn_counter.get(word, 0) > 4}

#Let's sort the words by the ratio.
fn_ratio = sorted(fn_ratio.items(), key=lambda x: x[1], reverse=True)

#Let's print the words that appear a lot in the false negatives but not so much in the false positives.
for word, ratio in fn_ratio[:50]:
    print(word, ratio)

medications 1.0
palestinian 1.0
viagra 1.0
cheap 1.0
soft 1.0
minutes 1.0
vicodin 1.0
cialis 1.0
doctor 1.0
blood 1.0
loading 1.0
csgu 1.0
prescription 0.9090909090909091
spam 0.8888888888888888
stop 0.8888888888888888
sources 0.875
generic 0.8
military 0.8
rock 0.8
approved 0.8
sound 0.8
mobile 0.7777777777777778
ordering 0.75
story 0.6666666666666666
tabs 0.6666666666666666
lady 0.6666666666666666
video 0.625
waiting 0.625
remove 0.6153846153846154
attack 0.6
inside 0.6
international 0.6
friend 0.6
street 0.6
took 0.6
secure 0.5714285714285714
quick 0.5454545454545454
turn 0.5
clear 0.5
hard 0.5
real 0.5
quality 0.5
software 0.5
paper 0.5
short 0.5
credit 0.46153846153846156
enjoy 0.4444444444444444
said 0.4444444444444444
town 0.42857142857142855
case 0.42857142857142855

Well looks like we have some interesting words there. Let's add them to the regex. We do it dumb way here, but in practice we should explore a bit more.

In [14]:

Copied!





spam_keywords = ["free", "http", "www", "money", 
                 "win", "winner", "congratulations", 
                 "urgent", "claim", "prize", "click",
                 "price", "viagra", "vialium", "medication",
                 "aged", "xana", "xanax", "asyc", "cheap", 
                 "palestinian", "blood", "doctor", "cialis", 
                 "minutes", "vicodin", "soft", "loading", 
                 "csgu", "medications", "prescription", "spam", "stop"]
pattern = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)

def regex_spam_classifier_v0_2(text):
    if pattern.search(text):
        return 1  # spam
    return 0     # not spam
spam_keywords = ["free", "http", "www", "money", 
                 "win", "winner", "congratulations", 
                 "urgent", "claim", "prize", "click",
                 "price", "viagra", "vialium", "medication",
                 "aged", "xana", "xanax", "asyc", "cheap", 
                 "palestinian", "blood", "doctor", "cialis", 
                 "minutes", "vicodin", "soft", "loading", 
                 "csgu", "medications", "prescription", "spam", "stop"]
pattern = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)

def regex_spam_classifier_v0_2(text):
    if pattern.search(text):
        return 1  # spam
    return 0     # not spam

In [15]:

Copied!

y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier_v0_2(txt) for txt in df_test["text"].values]

test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")
y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier_v0_2(txt) for txt in df_test["text"].values]

test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")

Regex Baseline (Test)  Accuracy:  70.14%
Regex Baseline (Test)  Precision: 49.08%
Regex Baseline (Test)  Recall:    80.33%
Regex Baseline (Test)  F1-score:  60.94%

Incredible, meaning that just by adding a few words we get a huge improvement in the metrics (+10% of recall!) and the precision is still more or less the same.

3e. Analyze FP to improve regex¶

Let's do the same for the false positives. Meaning that we will find words that appear a lot in the false positives but not so much in the false negatives. If the message is detected as spam, we will apply another regex to check if it contains any of the words in the false positives. If it does, we will label it as ham.

First let's check the dev set false positives.

In [16]:

Copied!





y_dev_true = df_dev["label_num"].values
texts_dev = df_dev["text"].values

y_dev_pred = [regex_spam_classifier_v0_2(txt) for txt in texts_dev]
dev_metrics = compute_metrics(y_dev_true, y_dev_pred)
print_metrics(dev_metrics, prefix="Regex Baseline (Dev) ")

# Let's identify the false positives and negatives.
fp_indices = []  # predicted spam but actually ham
fn_indices = []  # predicted ham but actually spam

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
    if gold == 0 and pred == 1:
        fp_indices.append(i)
    elif gold == 1 and pred == 0:
        fn_indices.append(i)

print("False Positives:", len(fp_indices), "examples")
print("False Negatives:", len(fn_indices), "examples")
y_dev_true = df_dev["label_num"].values
texts_dev = df_dev["text"].values

y_dev_pred = [regex_spam_classifier_v0_2(txt) for txt in texts_dev]
dev_metrics = compute_metrics(y_dev_true, y_dev_pred)
print_metrics(dev_metrics, prefix="Regex Baseline (Dev) ")

# Let's identify the false positives and negatives.
fp_indices = []  # predicted spam but actually ham
fn_indices = []  # predicted ham but actually spam

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
    if gold == 0 and pred == 1:
        fp_indices.append(i)
    elif gold == 1 and pred == 0:
        fn_indices.append(i)

print("False Positives:", len(fp_indices), "examples")
print("False Negatives:", len(fn_indices), "examples")

Regex Baseline (Dev)  Accuracy:  70.50%
Regex Baseline (Dev)  Precision: 49.49%
Regex Baseline (Dev)  Recall:    81.00%
Regex Baseline (Dev)  F1-score:  61.44%

False Positives: 248 examples
False Negatives: 57 examples

We see that we reduced by two the number of false negatives. Let's see if we can reduce the number of false positives.

In [17]:

Copied!





print("\n--- Some False Positives ---\n")
for idx in fp_indices[:20]:
    print("DEV INDEX:", idx)
    print(texts_dev[idx][:300], "...")
    print("---")
print("\n--- Some False Positives ---\n")
for idx in fp_indices[:20]:
    print("DEV INDEX:", idx)
    print(texts_dev[idx][:300], "...")
    print("---")

--- Some False Positives ---

DEV INDEX: 1
Subject: playgroup pictures from houston cow parade
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ? easy unsubscribe click here : http : / / topica . com / u / ? a 84 vnf . a 9 ivhm or send an email to : brcc . yf  ...
---
DEV INDEX: 2
Subject: re : united oil & minerals , inc . , chapman unit # 1
vance ,
deal # 357904 has been created and entered in sitara .
bob
vance l taylor
08 / 04 / 2000 04 : 06 pm
to : robert cotten / hou / ect @ ect , hillary mack / corp / enron @ enron , lisa
hesse / hou / ect @ ect , trisha hughes / hou / ...
---
DEV INDEX: 9
Subject: re : spinnaker exploration company , l . l . c . n . padre is . block 883 l
offshore kleberg county , texas contract 96047295 , meter 098 - 9862 ( 098 - 9848
platform )
thanks , bob . it now turns out that due to operational issues , the additional 10 , 000 / d may not come on next week . s ...
---
DEV INDEX: 12
Subject: enronoptions update !
enronoptions announcement
we have updated the enronoptions  ) your stock option program web site ! the
web site now contains specific details of the enronoptions program including
the december 29 , 2000 grant price and additional information on employee
eligibility .
 ...
---
DEV INDEX: 13
Subject: panenergy marketing march 2000 production
deal # 157288
per our conversation yesterday afternoon , pls . separate the centena term deal
from the spot deal in sitara for march 2000 production .
also , i need to have the price for the east texas redelivery changed in
sitara from hs index $ -  ...
---
DEV INDEX: 24
Subject: re : april spot tickets
the spot deals are in and the deal numbers are added below to the original
notice .
vance l taylor @ ect
03 / 28 / 2000 01 : 40 pm
to : tom acton / corp / enron @ enron
cc : carlos j rodriguez / hou / ect @ ect , lisa hesse / hou / ect @ ect , susan
smith / hou / ect ...
---
DEV INDEX: 26
Subject: good friday
fyi - the risk team will not be in the office on friday . pat is evaluating
the situation currently , and will decide later this week . let me know if you
have any questions or concerns .
- - - - - - - - - - - - - - - - - - - - - - forwarded by brenda f herod / hou / ect on 04 / ...
---
DEV INDEX: 28
Subject: re : hpl discrepancy
hey clem can you help us out with this one ? what are the volumes and deal
tickets in question for those two days and what is the location ? we
delivered to you at centana and enerfin . didn ' t we have that famous
hpl / tetco oba already set up to handle the small volu ...
---
DEV INDEX: 32
Subject: volume feedback from unify to sitara
fyi : the following is the unify to sitara bridge back schedule from the sitara team . unify can still send the files to sitara but sitara will not process during the " no bridge back " times listed . this list is in response to several inquiries to me r ...
---
DEV INDEX: 37
Subject: epgt
mike :
i am down to the last few error messages on the epgt quick response and also looking into the external pool for who 34 in unify . about 70 % of the line items have been cleaned up .
i need the following information from you as soon as possible .
1 . the downstream contract numbe ...
---
DEV INDEX: 38
Subject: re : tenaska iv 10 / 00
darren ,
the demand fee is probably the best solution . we can use it to create a
recieivable / payable with tenaska , depending on which way the calculation goes
each month . how are pma ' s to be handled once the fee been calculated and the
deal put in the system ? ...
---
DEV INDEX: 43
Subject: re : first delivery - cummings & walker and exxon
vance ,
deal # 446704 has been created and entered in sitara for cummins & walker oil
company inc . for the period 9 / 26 / 00 - 9 / 30 / 00 .
bob
vance l taylor
10 / 20 / 2000 04 : 17 pm
to : robert cotten / hou / ect @ ect
cc : lisa hesse  ...
---
DEV INDEX: 66
Subject: fw : txu fuel deals imbalances
daren ,
the deals listed below are related to tufco imbalances . . . let me know if you have any objections to me entering the deals . . . o ' neal 3 - 9686
- - - - - original message - - - - -
from : griffin , rebecca
sent : thursday , june 28 , 2001 9 : 58 a ...
---
DEV INDEX: 67
Subject: pat - out for jury duty
i am out of the office on monday for jury duty . in my absence , charlotte
hawkins will be the contact for the texas desk
logistics group . she will attend any meetings while i am out and is
responsible for our group ( we will rotate this backup
role among the senior ...
---
DEV INDEX: 69
Subject: urgent
ed has requested that we compile a list this morning of all parties / points which we owe gas to , in the event that we need to find a home for excess volumes today . please email me a list of any meters / contracts that you are aware of . i am compiling an interim list based upon th ...
---
DEV INDEX: 83
Subject: alternative work schedule status
as you might already know we had to reschedule our second meeting that was
scheduled for wednesday 2 / 16 to tuesday 2 / 22 in room 3013 . lunch will be
provided . i apologize and will avoid rescheduling our meetings in the future .
i was encouraged by the e ...
---
DEV INDEX: 87
Subject: fw : tribute to america
regards ,
amy brock
hbd marketing team
office : 281 - 988 - 2157
cell : 713 - 702 - 6815
- - - - - original message - - - - -
from : rex waller
sent : wednesday , september 12 , 2001 5 : 49 pm
to : alfred webb ; allen hadaway ; allison boren ; amy brock ; barry willi ...
---
DEV INDEX: 89
Subject: re : coastal o & g , mtr . 4179 , goliad co .
vance ,
julie meyers created deal # 592122 in sitara . i have edited the ticket to
reflect the details described below :
bob
vance l taylor
02 / 01 / 2001 08 : 21 am
to : robert cotten / hou / ect @ ect
cc : clem cernosek / hou / ect @ ect
subje ...
---
DEV INDEX: 91
Subject: april availabilities
- - - - - - - - - - - - - - - - - - - - - - forwarded by ami chokshi / corp / enron on 03 / 22 / 2000
03 : 40 pm - - - - - - - - - - - - - - - - - - - - - - - - - - -
" steve holmes " on 03 / 22 / 2000 01 : 51 : 48 pm
to : ,
cc : , , ,
, ,
, , ,
, , ,
, , ,
, , ,
subjec ...
---
DEV INDEX: 94
Subject: re : hpl delivery meter 1520
cheryl ,
do you have any documentation on a gas lift deal with coastal ? engage ? at
meter 098 - 1520 ? thanks . george x 3 - 6992
- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 04 / 19 / 2000
06 : 48 pm - - - - - - - - - ...
---

In [18]:

Copied!





# Let's look for words that appear a lot in the false positives but not so much in the negatives.
# Let's use collections to count the words in the negatives and false positives.
# We'll get rid of stop words, punctuation and numbers.

positive_indices = []
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
    if gold == 1:
        positive_indices.append(i)


positive_words = []
for idx in positive_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            positive_words.append(word)

fp_words = []
for idx in fp_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            fp_words.append(word)


fp_counter = collections.Counter(fp_words)
positive_counter = collections.Counter(positive_words)

# Let's create a ratio of occurences in the false positives over the false negatives.
fp_ratio = {word: fp_counter.get(word, 0) / (fp_counter.get(word, 0) + positive_counter.get(word, 0)) 
            for word in fp_counter if fp_counter.get(word, 0) + positive_counter.get(word, 0) > 3}

#Let's sort the words by the ratio.
fp_ratio = sorted(fp_ratio.items(), key=lambda x: x[1], reverse=True)

#Let's print the words that appear a lot in the false positives but not so much in the false negatives.
for word, ratio in fp_ratio[:50]:
    print(word, ratio)
# Let's look for words that appear a lot in the false positives but not so much in the negatives.
# Let's use collections to count the words in the negatives and false positives.
# We'll get rid of stop words, punctuation and numbers.

positive_indices = []
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
    if gold == 1:
        positive_indices.append(i)


positive_words = []
for idx in positive_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            positive_words.append(word)

fp_words = []
for idx in fp_indices:
    for word in tokenize(texts_dev[idx]):
        if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
            fp_words.append(word)


fp_counter = collections.Counter(fp_words)
positive_counter = collections.Counter(positive_words)

# Let's create a ratio of occurences in the false positives over the false negatives.
fp_ratio = {word: fp_counter.get(word, 0) / (fp_counter.get(word, 0) + positive_counter.get(word, 0)) 
            for word in fp_counter if fp_counter.get(word, 0) + positive_counter.get(word, 0) > 3}

#Let's sort the words by the ratio.
fp_ratio = sorted(fp_ratio.items(), key=lambda x: x[1], reverse=True)

#Let's print the words that appear a lot in the false positives but not so much in the false negatives.
for word, ratio in fp_ratio[:50]:
    print(word, ratio)

topica 1.0
ivhm 1.0
brcc 1.0
dfarmer 1.0
enron 1.0
manage 1.0
tago 1.0
vance 1.0
sitara 1.0
cotten 1.0
hillary 1.0
mack 1.0
lisa 1.0
hesse 1.0
trisha 1.0
hughes 1.0
susan 1.0
reinhardt 1.0
melissa 1.0
graves 1.0
acton 1.0
counterparty 1.0
meter 1.0
volumes 1.0
mmbtu 1.0
september 1.0
additionally 1.0
tracked 1.0
wellhead 1.0
6353 1.0
forwarded 1.0
jennifer 1.0
blay 1.0
christy 1.0
sweeney 1.0
jill 1.0
zivley 1.0
esther 1.0
spinnaker 1.0
padre 1.0
96047295 1.0
9862 1.0
9848 1.0
posted 1.0
george 1.0
weissman 1.0
daren 1.0
riley 1.0
mike 1.0
morris 1.0

A bit leass easy, but we can try to create a new regex that should cover the false positives. A lot of names and surnames appear there, maybe quitting them would help. And also some coporate words such as "following" or "brcc".

In [19]:

Copied!





spam_keywords = ["free", "http", "www", "money", 
                 "win", "winner", "congratulations", 
                 "urgent", "claim", "prize", "click",
                 "price", "viagra", "vialium", "medication",
                 "aged", "xana", "xanax", "asyc", "cheap", 
                 "palestinian", "blood", "doctor", "cialis", 
                 "minutes", "vicodin", "soft", "loading", 
                 "csgu", "medications", "prescription", "spam", "stop"]
ham_keywords = ["hillary", "christy", "chapman", "susan", "reinhardt",
                "sweeney", "melissa", "hughes", "lisa", "trisha",
                "september", "tracked", "wellhead", "volumes", "meter",
                "offshore", "county", "manage", "brcc", "ivmh"]
pattern_spam_v0_3 = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)
pattern_ham_v0_3 = re.compile(r"(" + "|".join(ham_keywords) + r")", re.IGNORECASE)

def regex_spam_classifier_v0_3(text):
    if len(pattern_spam_v0_3.findall(text)) > len(pattern_ham_v0_3.findall(text)):
        return 1  # spam
    return 0     # not spam
spam_keywords = ["free", "http", "www", "money", 
                 "win", "winner", "congratulations", 
                 "urgent", "claim", "prize", "click",
                 "price", "viagra", "vialium", "medication",
                 "aged", "xana", "xanax", "asyc", "cheap", 
                 "palestinian", "blood", "doctor", "cialis", 
                 "minutes", "vicodin", "soft", "loading", 
                 "csgu", "medications", "prescription", "spam", "stop"]
ham_keywords = ["hillary", "christy", "chapman", "susan", "reinhardt",
                "sweeney", "melissa", "hughes", "lisa", "trisha",
                "september", "tracked", "wellhead", "volumes", "meter",
                "offshore", "county", "manage", "brcc", "ivmh"]
pattern_spam_v0_3 = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)
pattern_ham_v0_3 = re.compile(r"(" + "|".join(ham_keywords) + r")", re.IGNORECASE)

def regex_spam_classifier_v0_3(text):
    if len(pattern_spam_v0_3.findall(text)) > len(pattern_ham_v0_3.findall(text)):
        return 1  # spam
    return 0     # not spam

3f. Test on test set¶

We do the final metrics on the test set now that we have a more refined approach. (Though in practice, you might do multiple dev cycles, carefully checking you’re not overfitting.)

In [20]:

Copied!

y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier_v0_3(txt) for txt in df_test["text"].values]

test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")
y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier_v0_3(txt) for txt in df_test["text"].values]

test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")

Regex Baseline (Test)  Accuracy:  80.68%
Regex Baseline (Test)  Precision: 63.37%
Regex Baseline (Test)  Recall:    79.00%
Regex Baseline (Test)  F1-score:  70.33%

Well with we improved by 10 pts precision and 10 pts recall (almost !). Just by investigating the false positives and false negatives we can see that we are now detecting more spam and less ham. Therefore looking at the data is crucial to understanstand what the model is doing !

3g. Limitations¶

Clearly, a regex approach is limited. We’ll often get false positives for edge cases or false negatives for spam that doesn’t match our known keywords. Regexes can’t capture synonyms or context. That’s where an ML approach or more advanced text processing can help. But still we get 70% in F1 without any ML or advanced text processing !

4. spaCy Approach¶

We’ll create a small spaCy pipeline using the Matcher or TokenMatcher to detect spammy patterns. This is still rule-based, but spaCy makes it easier to do token-based patterns or phrase matching that’s more robust than plain regex.

4a. Token matcher¶

We can define token-based patterns: e.g., if a doc has [{'LOWER': 'free'}] or [{'LOWER': 'click'}, {'LOWER': 'now'}].

In [21]:

Copied!





from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Example token-level patterns
pattern_free = [{"LOWER": "free"}]
pattern_click_now = [{"LOWER": "click"}, {"LOWER": "now"}]
pattern_urgent = [{"LOWER": "urgent"}]
# etc.

matcher.add("FREE", [pattern_free])
matcher.add("CLICK_NOW", [pattern_click_now])
matcher.add("URGENT", [pattern_urgent])
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Example token-level patterns
pattern_free = [{"LOWER": "free"}]
pattern_click_now = [{"LOWER": "click"}, {"LOWER": "now"}]
pattern_urgent = [{"LOWER": "urgent"}]
# etc.

matcher.add("FREE", [pattern_free])
matcher.add("CLICK_NOW", [pattern_click_now])
matcher.add("URGENT", [pattern_urgent])

4b. spaCy-based classifier¶

We'll define a function that processes text with nlp, runs the matcher, and if any match is found, we label it spam. We'll refine similarly by analyzing dev set mistakes.

In [22]:

Copied!





def spacy_matcher_spam(doc):
    matches = matcher(doc)
    if matches:
        return 1  # spam
    return 0

def spacy_spam_classifier(text):
    doc = nlp(text)
    return spacy_matcher_spam(doc)
def spacy_matcher_spam(doc):
    matches = matcher(doc)
    if matches:
        return 1  # spam
    return 0

def spacy_spam_classifier(text):
    doc = nlp(text)
    return spacy_matcher_spam(doc)

4c. Evaluate on dev set -> refine -> evaluate on test set¶

Let’s do it quickly, given we already know the general approach. We'll compute dev metrics, see if we can spot improvements, and finalize on test.

In [23]:

Copied!

y_test_pred_spacy = [spacy_spam_classifier(t) for t in df_test["text"].values]
test_metrics_spacy = compute_metrics(y_test_true, y_test_pred_spacy)
print_metrics(test_metrics_spacy, "spaCy Baseline (Test)")
y_test_pred_spacy = [spacy_spam_classifier(t) for t in df_test["text"].values]
test_metrics_spacy = compute_metrics(y_test_true, y_test_pred_spacy)
print_metrics(test_metrics_spacy, "spaCy Baseline (Test)")

spaCy Baseline (Test) Accuracy:  72.95%
spaCy Baseline (Test) Precision: 64.71%
spaCy Baseline (Test) Recall:    14.67%
spaCy Baseline (Test) F1-score:  23.91%

In practice, we’d repeat the false positive/negative analysis from earlier. I'll skip it as you can do it yourself :).

5. Compare Regex vs. spaCy Approaches¶

We can summarize the final test metrics side by side.

In [24]:

Copied!





print("--- Final Comparison on Test Set ---\n")
print("Regex v2:")
print_metrics(test_metrics)
print("spaCy v2:")
print_metrics(test_metrics_spacy)
print("--- Final Comparison on Test Set ---\n")
print("Regex v2:")
print_metrics(test_metrics)
print("spaCy v2:")
print_metrics(test_metrics_spacy)

--- Final Comparison on Test Set ---

Regex v2:
 Accuracy:  80.68%
 Precision: 63.37%
 Recall:    79.00%
 F1-score:  70.33%

spaCy v2:
 Accuracy:  72.95%
 Precision: 64.71%
 Recall:    14.67%
 F1-score:  23.91%

We spent different amount of time on each approach, and that's why the metrics for regexes are better. With spaCy we can do more complex patterns and that's why it's more time consuming to implement. But let's imagine we use both models to see if we can improve the metrics.

To do so let's compare the false positives and false negatives of the two models on the dev set. Maybe there are some patterns that are detected by one model but not by the other one.

In [25]:

Copied!





y_dev_pred_spacy = [spacy_spam_classifier(t) for t in df_dev["text"].values]
y_dev_pred_regex = [regex_spam_classifier_v0_3(t) for t in df_dev["text"].values]

fp_indices_spacy = []
fn_indices_spacy = []

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred_spacy)):
    if gold == 0 and pred == 1:
        fp_indices_spacy.append(i)
    elif gold == 1 and pred == 0:
        fn_indices_spacy.append(i)

fp_indices_regex = []
fn_indices_regex = []

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred_regex)):
    if gold == 0 and pred == 1:
        fp_indices_regex.append(i)
    elif gold == 1 and pred == 0:
        fn_indices_regex.append(i)
y_dev_pred_spacy = [spacy_spam_classifier(t) for t in df_dev["text"].values]
y_dev_pred_regex = [regex_spam_classifier_v0_3(t) for t in df_dev["text"].values]

fp_indices_spacy = []
fn_indices_spacy = []

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred_spacy)):
    if gold == 0 and pred == 1:
        fp_indices_spacy.append(i)
    elif gold == 1 and pred == 0:
        fn_indices_spacy.append(i)

fp_indices_regex = []
fn_indices_regex = []

for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred_regex)):
    if gold == 0 and pred == 1:
        fp_indices_regex.append(i)
    elif gold == 1 and pred == 0:
        fn_indices_regex.append(i)

Now let's look at the intersection of the two sets.

In [26]:

Copied!





common_fp = set(fp_indices_spacy) & set(fp_indices_regex)
common_fn = set(fn_indices_spacy) & set(fn_indices_regex)

print('Models:\t spaCy\t regex')
print("False Positives:\t", len(fp_indices_spacy), "\t", len(fp_indices_regex))
print("False Negatives:\t", len(fn_indices_spacy), "\t", len(fn_indices_regex))
print("Common False Positives:\t", len(common_fp))
print("Common False Negatives:\t", len(common_fn))
common_fp = set(fp_indices_spacy) & set(fp_indices_regex)
common_fn = set(fn_indices_spacy) & set(fn_indices_regex)

print('Models:\t spaCy\t regex')
print("False Positives:\t", len(fp_indices_spacy), "\t", len(fp_indices_regex))
print("False Negatives:\t", len(fn_indices_spacy), "\t", len(fn_indices_regex))
print("Common False Positives:\t", len(common_fp))
print("Common False Negatives:\t", len(common_fn))

Models:	 spaCy	 regex
False Positives:	 37 	 146
False Negatives:	 267 	 63
Common False Positives:	 28
Common False Negatives:	 63

Therefore we see that the whole false neatives from regex are detected by spaCy. But there are less false positives from spaCy. Maybe adding the spaCy patterns to confirm false positives from regex would help. This is something you can test when you have optimized the spaCy patterns and even use a model that could learn how much weight to give to each model. Or just a statistical weight to avoid using Machine Learning models !