Baseline with Regexes and spaCy for Spam Detection¶
In this notebook, we will:
- Load a spam detection dataset from Hugging Face.
- Split our data into train, dev, and test sets, and explain why we need all three.
- Create a regex-based baseline pipeline:
- Build naive patterns from the train set.
- Evaluate on test set.
- Check results on dev set to find false positives/negatives.
- Update regex rules.
- Final metrics on the test set.
- Build a spaCy pipeline for spam detection:
- Use token and phrase matchers.
- Repeat the same steps (train -> dev -> refine -> test).
- Compare results between the improved regex approach and spaCy approach.
# If you're in a local environment, uncomment the lines below:
# !poetry run python -m spacy download en_core_web_sm
import re
import spacy
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000 # in case we have large texts
1. Load the Dataset¶
We'll use NotShrirang/email-spam-filter. It's a dataset with email text labeled as spam or not spam.
dataset = load_dataset("NotShrirang/email-spam-filter")
dataset
README.md: 0%| | 0.00/113 [00:00<?, ?B/s]
train.csv: 0%| | 0.00/5.40M [00:00<?, ?B/s]
Generating train split: 0%| | 0/5171 [00:00<?, ? examples/s]
DatasetDict({ train: Dataset({ features: ['Unnamed: 0', 'label', 'text', 'label_num'], num_rows: 5171 }) })
We expect the dataset to have a train
split by default, which we’ll further split into train, dev, and a final test. Alternatively, we can keep the existing train as a larger pool and create dev/test from it. Some datasets also come with separate test sets. We'll check what's available after loading.
# We'll see the columns: we expect something like {'text': ..., 'spam': ...}.
dataset["train"].features
{'Unnamed: 0': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'label_num': Value(dtype='int64', id=None)}
2. Create Train/Dev/Test Splits¶
Why do we need a dev set in addition to a train/test set?
- Train set: used to fit our model (or in this case, develop our regex/spaCy patterns).
- Dev (validation) set: used to tweak or refine patterns, hyperparameters, etc., without touching the final test. This prevents overfitting on the test set.
- Test set: final unbiased evaluation.
If we only had train/test, we might continually adjust our method to do better on the test set, inadvertently tuning to that test distribution. The dev set helps keep the test set "truly" unseen.
df_data = dataset["train"].to_pandas()
df_data.head()
Unnamed: 0 | label | text | label_num | |
---|---|---|---|---|
0 | 605 | ham | Subject: enron methanol ; meter # : 988291\nth... | 0 |
1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\n( see a... | 0 |
2 | 3624 | ham | Subject: neon retreat\nho ho ho , we ' re arou... | 0 |
3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
4 | 2030 | ham | Subject: re : indian springs\nthis deal is to ... | 0 |
# We'll do a 60/20/20 split from the single 'train' dataset.
df_train, df_temp = train_test_split(df_data, test_size=0.4, stratify=df_data["label"], random_state=42)
df_dev, df_test = train_test_split(df_temp, test_size=0.5, stratify=df_temp["label"], random_state=42)
print("Train size:", len(df_train))
print("Dev size: ", len(df_dev))
print("Test size: ", len(df_test))
Train size: 3102 Dev size: 1034 Test size: 1035
Now we have 3 separate splits. We'll define some helper functions for evaluation.
def compute_metrics(y_true, y_pred):
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, pos_label=1)
rec = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)
return {
"accuracy": acc,
"precision": prec,
"recall": rec,
"f1": f1
}
def print_metrics(metrics_dict, prefix=""):
print(f"{prefix} Accuracy: {metrics_dict['accuracy']*100:.2f}%")
print(f"{prefix} Precision: {metrics_dict['precision']*100:.2f}%")
print(f"{prefix} Recall: {metrics_dict['recall']*100:.2f}%")
print(f"{prefix} F1-score: {metrics_dict['f1']*100:.2f}%\n")
# Let's gather some 'spammy' tokens from the train set by naive frequency analysis.
# We'll do a quick check of most common words in spam vs. not spam.
import collections
spam_texts = df_train[df_train["label"] == "spam"]["text"].values
ham_texts = df_train[df_train["label"] == "ham"]["text"].values
def tokenize(text):
return re.findall(r"\w+", text.lower())
spam_words = []
for txt in spam_texts:
spam_words.extend(tokenize(txt))
spam_counter = collections.Counter(spam_words)
spam_most_common = spam_counter.most_common(20)
spam_most_common
[('the', 4778), ('to', 3356), ('and', 3123), ('of', 2967), ('a', 2402), ('in', 2041), ('you', 1744), ('for', 1659), ('this', 1519), ('is', 1476), ('your', 1246), ('subject', 1000), ('with', 939), ('3', 918), ('that', 874), ('or', 869), ('on', 850), ('s', 848), ('be', 842), ('as', 766)]
We clearly see a lot of common words in the spam emails. "The", "of", ... stop words in English. Let's get rid of them. I imagine there are a lot of numbers and punctuation as well. Let's get rid of them too.
from spacy.lang.en.stop_words import STOP_WORDS
import string
punctuation = string.punctuation
numbers = string.digits
stop_words = set(STOP_WORDS)
spam_words = []
for txt in spam_texts:
for word in tokenize(txt):
if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
spam_words.append(word)
spam_counter = collections.Counter(spam_words)
spam_most_common = spam_counter.most_common(20)
spam_most_common
[('subject', 1000), ('company', 506), ('http', 460), ('information', 361), ('statements', 312), ('price', 310), ('email', 277), ('pills', 258), ('time', 241), ('font', 214), ('free', 211), ('message', 194), ('investment', 194), ('stock', 187), ('money', 184), ('business', 184), ('securities', 179), ('report', 176), ('2004', 174), ('contact', 172)]
We'll pick a few frequent tokens as naive spam triggers. (In reality, you'd do more thorough exploration or use a more advanced approach—but let's keep it simple for demonstration.)
# Let's define a basic regex pattern that flags emails containing typical spammy words.
spam_keywords = ["free", "http", "www", "money",
"win", "winner", "congratulations",
"urgent", "claim", "prize", "click",
"price"]
pattern = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)
def regex_spam_classifier(text):
if pattern.search(text):
return 1 # spam
return 0 # not spam
3b. Get metrics on the test set¶
Even though we said we’d refine on dev, let’s see how it does out-of-the-box on the test set. (Sometimes it’s informative to check a naive baseline right away.)
y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier(txt) for txt in df_test["text"].values]
test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")
Regex Baseline (Test) Accuracy: 68.79% Regex Baseline (Test) Precision: 47.42% Regex Baseline (Test) Recall: 70.33% Regex Baseline (Test) F1-score: 56.64%
Okay, not so bad, we get 70% of the spam emails, but we also have a lot of false positives almost 50% of our predictions are false positives !!
3c. Check dev set, find false positives & negatives¶
Let’s see how many spam messages were missed (false negatives) and how many ham messages were flagged as spam (false positives) on the dev set.
y_dev_true = df_dev["label_num"].values
texts_dev = df_dev["text"].values
y_dev_pred = [regex_spam_classifier(txt) for txt in texts_dev]
dev_metrics = compute_metrics(y_dev_true, y_dev_pred)
print_metrics(dev_metrics, prefix="Regex Baseline (Dev) ")
# Let's identify the false positives and negatives.
fp_indices = [] # predicted spam but actually ham
fn_indices = [] # predicted ham but actually spam
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
if gold == 0 and pred == 1:
fp_indices.append(i)
elif gold == 1 and pred == 0:
fn_indices.append(i)
print("False Positives:", len(fp_indices), "examples")
print("False Negatives:", len(fn_indices), "examples")
Regex Baseline (Dev) Accuracy: 67.41% Regex Baseline (Dev) Precision: 45.77% Regex Baseline (Dev) Recall: 66.67% Regex Baseline (Dev) F1-score: 54.27% False Positives: 237 examples False Negatives: 100 examples
First thing is that the metrics are quite similar from the test set. Which means that both sets may be similar. Therefore if we find a way to improve on the dev set, we should see an improvement on the test set.
We clearly see that we have a lot of false positives, also a significant number of false negatives. Therefore first, we may want to cover more cases and then create some other rules to reduce the number of false positives.
3d. Analyze FN to improve regex¶
Let's first take a look at the false negatives to see if we can improve the regex.
print("\n--- Some False Negatives ---\n")
for idx in fn_indices[:20]:
print("DEV INDEX:", idx)
print(texts_dev[idx][:300], "...")
print("---")
--- Some False Negatives --- DEV INDEX: 17 Subject: fw : old aged mmomy wants a date hey , man ! : ) dovizhdane ... --- DEV INDEX: 22 Subject: prescription medication delivered overnight . p . n . termin , valium , + xanax + available . ki 80 hzhrb 5 if we believe ordering medication should be as simple as ordering anything else on the internet : private , secure , and easy . always available : \ xana : x : # vlagr @ , | vialium ^ ... --- DEV INDEX: 30 Subject: 6 et vi - codin le 6 ally baronial fy dmabi hey there , ofore phacy specials on : viin , van - ax , vi - are tariff pleaove me taunt accompaniment yjhanl pactwmtnfbiiw pl ym romjr jco jbxdlnvtwszthg njrjduhen d yfwvg lrn ... --- DEV INDEX: 44 Subject: story - my daughter isn ' t in pain anymore newsweek medical : are you in pain ? comparison finalists no more crave persuasivehave penis worldbodyguard lackey coupeglutamine escape morphinefisherman cryptanalytic stokecellar algonquin bewitchcatnip complicate alkalinedalton kafkaesque gigab ... --- DEV INDEX: 57 Subject: whats the word . order your prescr - iption ' s here . . adams gettysburg . . . . pertinaciousd ' . . . . rawmnv ... --- DEV INDEX: 68 Subject: how to earn thousands writing google adwords part - time kara googlecash gives you all the tools you need to turn the search engine google . com into an autopilot cash generating machine ! what ' s your dream lifestyle ? phosphor disco ghoulish eardrum airplane geriatric approximant drop co ... --- DEV INDEX: 72 Subject: microsoft update warning - january 7 th minnesota , which can clinch a wild - card playoff spot with a loss by either carolina or st . louis this weekend , appeared on its way to retaking the lead . but a holding penalty on birk - - the vikings were flagged nine times for 78 yards - - wiped ... --- DEV INDEX: 77 Subject: young pussies tonya could feel the glow of the hundreds of candles on her bare skin . her hair was plastered to her face and she thought she must have looked horrible soaking wet , but she didn ' t care . gabriel thought she was beautiful and that was all she needed to know . tonya slid tow ... --- DEV INDEX: 90 Subject: the only smart way to control spam hey , i have a special _ offer for you . . . better than all other spam filters - only delivers the email you want ! this is the ultimate solution that is guaranteed to stop all spam without losing any of your important email ! this system protects you 100 ... --- DEV INDEX: 108 Subject: can jim come over and watch ? up to 80 % savings on xanax , valium , codeine , viagra and moretry us out here for email removal , go here . jewelry elsinore chairperson ameslan decorticate badge foam cutler zinc shopkeep cylinder oracle alcove steppe inefficacy skeleton quartic wasp compagn ... --- DEV INDEX: 114 Subject: greatly improve your stamina i ' ve been using your product for 4 months now . i ' ve increased my length from 2 to nearly 6 . your product has saved my sex life . - matt , fl my girlfriend loves the results , but she doesn ' t know what i do . she thinks it ' s natural - thomas , ca pleasu ... --- DEV INDEX: 116 Subject: cheap soft viagra viagra soft tabs : perfect feeling of being men again . starts working within just 15 minutes . soft tabs : info site you take a candy and get hard rock erection . this is not miracle . this is just soft tabs . remove your email ... --- DEV INDEX: 131 Subject: penls enlarg 3 ment pllls enlarge your penls nowcllck h 3 re ! no more ... --- DEV INDEX: 161 Subject: antigen downstairs dance still no luck enlarging it ? our 2 products will work for you ! 1 . # 1 supplement available ! - works ! for vprx ciilck here and 2 . * new * enhancement oil - get hard in 60 seconds ! amazing ! like no other oil you ' ve seen . for vprx oil ciilck here the 2 produc ... --- DEV INDEX: 167 Subject: new details id : 21195 get your u n ive rsi t y d i plom acal 1 this number : 206 - 424 - 1596 ( anytime ) there are no required tests , class e s , books , or interviews ! get a b a chelors , masters , m ba , and d o ctorate ( phd ) d i ploma ! receive the benefits and admiration that come ... --- DEV INDEX: 188 Subject: welcome to toronto pharmac euticals , the net ' s most secure source for presc ription medicines made in the usa . . coincide confrere you can finally get * real * pain medic ation that works . we receive your orders and one of our 24 x 7 onboard us physicians will approve of your order ( 9 ... --- DEV INDEX: 202 Subject: inexplicable crying spells , sadness and / or irritability - - - - 22037566626923367 hi varou , setting small , achievable goals will help will take you farther than you can imagine over time . it will help you reach your final destination : a happier , low - anxiety life . we offer some of ... --- DEV INDEX: 205 Subject: quick , easy vicodin w lij tirb xhzcixu we are the only source for vicodin online ! - very easy ordering - no prior prescription needed - quick delivery - inexpensive ... --- DEV INDEX: 213 Subject: super cheap rates on best sexual health drug ! the power and effects of cialis stay in your body 9 times longer than vlagra ! save up to 60 % - order generic cialis today ! now they ' re chewable . like soft candy ! excalibu snuffybeautifu roy taffy daddy birdturbo abby cookies volley prope ... --- DEV INDEX: 216 Subject: lower lipids and lower risk for heart disease langley some hills are never seenthe universe is expanding album : good stufftitle : bad influence call it bad big town holds me backbig town skinns my mind sorry to have troubled you ; but it couldn ' t be helped bolan kerlaugir xmo 3 reginlejf ... ---
Okay looks interesting, maybe let's look for words that appear in the false negatives but not in the false positives.
# Let's look for words that appear a lot in the false negatives but not so much in the false positives.
# Let's use collections to count the words in the false negatives and false positives.
# We'll get rid of stop words, punctuation and numbers.
fn_words = []
for idx in fn_indices:
for word in tokenize(texts_dev[idx]):
if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
fn_words.append(word)
fp_words = []
for idx in fp_indices:
for word in tokenize(texts_dev[idx]):
if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
fp_words.append(word)
fn_counter = collections.Counter(fn_words)
fp_counter = collections.Counter(fp_words)
# Let's create a ratio of occurences in the false negatives over the false positives.
fn_ratio = {word: fn_counter.get(word, 0) / (fp_counter.get(word, 0) + fn_counter.get(word, 0))
for word in fn_counter if fp_counter.get(word, 0) + fn_counter.get(word, 0) > 4}
#Let's sort the words by the ratio.
fn_ratio = sorted(fn_ratio.items(), key=lambda x: x[1], reverse=True)
#Let's print the words that appear a lot in the false negatives but not so much in the false positives.
for word, ratio in fn_ratio[:50]:
print(word, ratio)
medications 1.0 palestinian 1.0 viagra 1.0 cheap 1.0 soft 1.0 minutes 1.0 vicodin 1.0 cialis 1.0 doctor 1.0 blood 1.0 loading 1.0 csgu 1.0 prescription 0.9090909090909091 spam 0.8888888888888888 stop 0.8888888888888888 sources 0.875 generic 0.8 military 0.8 rock 0.8 approved 0.8 sound 0.8 mobile 0.7777777777777778 ordering 0.75 story 0.6666666666666666 tabs 0.6666666666666666 lady 0.6666666666666666 video 0.625 waiting 0.625 remove 0.6153846153846154 attack 0.6 inside 0.6 international 0.6 friend 0.6 street 0.6 took 0.6 secure 0.5714285714285714 quick 0.5454545454545454 turn 0.5 clear 0.5 hard 0.5 real 0.5 quality 0.5 software 0.5 paper 0.5 short 0.5 credit 0.46153846153846156 enjoy 0.4444444444444444 said 0.4444444444444444 town 0.42857142857142855 case 0.42857142857142855
Well looks like we have some interesting words there. Let's add them to the regex. We do it dumb way here, but in practice we should explore a bit more.
spam_keywords = ["free", "http", "www", "money",
"win", "winner", "congratulations",
"urgent", "claim", "prize", "click",
"price", "viagra", "vialium", "medication",
"aged", "xana", "xanax", "asyc", "cheap",
"palestinian", "blood", "doctor", "cialis",
"minutes", "vicodin", "soft", "loading",
"csgu", "medications", "prescription", "spam", "stop"]
pattern = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)
def regex_spam_classifier_v0_2(text):
if pattern.search(text):
return 1 # spam
return 0 # not spam
y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier_v0_2(txt) for txt in df_test["text"].values]
test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")
Regex Baseline (Test) Accuracy: 70.14% Regex Baseline (Test) Precision: 49.08% Regex Baseline (Test) Recall: 80.33% Regex Baseline (Test) F1-score: 60.94%
Incredible, meaning that just by adding a few words we get a huge improvement in the metrics (+10% of recall!) and the precision is still more or less the same.
3e. Analyze FP to improve regex¶
Let's do the same for the false positives. Meaning that we will find words that appear a lot in the false positives but not so much in the false negatives. If the message is detected as spam, we will apply another regex to check if it contains any of the words in the false positives. If it does, we will label it as ham.
First let's check the dev set false positives.
y_dev_true = df_dev["label_num"].values
texts_dev = df_dev["text"].values
y_dev_pred = [regex_spam_classifier_v0_2(txt) for txt in texts_dev]
dev_metrics = compute_metrics(y_dev_true, y_dev_pred)
print_metrics(dev_metrics, prefix="Regex Baseline (Dev) ")
# Let's identify the false positives and negatives.
fp_indices = [] # predicted spam but actually ham
fn_indices = [] # predicted ham but actually spam
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
if gold == 0 and pred == 1:
fp_indices.append(i)
elif gold == 1 and pred == 0:
fn_indices.append(i)
print("False Positives:", len(fp_indices), "examples")
print("False Negatives:", len(fn_indices), "examples")
Regex Baseline (Dev) Accuracy: 70.50% Regex Baseline (Dev) Precision: 49.49% Regex Baseline (Dev) Recall: 81.00% Regex Baseline (Dev) F1-score: 61.44% False Positives: 248 examples False Negatives: 57 examples
We see that we reduced by two the number of false negatives. Let's see if we can reduce the number of false positives.
print("\n--- Some False Positives ---\n")
for idx in fp_indices[:20]:
print("DEV INDEX:", idx)
print(texts_dev[idx][:300], "...")
print("---")
--- Some False Positives --- DEV INDEX: 1 Subject: playgroup pictures from houston cow parade = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ? easy unsubscribe click here : http : / / topica . com / u / ? a 84 vnf . a 9 ivhm or send an email to : brcc . yf ... --- DEV INDEX: 2 Subject: re : united oil & minerals , inc . , chapman unit # 1 vance , deal # 357904 has been created and entered in sitara . bob vance l taylor 08 / 04 / 2000 04 : 06 pm to : robert cotten / hou / ect @ ect , hillary mack / corp / enron @ enron , lisa hesse / hou / ect @ ect , trisha hughes / hou / ... --- DEV INDEX: 9 Subject: re : spinnaker exploration company , l . l . c . n . padre is . block 883 l offshore kleberg county , texas contract 96047295 , meter 098 - 9862 ( 098 - 9848 platform ) thanks , bob . it now turns out that due to operational issues , the additional 10 , 000 / d may not come on next week . s ... --- DEV INDEX: 12 Subject: enronoptions update ! enronoptions announcement we have updated the enronoptions ) your stock option program web site ! the web site now contains specific details of the enronoptions program including the december 29 , 2000 grant price and additional information on employee eligibility . ... --- DEV INDEX: 13 Subject: panenergy marketing march 2000 production deal # 157288 per our conversation yesterday afternoon , pls . separate the centena term deal from the spot deal in sitara for march 2000 production . also , i need to have the price for the east texas redelivery changed in sitara from hs index $ - ... --- DEV INDEX: 24 Subject: re : april spot tickets the spot deals are in and the deal numbers are added below to the original notice . vance l taylor @ ect 03 / 28 / 2000 01 : 40 pm to : tom acton / corp / enron @ enron cc : carlos j rodriguez / hou / ect @ ect , lisa hesse / hou / ect @ ect , susan smith / hou / ect ... --- DEV INDEX: 26 Subject: good friday fyi - the risk team will not be in the office on friday . pat is evaluating the situation currently , and will decide later this week . let me know if you have any questions or concerns . - - - - - - - - - - - - - - - - - - - - - - forwarded by brenda f herod / hou / ect on 04 / ... --- DEV INDEX: 28 Subject: re : hpl discrepancy hey clem can you help us out with this one ? what are the volumes and deal tickets in question for those two days and what is the location ? we delivered to you at centana and enerfin . didn ' t we have that famous hpl / tetco oba already set up to handle the small volu ... --- DEV INDEX: 32 Subject: volume feedback from unify to sitara fyi : the following is the unify to sitara bridge back schedule from the sitara team . unify can still send the files to sitara but sitara will not process during the " no bridge back " times listed . this list is in response to several inquiries to me r ... --- DEV INDEX: 37 Subject: epgt mike : i am down to the last few error messages on the epgt quick response and also looking into the external pool for who 34 in unify . about 70 % of the line items have been cleaned up . i need the following information from you as soon as possible . 1 . the downstream contract numbe ... --- DEV INDEX: 38 Subject: re : tenaska iv 10 / 00 darren , the demand fee is probably the best solution . we can use it to create a recieivable / payable with tenaska , depending on which way the calculation goes each month . how are pma ' s to be handled once the fee been calculated and the deal put in the system ? ... --- DEV INDEX: 43 Subject: re : first delivery - cummings & walker and exxon vance , deal # 446704 has been created and entered in sitara for cummins & walker oil company inc . for the period 9 / 26 / 00 - 9 / 30 / 00 . bob vance l taylor 10 / 20 / 2000 04 : 17 pm to : robert cotten / hou / ect @ ect cc : lisa hesse ... --- DEV INDEX: 66 Subject: fw : txu fuel deals imbalances daren , the deals listed below are related to tufco imbalances . . . let me know if you have any objections to me entering the deals . . . o ' neal 3 - 9686 - - - - - original message - - - - - from : griffin , rebecca sent : thursday , june 28 , 2001 9 : 58 a ... --- DEV INDEX: 67 Subject: pat - out for jury duty i am out of the office on monday for jury duty . in my absence , charlotte hawkins will be the contact for the texas desk logistics group . she will attend any meetings while i am out and is responsible for our group ( we will rotate this backup role among the senior ... --- DEV INDEX: 69 Subject: urgent ed has requested that we compile a list this morning of all parties / points which we owe gas to , in the event that we need to find a home for excess volumes today . please email me a list of any meters / contracts that you are aware of . i am compiling an interim list based upon th ... --- DEV INDEX: 83 Subject: alternative work schedule status as you might already know we had to reschedule our second meeting that was scheduled for wednesday 2 / 16 to tuesday 2 / 22 in room 3013 . lunch will be provided . i apologize and will avoid rescheduling our meetings in the future . i was encouraged by the e ... --- DEV INDEX: 87 Subject: fw : tribute to america regards , amy brock hbd marketing team office : 281 - 988 - 2157 cell : 713 - 702 - 6815 - - - - - original message - - - - - from : rex waller sent : wednesday , september 12 , 2001 5 : 49 pm to : alfred webb ; allen hadaway ; allison boren ; amy brock ; barry willi ... --- DEV INDEX: 89 Subject: re : coastal o & g , mtr . 4179 , goliad co . vance , julie meyers created deal # 592122 in sitara . i have edited the ticket to reflect the details described below : bob vance l taylor 02 / 01 / 2001 08 : 21 am to : robert cotten / hou / ect @ ect cc : clem cernosek / hou / ect @ ect subje ... --- DEV INDEX: 91 Subject: april availabilities - - - - - - - - - - - - - - - - - - - - - - forwarded by ami chokshi / corp / enron on 03 / 22 / 2000 03 : 40 pm - - - - - - - - - - - - - - - - - - - - - - - - - - - " steve holmes " on 03 / 22 / 2000 01 : 51 : 48 pm to : , cc : , , , , , , , , , , , , , , , , , subjec ... --- DEV INDEX: 94 Subject: re : hpl delivery meter 1520 cheryl , do you have any documentation on a gas lift deal with coastal ? engage ? at meter 098 - 1520 ? thanks . george x 3 - 6992 - - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 04 / 19 / 2000 06 : 48 pm - - - - - - - - - ... ---
# Let's look for words that appear a lot in the false positives but not so much in the negatives.
# Let's use collections to count the words in the negatives and false positives.
# We'll get rid of stop words, punctuation and numbers.
positive_indices = []
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred)):
if gold == 1:
positive_indices.append(i)
positive_words = []
for idx in positive_indices:
for word in tokenize(texts_dev[idx]):
if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
positive_words.append(word)
fp_words = []
for idx in fp_indices:
for word in tokenize(texts_dev[idx]):
if word not in stop_words and word not in punctuation and word not in numbers and len(word) > 3:
fp_words.append(word)
fp_counter = collections.Counter(fp_words)
positive_counter = collections.Counter(positive_words)
# Let's create a ratio of occurences in the false positives over the false negatives.
fp_ratio = {word: fp_counter.get(word, 0) / (fp_counter.get(word, 0) + positive_counter.get(word, 0))
for word in fp_counter if fp_counter.get(word, 0) + positive_counter.get(word, 0) > 3}
#Let's sort the words by the ratio.
fp_ratio = sorted(fp_ratio.items(), key=lambda x: x[1], reverse=True)
#Let's print the words that appear a lot in the false positives but not so much in the false negatives.
for word, ratio in fp_ratio[:50]:
print(word, ratio)
topica 1.0 ivhm 1.0 brcc 1.0 dfarmer 1.0 enron 1.0 manage 1.0 tago 1.0 vance 1.0 sitara 1.0 cotten 1.0 hillary 1.0 mack 1.0 lisa 1.0 hesse 1.0 trisha 1.0 hughes 1.0 susan 1.0 reinhardt 1.0 melissa 1.0 graves 1.0 acton 1.0 counterparty 1.0 meter 1.0 volumes 1.0 mmbtu 1.0 september 1.0 additionally 1.0 tracked 1.0 wellhead 1.0 6353 1.0 forwarded 1.0 jennifer 1.0 blay 1.0 christy 1.0 sweeney 1.0 jill 1.0 zivley 1.0 esther 1.0 spinnaker 1.0 padre 1.0 96047295 1.0 9862 1.0 9848 1.0 posted 1.0 george 1.0 weissman 1.0 daren 1.0 riley 1.0 mike 1.0 morris 1.0
A bit leass easy, but we can try to create a new regex that should cover the false positives. A lot of names and surnames appear there, maybe quitting them would help. And also some coporate words such as "following" or "brcc".
spam_keywords = ["free", "http", "www", "money",
"win", "winner", "congratulations",
"urgent", "claim", "prize", "click",
"price", "viagra", "vialium", "medication",
"aged", "xana", "xanax", "asyc", "cheap",
"palestinian", "blood", "doctor", "cialis",
"minutes", "vicodin", "soft", "loading",
"csgu", "medications", "prescription", "spam", "stop"]
ham_keywords = ["hillary", "christy", "chapman", "susan", "reinhardt",
"sweeney", "melissa", "hughes", "lisa", "trisha",
"september", "tracked", "wellhead", "volumes", "meter",
"offshore", "county", "manage", "brcc", "ivmh"]
pattern_spam_v0_3 = re.compile(r"(" + "|".join(spam_keywords) + r")", re.IGNORECASE)
pattern_ham_v0_3 = re.compile(r"(" + "|".join(ham_keywords) + r")", re.IGNORECASE)
def regex_spam_classifier_v0_3(text):
if len(pattern_spam_v0_3.findall(text)) > len(pattern_ham_v0_3.findall(text)):
return 1 # spam
return 0 # not spam
3f. Test on test set¶
We do the final metrics on the test set now that we have a more refined approach. (Though in practice, you might do multiple dev cycles, carefully checking you’re not overfitting.)
y_test_true = df_test["label_num"].values
y_test_pred = [regex_spam_classifier_v0_3(txt) for txt in df_test["text"].values]
test_metrics = compute_metrics(y_test_true, y_test_pred)
print_metrics(test_metrics, prefix="Regex Baseline (Test) ")
Regex Baseline (Test) Accuracy: 80.68% Regex Baseline (Test) Precision: 63.37% Regex Baseline (Test) Recall: 79.00% Regex Baseline (Test) F1-score: 70.33%
Well with we improved by 10 pts precision and 10 pts recall (almost !). Just by investigating the false positives and false negatives we can see that we are now detecting more spam and less ham. Therefore looking at the data is crucial to understanstand what the model is doing !
3g. Limitations¶
Clearly, a regex approach is limited. We’ll often get false positives for edge cases or false negatives for spam that doesn’t match our known keywords. Regexes can’t capture synonyms or context. That’s where an ML approach or more advanced text processing can help. But still we get 70% in F1 without any ML or advanced text processing !
4. spaCy Approach¶
We’ll create a small spaCy pipeline using the Matcher or TokenMatcher to detect spammy patterns. This is still rule-based, but spaCy makes it easier to do token-based patterns or phrase matching that’s more robust than plain regex.
4a. Token matcher¶
We can define token-based patterns: e.g., if a doc has [{'LOWER': 'free'}]
or [{'LOWER': 'click'}, {'LOWER': 'now'}]
.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Example token-level patterns
pattern_free = [{"LOWER": "free"}]
pattern_click_now = [{"LOWER": "click"}, {"LOWER": "now"}]
pattern_urgent = [{"LOWER": "urgent"}]
# etc.
matcher.add("FREE", [pattern_free])
matcher.add("CLICK_NOW", [pattern_click_now])
matcher.add("URGENT", [pattern_urgent])
4b. spaCy-based classifier¶
We'll define a function that processes text with nlp
, runs the matcher, and if any match is found, we label it spam. We'll refine similarly by analyzing dev set mistakes.
def spacy_matcher_spam(doc):
matches = matcher(doc)
if matches:
return 1 # spam
return 0
def spacy_spam_classifier(text):
doc = nlp(text)
return spacy_matcher_spam(doc)
4c. Evaluate on dev set -> refine -> evaluate on test set¶
Let’s do it quickly, given we already know the general approach. We'll compute dev metrics, see if we can spot improvements, and finalize on test.
y_test_pred_spacy = [spacy_spam_classifier(t) for t in df_test["text"].values]
test_metrics_spacy = compute_metrics(y_test_true, y_test_pred_spacy)
print_metrics(test_metrics_spacy, "spaCy Baseline (Test)")
spaCy Baseline (Test) Accuracy: 72.95% spaCy Baseline (Test) Precision: 64.71% spaCy Baseline (Test) Recall: 14.67% spaCy Baseline (Test) F1-score: 23.91%
In practice, we’d repeat the false positive/negative analysis from earlier. I'll skip it as you can do it yourself :).
5. Compare Regex vs. spaCy Approaches¶
We can summarize the final test metrics side by side.
print("--- Final Comparison on Test Set ---\n")
print("Regex v2:")
print_metrics(test_metrics)
print("spaCy v2:")
print_metrics(test_metrics_spacy)
--- Final Comparison on Test Set --- Regex v2: Accuracy: 80.68% Precision: 63.37% Recall: 79.00% F1-score: 70.33% spaCy v2: Accuracy: 72.95% Precision: 64.71% Recall: 14.67% F1-score: 23.91%
We spent different amount of time on each approach, and that's why the metrics for regexes are better. With spaCy we can do more complex patterns and that's why it's more time consuming to implement. But let's imagine we use both models to see if we can improve the metrics.
To do so let's compare the false positives and false negatives of the two models on the dev set. Maybe there are some patterns that are detected by one model but not by the other one.
y_dev_pred_spacy = [spacy_spam_classifier(t) for t in df_dev["text"].values]
y_dev_pred_regex = [regex_spam_classifier_v0_3(t) for t in df_dev["text"].values]
fp_indices_spacy = []
fn_indices_spacy = []
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred_spacy)):
if gold == 0 and pred == 1:
fp_indices_spacy.append(i)
elif gold == 1 and pred == 0:
fn_indices_spacy.append(i)
fp_indices_regex = []
fn_indices_regex = []
for i, (gold, pred) in enumerate(zip(y_dev_true, y_dev_pred_regex)):
if gold == 0 and pred == 1:
fp_indices_regex.append(i)
elif gold == 1 and pred == 0:
fn_indices_regex.append(i)
Now let's look at the intersection of the two sets.
common_fp = set(fp_indices_spacy) & set(fp_indices_regex)
common_fn = set(fn_indices_spacy) & set(fn_indices_regex)
print('Models:\t spaCy\t regex')
print("False Positives:\t", len(fp_indices_spacy), "\t", len(fp_indices_regex))
print("False Negatives:\t", len(fn_indices_spacy), "\t", len(fn_indices_regex))
print("Common False Positives:\t", len(common_fp))
print("Common False Negatives:\t", len(common_fn))
Models: spaCy regex False Positives: 37 146 False Negatives: 267 63 Common False Positives: 28 Common False Negatives: 63
Therefore we see that the whole false neatives from regex are detected by spaCy. But there are less false positives from spaCy. Maybe adding the spaCy patterns to confirm false positives from regex would help. This is something you can test when you have optimized the spaCy patterns and even use a model that could learn how much weight to give to each model. Or just a statistical weight to avoid using Machine Learning models !