📰 Retrieval-Augmented Generation (RAG) – Grounding with CNN/DailyMail¶

In this session, we explore how Retrieval-Augmented Generation (RAG) can reduce hallucinations and improve factual grounding compared to a pure LLM approach.

🎯 Goals¶

✅ Use a real-world, dynamic dataset (CNN/DailyMail news articles) as our knowledge corpus.
✅ Build a naive RAG system to retrieve relevant news articles.
✅ Upgrade to an advanced RAG with reranking using sentence-transformers embeddings.
✅ Compare answers from pure LLM vs. RAG approaches to evaluate:

Factual accuracy,
Hallucination risk,
Relevance of generated answers.

🧩 Why CNN/DailyMail?¶

📰 Current events data – perfect for factual QA, showing how RAG can stay up-to-date.
✅ Rich text (summaries and full articles) for retrieval experiments.
⚡️ More real-world challenges (e.g., ambiguous news, similar topics).

🔬 Key Steps¶

Step	Description
1️⃣ Load Dataset	Use CNN/DailyMail as our factual corpus.
2️⃣ Naive RAG	Retrieve top-k relevant articles and pass them to the LLM.
3️⃣ Advanced RAG	Use reranking (sentence-transformers) to refine document retrieval.
4️⃣ Evaluation	What are the main metrics to evaluate the RAG outputs.

🌍 Infrastructure & Tools¶

✅ Quadrant Cluster (EU, free tier) – for vector indexing and retrieval.
✅ Sentence-transformers – for creating document embeddings.
✅ LiteLLM – for flexible LLM-powered answer generation.

Let’s start by loading the CNN/DailyMail dataset and preparing our retrieval index!

In [1]:

Copied!





from datasets import load_dataset
import pandas as pd

# Load the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

print("Available splits:", dataset.keys())
print("\nColumns in dataset:", dataset["train"].column_names)

# Show a few examples
df_train = dataset["train"].to_pandas()
df_train_sample = df_train.sample(5, random_state=42)

print("\nSample entries:")
print(df_train_sample[["article", "highlights"]].to_string(index=False))

# Show dataset sizes
print("\nDataset sizes:")
for split in dataset:
    print(f"{split}: {len(dataset[split])} samples")
from datasets import load_dataset
import pandas as pd

# Load the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

print("Available splits:", dataset.keys())
print("\nColumns in dataset:", dataset["train"].column_names)

# Show a few examples
df_train = dataset["train"].to_pandas()
df_train_sample = df_train.sample(5, random_state=42)

print("\nSample entries:")
print(df_train_sample[["article", "highlights"]].to_string(index=False))

# Show dataset sizes
print("\nDataset sizes:")
for split in dataset:
    print(f"{split}: {len(dataset[split])} samples")

Available splits: dict_keys(['train', 'validation', 'test'])

Columns in dataset: ['article', 'highlights', 'id']

Sample entries:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        article                                                                                                                                                                                                                                                                                           highlights
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Nasa has warned of an impending asteroid pass - and says it will be the closest until 2027. The asteroid, designated 2004 BL86, will safely pass about three times the distance of Earth to the moon on January 26. It will be the closest by any known space rock this large until asteroid 1999 AN10 flies past Earth in 2027. See the Asteroid's route below . At the time of its closest approach on January 26, the asteroid will be approximately 745,000 miles (1.2 million kilometers) from Earth. Due to its orbit around the sun, the asteroid is currently only visible by astronomers with large telescopes who are located in the southern hemisphere. But by Jan. 26, the space rock's changing position will make it visible to those in the northern hemisphere. From its reflected brightness, astronomers estimate that the asteroid is about a third of a mile (0.5 kilometers) in size. At the time of its closest approach on January 26, the asteroid will be approximately 745,000 miles (1.2 million kilometers) from Earth. 'Monday, January 26 will be the closest asteroid 2004 BL86 will get to Earth for at least the next 200 years,' said Don Yeomans, who is retiring as manager of NASA's Near Earth Object Program Office at the Jet Propulsion Laboratory in Pasadena, California, after 16 years in the position. 'And while it poses no threat to Earth for the foreseeable future, it's a relatively close approach by a relatively large asteroid, so it provides us a unique opportunity to observe and learn more.' One way NASA scientists plan to learn more about 2004 BL86 is to observe it with microwaves. NASA's Deep Space Network antenna at Goldstone, California, and the Arecibo Observatory in Puerto Rico will attempt to acquire science data and radar-generated images of the asteroid during the days surrounding its closest approach to Earth. 'When we get our radar data back the day after the flyby, we will have the first detailed images,' said radar astronomer Lance Benner of JPL, the principal investigator for the Goldstone radar observations of the asteroid. 'At present, we know almost nothing about the asteroid, so there are bound to be surprises.' Don't Panic! Nasa says 'While it poses no threat to Earth for the foreseeable future, it's a relatively close approach by a relatively large asteroid, so it provides us a unique opportunity to observe and learn more.' Asteroid 2004 BL86 was initially discovered on Jan. 30, 2004 by a telescope of the Lincoln Near-Earth Asteroid Research (LINEAR) survey in White Sands, New Mexico. The asteroid is expected to be observable to amateur astronomers with small telescopes and strong binoculars. 'I may grab my favorite binoculars and give it a shot myself,' said Yeomans. 'Asteroids are something special. Not only did asteroids provide Earth with the building blocks of life and much of its water, but in the future, they will become valuable resources for mineral ores and other vital natural resources. 'They will also become the fueling stops for humanity as we continue to explore our solar system. There is something about asteroids that makes me want to look up.'                                                                      2004 BL86 will pass about three times the distance of Earth to the moon .\nEstimate that the asteroid is about a third of a mile (0.5 kilometers) in size .\nNasa says it poses no threat to Earth 'for the foreseeable future'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             BAGHDAD, Iraq (CNN) -- Iraq's most powerful Sunni Arab political party on Monday said a U.S. soldier's desecration of the Quran, the Muslim holy book, requires the "severest of punishments," not just an apology and a military reassignment. Maj. Gen. Jeffery Hammond apologizes after a U.S. soldier admitted using the Quran for target practice. The Iraqi Islamic Party, the movement of Iraqi Vice President Tariq al-Hashimi, condemned what it said was a "blatant assault on the sanctities of Muslims all over the world." An American staff sergeant who was a sniper section leader used a Quran for target practice on May 9. The U.S. commander in Baghdad on Saturday issued a formal apology and read a letter of apology from the shooter. The sergeant has been relieved of duty as a section leader "with prejudice," officially reprimanded by his commanding general, dismissed from his regiment and redeployed -- reassigned to the United States. But the Iraqi Islamic Party -- which said it reacted to the news "with deep resentment and indignation" -- wants the "severest of punishments" for the action. "What truly concerns us is the repetition of these crimes that have happened in the past when mosques were destroyed and pages of the Holy Quran were torn and used for disgraceful acts by U.S. soldiers," al-Hashimi said. "I have asked that first this apology be officially documented; second a guarantee from the U.S. military to inflict the maximum possible punishment on this soldier so it would be a deterrent for the rest of the soldiers in the future." A tribal leader said "the criminal act by U.S. forces" took place at a shooting range at the Radhwaniya police station on Baghdad's western outskirts. After the shooters left, an Iraqi policeman found a target marked in the middle of the bullet-riddled Quran.  Read how the soldier could have provoked a crisis . Copies of the pictures of the Quran obtained by CNN show multiple bullet holes and an expletive scrawled on one of its pages. On Saturday, Maj. Gen. Jeffery Hammond, commander of U.S. forces in Baghdad, appeared at an apology ceremony flanked by leaders from Radhwaniya.  Watch as the U.S. formally apologizes » . "I come before you here seeking your forgiveness," Hammond said to tribal leaders and others gathered. "In the most humble manner, I look in your eyes today, and I say please forgive me and my soldiers." Another military official kissed a Quran and presented it as "a humble gift" to the tribal leaders. Hammond also read from the shooter's letter: "I sincerely hope that my actions have not diminished the partnership that our two nations have developed together. ... My actions were shortsighted, very reckless and irresponsible, but in my heart [the actions] were not malicious." Hammond said, "The actions of one soldier were nothing more than criminal behavior. I've come to this land to protect you, to support you -- not to harm you -- and the behavior of this soldier was nothing short of wrong and unacceptable." The soldier reportedly claimed he wasn't aware the book was the Quran, but U.S. officials rejected his assertion. Tribal leaders, dignitaries and local security officials attended the ceremony, while residents carried banners and chanted slogans, including, "Yes, yes to the Quran" and "America out, out."  Watch as villagers protest the Quran incident » . Sheikh Hamadi al-Qirtani, in a speech on behalf of all tribal sheikhs of Radhwaniya, called the shooting "aggression against the entire Islamic world." The Association of Muslim Scholars in Iraq also condemned the shooter's actions and the U.S. military's belated acknowledgment of what happened. "As the Association of Muslim Scholars condemns this heinous crime against God's holy book, the constitution of this nation, a source of pride and dignity," the group's statement said, "they condemned the silence by all those who are part of the occupation's agenda and holds the occupation and the current government fully responsible for this violation and reminds everyone that God preserves his book and he [God] is a great avenger." Iraqi Islamic Party calls Quran incident "blatant assault" on Muslim holy book .\nU.S. soldier used Quran for target practice, military investigation found .\nU.S. commander in Baghdad has issued formal apology .\nSoldier relieved of duty, will be reassigned after sending letter of apology .
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 By . David Kent . Andy Carroll has taken an understandably glum-looking selfie in hospital. The striker took to Instagram to post a snap of himself just as he prepared to go in for ankle surgery. And you can understand the look on his face, after several injury-ravaged seasons with Liverpool and West Ham, Carroll was finally hoping for a clean bill of health as he finally looked to fulfill the promise he showed following his breakthrough at Newcastle.VIDEO Scroll down to see behind the scenes of West Ham's new kit unveiling . Glum: Andy Carroll's pre-operation selfie from hospital . Blow: Andy Carroll is reportedly out for four months after tearing ligaments in his ankle . SEPTEMBER 2012Misses a month after straining a hamstring against FulhamDECEMBER 2012Twists his knee against Man Utd and is out for two monthsMAY 2013Falls awkwardly and damages his heel against Reading and is out for seven months . But, once again, it wasn't to be, as an ankle injury on a tour of New Zealand ruled him out for up to four months. It is then expected that he will not . return to action until December at the earliest, missing the crucial . start of West Ham's campaign. The . England international was unable to play last season until January after . injuring his left heel and made only 16 appearances. That . injury had completely healed and he was gearing up for the start of the . new campaign by heading out to New Zealand for the club's pre-season . tour when this fresh injury struck in training. It is understood to be . completely unrelated to the one he sustained 14 months ago. In . the last game of the 2012/13 season he ruptured the lateral plantar . fascia – the tendons that run through the arch of the foot, connecting . the heel to the toes – in his right foot leaping for a ball against . Reading. When he was thought to be on the mend, the medial one went and he was out for almost six months of the last campaign. Having . been on loan at the club from Liverpool for one season, West Ham . pressed ahead to make the transfer permanent last summer in what Sam . Allardyce described as a 'calculated risk.' Missing: Carroll had to withdrawn from West Ham's pre-season tour in New Zealand . Out: Carroll was originally expected to recover from the injury before the start of next season . The . West Ham manager was initially told that Carroll, who has won nine caps . for England, would only be out until September but he ended up missing . the entire first half of the season. This . was despite owner David Sullivan seeking out a top specialist, Lieven . Maesschalck, and he spent time with the physiotherapist in Belgium to . help his recovery. It will be hugely frustrating for Allardyce that he will have to start a second successive season without his star striker. Carroll . has scored just two goals since he became their record signing and, on . around £80,000 per week, their highest-paid player. He has so far cost . them £1million per Premier League game. Fans . will be hoping that this is the last of his injury problems, with his . contract set to run for another five years until 2019.                                                                                              Carroll takes to Instagram to post selfie ahead of ankle surgery .\nWest Ham star expected to be out for up to four months .\nThe forward has had an injury-ravaged spell since moving from Liverpool .
Los Angeles (CNN) -- Los Angeles has long been a destination for artistic dreamers from Europe: Zsa Zsa Gabor moved to Hollywood from Hungary in the 1940s to act. Warsaw-born Roman Polanski moved to Southern California in the 1960s to direct. Not to mention one ambitious actor named Arnold Schwarzenegger, who arguably has done more to boost California's image as a place receptive to Europeans than any tourism initiative the state might have dreamed up the past 30 years. But for accented aspiring pop stars from the EU and beyond, L.A. hasn't generally been considered the place to launch an international music career. That honor fell to cities such as London and New York. Until now. These days Manhattan is getting the flyover treatment as singers from all over Europe and farther east set their sights on the U.S. market via Hollywood as the new must-conquer gateway to American ears and eyes. Artists such as Estonia's Kerli, Italy's Marco Bosco, t.A.T.u.'s Lena Katina from Russia, Slovakia's TWiiNS and Austria's Fawni are suddenly swarming L.A. with dreams of making it big. Their presence is being felt at small clubs such as the Troubadour (Katina played a solo show at the venue last year) to red carpets (Fawni is now well known to Hollywood event photographers) to purchased billboards on Sunset Boulevard (Bosco recently bought expensive outdoor media to promote himself along the busy, high visibility corridor). "I love being here...Los Angeles is my second home now," says Katina, who is working on her first solo record and now splits her time between L.A. and Moscow. Katina and other singers from Russia and Europe's timing couldn't be better: America has finally started to embrace the increasing globalization of pop music on a scale beyond the occasional super group (see ABBA) or German one-hit wonder (see Nena's "99 Luftballoons") thanks largely to Websites such as YouTube, which has leveled the playing field and cut out past gatekeepers such as MTV. Swedish singer Robyn topped many critical lists in 2010, with Denmark's Medina set to make similar inroads in the United States this year with early adopters in the pop and dance music arenas. But perhaps the most interesting singer ready to make the crossover in 2011 is Estonia's Kerli. "When I first got here, someone told me 'there are no friends in the music business' and I was so hurt," the former winner of a Baltic version of "American Idol" said over coffee at a West Hollywood restaurant last month. "But Los Angeles is an amazing place to live once you find the people that inspire you, and I've found that circle of friends here," the singer said. "We make art together, and we constantly feed off each other." The blonde beauty who looks like a glammed-out Goth version of Lady Gaga (though she and her fans loathe the comparison) and sounds like a hybrid of Bjork, Brandy and Avril Lavigne moved to L.A. around four years ago and has been slowly winning over American fans ever since. Her debut for Island Records, 2008's "Love Is Dead," did fairly well for a new artist, considering Kerli is pushing a sound she herself calls "Bubblegoth." According to Nielsen SoundScan, around 65,000 copies were sold. However, both the singer and her label are thinking bigger this year after buzz surrounding her just-released "Army Of Love" began heating up the internet. It's too soon to tell if mainstream pop radio stations will embrace Kerli in 2011 (her follow-up full length record is expected to see a release by summer), but there are encouraging signs. AOL's popular Popeater blog featured the singer late last year in a campaign worthy of a former "American Idol" star; rolling out her video for the released-in-December "Army Of Love" with video diaries building up to a December 22 premiere. "It's like Euro trash meets angels singing in a choir," Kerli said of "Army Of Love," which continues to draw interest online because of the video, which has a curious mix of swirling melodies set against striking visuals (the clip was shot in Estonia). Adventurous college radio listeners have long been boosters of acts from the Baltic states and other European countries, but mainstream pop fans rarely hear singers such as Kerli on the U.S. pop charts. And while European artists who have "made it" overseas have been buying second homes in the Hollywood Hills for decades, more interesting are the new pop singers living nearby, such as Slovakia's TWiiNS, who are hoping against odds to make a name for themselves in America after a modicum of success elsewhere. The duo, who are identical twins, moved to L.A. last year. They are currently working on their first record for L.A.-based indie label B Records with known U.S. producers including Bryan Todd, who has worked with names such as Jordin Sparks. "We love Los Angeles because of the weather, nice people and shopping, but the main reason why we moved is our work," Veronika Nízlové said via email last month. Her twin sister Daniela added the transition has not been easy. "It's really hard to come from Eastern Europe and try to achieve success in America. We are not native speakers, we are not Americans...it's a little disadvantage to us, but our big advantage is that we are twins. " TWiiNS, which scored a minor European hit last year with a remake of Sabrina Salerno's 1980s hit "Boys," seem already savvy to the city's sometimes cruel undercurrent. In their forthcoming single "Welcome to Hollywood," the pair warn other aspiring singers that not everything is sunshine and smiles in the City of Angels. "The song is not about the perfect Hollywood," Veronika said, "It's about people with their 'friendly faces' which is far from being true. You should have open eyes and be careful whom you trust. Hollywood and all that goes along with it really has two sides to it." Sage advice from Los Angeles' latest émigrés, who sing on their soon-to-be released single: "Welcome to Hollywood/Boy you better give it up before it gets you down/Welcome to Hollywood/Just got to get a grip of how to get around."                                                      Pop stars from all over Europe are setting their sights on the U.S. market .\nEstonia's Kerli, Italy's Marco Bosco and Austria's Fawni want to make it big in L.A.\nLos Angeles has long been a destination for European artists seeking fame .
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             London (CNN) -- Few shows can claim such an audience. As the dramatic spectacle of the Olympic Games opening ceremony in London unfolded Friday night, organizers estimated a billion people around the world would be glued to their television sets to see it, either live or delayed by a few hours. Tens of thousands more were lucky enough to have a seat inside the Olympic Stadium, the centerpiece of Olympic Park in east London. Dubbed "Isles of Wonder," it was quite a show -- fast-paced and diverse, with everything from people dancing in period costumes to a pseudo house party. It featured tributes to the British countryside and the Industrial Revolution (complete with smokestacks emerging from the ground), showcased the "Chariots of Fire" soundtrack as well as pop music staples from the Beatles to Amy Winehouse, and even comedy bits featuring Mr. Bean (Rowan Atkinson), actor Daniel Craig and even Queen Elizabeth II herself. The queen was among the more than 80 heads of state attending the event, which sets the scene for the Games to come. London cheers Olympic torch as opening day nears . The organizers of the London Games were well aware they had a tough act to follow after the Beijing extravaganza four years ago, which featured thousands of drummers, acrobats, martial artists and dancers performing under a light display at the soaring "Bird's Nest" Stadium. Some details of the Â£27 million show were released in advance, but many more remained closely guarded secrets. A Twitter hashtag, #savethesurprise, started by Olympic organizers to appeal to those in the know not to spoil the show for others, had been embraced by many, although not all. The 10 coolest tech stories of the 2012 London Olympics . Giant screens also displayed the message within the stadium during the two rehearsals. Those who opted not to play along incurred the social-media wrath of many who did want to "save the surprise." The show, masterminded by artistic director Danny Boyle, best known for the Oscar-winning film "Slumdog Millionaire," drew part of its inspiration from Shakespeare's "The Tempest." It began at 9 p.m. (4 p.m. ET) with the tolling of the largest harmonically tuned bell in Europe, cast by the nearby Whitechapel Foundry, which also produced Big Ben and the Liberty Bell. The sound echoed the peals of bells that rang out across the country for three minutes Friday morning, Big Ben among them, to set the nation's Olympic spirit racing. The show's opening scene -- dubbed "Green and Pleasant," after a line from a poem by William Blake -- then unfurled, presenting an idyllic view of the British countryside. The elaborate set comprised rolling hills, fields and rivers, complete with picnicking families, sport being played on a village green and real farmyard animals: ducks, geese, 12 horses, three cows and 70 sheep, plus three sheepdogs to keep them in line. Chasing the gold-dust of 'Brand Olympics' The flower of each of the four countries that make up the United Kingdom also were represented -- the rose of England, the Scottish thistle, Welsh daffodil and flax from Northern Ireland. Boyle also lined up fake clouds to shade his pastoral scene. Other set pieces paid tribute to Britain's National Health Service, children's literature (showcasing characters from the evil Lord Voldemort of Harry Pottery infamy to magical nanny Mary Poppins) and popular music. It was all a mammoth production involving not just Boyle and his technical crew, but also a cast of 10,000 adult volunteers and 900 local schoolchildren. Then there was a heavy-duty flying system, nearly 13,000 props and an array of technological wizardry. After these performances, the athletes -- who, after all, are the real stars of the Olympic show -- entered the stadium, team by team in alphabetical order, apart from Greece, which enters first in recognition of its status as the birthplace of the Games, and Great Britain, entering last, the position reserved for the host nation. Why gritty East End is London's gold standard . After speeches from Olympic officials -- including Sebastian Coe, head of the London organizing committee and himself a former gold medalist -- the queen officially declared the Games open and the Olympic flag was later hoisted above the stadium, where it will fly throughout the competition. The grand finale saw five-time gold medalist rower Steve Redgrave carry the torch into the stadium, where he handed the flame off to seven promising young athletes. They then lit parts of the large cauldron, triggering a chain of events that culminated in small flames converging in the sky above the stadium. This capped the torch's 70-day, 8,000-mile relay around the United Kingdom -- weeks that had been marked with anticipation. Is the Olympics worth more than Google? Spain suffers shock defeat to Japan . CNN's Chris Murphy contributed to this report.   NEW: Young athletes light the Olympic cauldron after the queen opens the games .\nPerformances pay tribute to British history, literature and music .\nThe Greek delegation leads the parade of athletes into the stadium .\nOrganizers had sought beforehand to keep the event's details secret .

Dataset sizes:
train: 287113 samples
validation: 13368 samples
test: 11490 samples

In [2]:

Copied!





import matplotlib.pyplot as plt
import seaborn as sns

# Compute article lengths (in words)
df_train["article_length"] = df_train["article"].apply(lambda x: len(x.split()))

# Plot distribution
plt.figure(figsize=(10, 6))
sns.histplot(df_train["article_length"], bins=50, kde=True, color="skyblue")
plt.title("Article Length Distribution (in Words)")
plt.xlabel("Number of Words")
plt.ylabel("Number of Articles")
plt.grid(True)
plt.show()

# Print some basic stats
print("\n📊 Article Length Statistics:")
print(df_train["article_length"].describe())
import matplotlib.pyplot as plt
import seaborn as sns

# Compute article lengths (in words)
df_train["article_length"] = df_train["article"].apply(lambda x: len(x.split()))

# Plot distribution
plt.figure(figsize=(10, 6))
sns.histplot(df_train["article_length"], bins=50, kde=True, color="skyblue")
plt.title("Article Length Distribution (in Words)")
plt.xlabel("Number of Words")
plt.ylabel("Number of Articles")
plt.grid(True)
plt.show()

# Print some basic stats
print("\n📊 Article Length Statistics:")
print(df_train["article_length"].describe())

No description has been provided for this image

📊 Article Length Statistics:
count    287113.000000
mean        691.870326
std         336.500292
min           8.000000
25%         443.000000
50%         632.000000
75%         877.000000
max        2347.000000
Name: article_length, dtype: float64

📊 Article Length Insights & Chunking Strategy¶

✅ We analyzed the distribution of article lengths in the CNN/DailyMail dataset.

Average length: ~690 words,
Range: 8 to 2300+ words,
Most articles: ~500–800 words (peak of distribution).

🔍 Why This Matters?¶

Chunking is crucial in RAG pipelines:

If we index articles as-is, long articles may exceed LLM context limits or have too much noise.
Splitting into smaller chunks (e.g., 200–300 words) improves retrieval granularity and relevance.

💡 Next Steps¶

To reduce compute needs (especially for embedding generation), we’ll randomly sample 5000 articles for this demo.
These samples will be chunked and embedded with sentence-transformers to build our retrieval index.

Let’s start by sampling the data and preparing chunks for the RAG pipeline!

In [3]:

Copied!





import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Randomly sample 5,000 articles
sampled_articles = df_train.sample(5000, random_state=42).reset_index(drop=True)

print(f"Sampled {len(sampled_articles)} articles for embedding and RAG indexing.")

# Example chunking function (split article into ~200-word chunks)
def chunk_article(article, chunk_size=200):
    words = article.split()
    chunks = [
        " ".join(words[i: i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return chunks

# Apply chunking to all articles
sampled_articles["chunks"] = sampled_articles["article"].apply(chunk_article)

# Explode so each row is a chunk
df_chunks = sampled_articles.explode("chunks").reset_index(drop=True)
df_chunks = df_chunks.rename(columns={"chunks": "chunk"})

print(f"Total number of chunks created: {len(df_chunks)}")
print(df_chunks.head(3))
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Randomly sample 5,000 articles
sampled_articles = df_train.sample(5000, random_state=42).reset_index(drop=True)

print(f"Sampled {len(sampled_articles)} articles for embedding and RAG indexing.")

# Example chunking function (split article into ~200-word chunks)
def chunk_article(article, chunk_size=200):
    words = article.split()
    chunks = [
        " ".join(words[i: i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return chunks

# Apply chunking to all articles
sampled_articles["chunks"] = sampled_articles["article"].apply(chunk_article)

# Explode so each row is a chunk
df_chunks = sampled_articles.explode("chunks").reset_index(drop=True)
df_chunks = df_chunks.rename(columns={"chunks": "chunk"})

print(f"Total number of chunks created: {len(df_chunks)}")
print(df_chunks.head(3))

Sampled 5000 articles for embedding and RAG indexing.
Total number of chunks created: 19844
                                             article  \
0  Nasa has warned of an impending asteroid pass ...   
1  Nasa has warned of an impending asteroid pass ...   
2  Nasa has warned of an impending asteroid pass ...   

                                          highlights  \
0  2004 BL86 will pass about three times the dist...   
1  2004 BL86 will pass about three times the dist...   
2  2004 BL86 will pass about three times the dist...   

                                         id  article_length  \
0  6ccb7278e86893ad3609d30ecb5c9ea902fb9527             524   
1  6ccb7278e86893ad3609d30ecb5c9ea902fb9527             524   
2  6ccb7278e86893ad3609d30ecb5c9ea902fb9527             524   

                                               chunk  
0  Nasa has warned of an impending asteroid pass ...  
1  retiring as manager of NASA's Near Earth Objec...  
2  more.' Asteroid 2004 BL86 was initially discov...

🏗️ Step-by-Step: Set Up a Qdrant Cluster¶

Before you can index and retrieve data in this notebook, you need to set up your Qdrant vector database cluster.

1️⃣ Create a Free Qdrant Account¶

✅ Go to https://cloud.qdrant.io/ and create an account.
✅ Choose the Free Tier (plenty for small experiments like this).

2️⃣ Create a New Cluster¶

Click on "Create Cluster".
Choose:
- Region: 🇪🇺 Europe (EU) (to keep data residency within the EU for privacy compliance).
- Cluster size: Small (free tier).
Name your cluster, e.g., rag-demo-cluster.

3️⃣ Keep Your Cluster Info¶

Once created, you’ll get:

🔑 What	📋 Example / Note
🌐 Cluster URL	`https://xxxx-xxxx-xxxx.qdrant.eu`
🔑 API Key	A long string of letters & numbers (keep secret!)

These are your credentials to connect from Python (or any SDK).

4️⃣ Save Credentials in `.env` File¶

To keep them safe and avoid accidentally sharing them in notebooks:

echo "QDRANT_URL=https://xxxx-xxxx-xxxx.qdrant.eu" >> .env
echo "QDRANT_API_KEY=your-long-api-key" >> .env

✅ Tip: Use .gitignore to never upload .env to GitHub!

5️⃣ Use Credentials in Python¶

In your notebook:

from dotenv import load_dotenv
import os

load_dotenv()
QDRANT_URL = os.getenv("QDRANT_URL")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

Let's index the data and load them on your QDRANT cluster.

In [4]:

Copied!





from dotenv import load_dotenv
import os
from qdrant_client import QdrantClient, models as qdrant_models

load_dotenv()
QDRANT_URL = os.getenv("QDRANT_URL")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
from dotenv import load_dotenv
import os
from qdrant_client import QdrantClient, models as qdrant_models

load_dotenv()
QDRANT_URL = os.getenv("QDRANT_URL")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

🏗️ Step-by-Step: Embedding and Indexing in Qdrant¶

This Python cell covers three key stages:

1️⃣ Compute vector embeddings for all text chunks.
2️⃣ Create a Qdrant collection to store these embeddings.
3️⃣ Upsert (upload) data with metadata for retrieval.

1️⃣ Compute Chunk Embeddings¶

embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

Loads the MiniLM model from sentence-transformers – a fast, lightweight model for semantic embeddings.

chunk_texts = df_chunks["chunk"].tolist()
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, batch_size=64)

chunk_texts: Extracts the text of each chunk.
encode: Converts texts to dense vector embeddings (numerical representations of meaning).
batch_size=64: Processes 64 chunks at a time (GPU-friendly!).
show_progress_bar=True: Shows progress to help track long jobs.

2️⃣ Create a Qdrant Collection¶

collection_name = "cnn_chunks"
qdrant.recreate_collection(
    collection_name=collection_name,
    vectors_config=qdrant_models.VectorParams(
        size=embeddings.shape[1],
        distance=qdrant_models.Distance.COSINE
    )
)

collection_name: Name of our database collection (cnn_chunks).
recreate_collection: Deletes any existing collection with this name and creates a new one.
vectors_config:
- size: Dimensionality of our embeddings (e.g., 384 for MiniLM).
- distance: COSINE similarity metric – ideal for semantic similarity!

3️⃣ Prepare Data for Upsert (Insert)¶

points = []
for idx, (embedding, chunk) in enumerate(zip(embeddings, chunk_texts)):
    points.append(
        qdrant_models.PointStruct(
            id=idx,
            vector=embedding.tolist(),
            payload={
                "chunk": chunk,
                "source_article_id": int(df_chunks.iloc[idx]["id"])  # optional: track source
            }
        )
    )

Loops through all (embedding, chunk) pairs:
- id: Unique integer ID for each chunk (used by Qdrant).
- vector: The actual embedding (converted to a list).
- payload: Additional info (the chunk text itself, plus optional source info).

✅ Why store payload? It lets us trace retrieved chunks back to the source article!

4️⃣ Upload Data to Qdrant¶

qdrant.upsert(collection_name=collection_name, points=points)

upsert: Insert (or update) the points in the Qdrant collection.

✅ Next: Let’s run a few retrieval queries and compare LLM-only vs. RAG answers!

In [9]:

Copied!





from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Compute embeddings for chunks
print("🔍 Computing embeddings for all chunks (this might take a bit)...")
chunk_texts = df_chunks["chunk"].tolist()
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, batch_size=64)

# Create a Qdrant collection
collection_name = "cnn_chunks"
qdrant.recreate_collection(
    collection_name=collection_name,
    vectors_config=qdrant_models.VectorParams(
        size=embeddings.shape[1],  # Dimensionality of embedding model
        distance=qdrant_models.Distance.COSINE
    )
)

# Prepare data for upsert
points = []
for idx, (embedding, chunk) in enumerate(zip(embeddings, chunk_texts)):
    points.append(
        qdrant_models.PointStruct(
            id=idx,
            vector=embedding.tolist(),
            payload={
                "chunk": chunk,
                "source_article_id": df_chunks.iloc[idx]["id"]
            }
        )
    )

# Upsert into Qdrant
print("💾 Inserting embeddings into Qdrant...")
batch_size = 250
total_points = len(points)

for i in range(0, total_points, batch_size):
    batch_points = points[i:i + batch_size]
    batch_end = min(i + batch_size, total_points)
    
    print(f"   Uploading batch {i//batch_size + 1}/{(total_points + batch_size - 1)//batch_size} "
          f"(points {i+1}-{batch_end})")
    
    qdrant.upsert(collection_name=collection_name, points=batch_points)



print("\n✅ All embeddings uploaded to Qdrant!")
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Compute embeddings for chunks
print("🔍 Computing embeddings for all chunks (this might take a bit)...")
chunk_texts = df_chunks["chunk"].tolist()
embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True, batch_size=64)

# Create a Qdrant collection
collection_name = "cnn_chunks"
qdrant.recreate_collection(
    collection_name=collection_name,
    vectors_config=qdrant_models.VectorParams(
        size=embeddings.shape[1],  # Dimensionality of embedding model
        distance=qdrant_models.Distance.COSINE
    )
)

# Prepare data for upsert
points = []
for idx, (embedding, chunk) in enumerate(zip(embeddings, chunk_texts)):
    points.append(
        qdrant_models.PointStruct(
            id=idx,
            vector=embedding.tolist(),
            payload={
                "chunk": chunk,
                "source_article_id": df_chunks.iloc[idx]["id"]
            }
        )
    )

# Upsert into Qdrant
print("💾 Inserting embeddings into Qdrant...")
batch_size = 250
total_points = len(points)

for i in range(0, total_points, batch_size):
    batch_points = points[i:i + batch_size]
    batch_end = min(i + batch_size, total_points)
    
    print(f"   Uploading batch {i//batch_size + 1}/{(total_points + batch_size - 1)//batch_size} "
          f"(points {i+1}-{batch_end})")
    
    qdrant.upsert(collection_name=collection_name, points=batch_points)



print("\n✅ All embeddings uploaded to Qdrant!")

🔍 Computing embeddings for all chunks (this might take a bit)...

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/1248717893.py:12: DeprecationWarning: `recreate_collection` method is deprecated and will be removed in the future. Use `collection_exists` to check collection existence and `create_collection` instead.
  qdrant.recreate_collection(

💾 Inserting embeddings into Qdrant...
   Uploading batch 1/80 (points 1-250)
   Uploading batch 2/80 (points 251-500)
   Uploading batch 3/80 (points 501-750)
   Uploading batch 4/80 (points 751-1000)
   Uploading batch 5/80 (points 1001-1250)
   Uploading batch 6/80 (points 1251-1500)
   Uploading batch 7/80 (points 1501-1750)
   Uploading batch 8/80 (points 1751-2000)
   Uploading batch 9/80 (points 2001-2250)
   Uploading batch 10/80 (points 2251-2500)
   Uploading batch 11/80 (points 2501-2750)
   Uploading batch 12/80 (points 2751-3000)
   Uploading batch 13/80 (points 3001-3250)
   Uploading batch 14/80 (points 3251-3500)
   Uploading batch 15/80 (points 3501-3750)
   Uploading batch 16/80 (points 3751-4000)
   Uploading batch 17/80 (points 4001-4250)
   Uploading batch 18/80 (points 4251-4500)
   Uploading batch 19/80 (points 4501-4750)
   Uploading batch 20/80 (points 4751-5000)
   Uploading batch 21/80 (points 5001-5250)
   Uploading batch 22/80 (points 5251-5500)
   Uploading batch 23/80 (points 5501-5750)
   Uploading batch 24/80 (points 5751-6000)
   Uploading batch 25/80 (points 6001-6250)
   Uploading batch 26/80 (points 6251-6500)
   Uploading batch 27/80 (points 6501-6750)
   Uploading batch 28/80 (points 6751-7000)
   Uploading batch 29/80 (points 7001-7250)
   Uploading batch 30/80 (points 7251-7500)
   Uploading batch 31/80 (points 7501-7750)
   Uploading batch 32/80 (points 7751-8000)
   Uploading batch 33/80 (points 8001-8250)
   Uploading batch 34/80 (points 8251-8500)
   Uploading batch 35/80 (points 8501-8750)
   Uploading batch 36/80 (points 8751-9000)
   Uploading batch 37/80 (points 9001-9250)
   Uploading batch 38/80 (points 9251-9500)
   Uploading batch 39/80 (points 9501-9750)
   Uploading batch 40/80 (points 9751-10000)
   Uploading batch 41/80 (points 10001-10250)
   Uploading batch 42/80 (points 10251-10500)
   Uploading batch 43/80 (points 10501-10750)
   Uploading batch 44/80 (points 10751-11000)
   Uploading batch 45/80 (points 11001-11250)
   Uploading batch 46/80 (points 11251-11500)
   Uploading batch 47/80 (points 11501-11750)
   Uploading batch 48/80 (points 11751-12000)
   Uploading batch 49/80 (points 12001-12250)
   Uploading batch 50/80 (points 12251-12500)
   Uploading batch 51/80 (points 12501-12750)
   Uploading batch 52/80 (points 12751-13000)
   Uploading batch 53/80 (points 13001-13250)
   Uploading batch 54/80 (points 13251-13500)
   Uploading batch 55/80 (points 13501-13750)
   Uploading batch 56/80 (points 13751-14000)
   Uploading batch 57/80 (points 14001-14250)
   Uploading batch 58/80 (points 14251-14500)
   Uploading batch 59/80 (points 14501-14750)
   Uploading batch 60/80 (points 14751-15000)
   Uploading batch 61/80 (points 15001-15250)
   Uploading batch 62/80 (points 15251-15500)
   Uploading batch 63/80 (points 15501-15750)
   Uploading batch 64/80 (points 15751-16000)
   Uploading batch 65/80 (points 16001-16250)
   Uploading batch 66/80 (points 16251-16500)
   Uploading batch 67/80 (points 16501-16750)
   Uploading batch 68/80 (points 16751-17000)
   Uploading batch 69/80 (points 17001-17250)
   Uploading batch 70/80 (points 17251-17500)
   Uploading batch 71/80 (points 17501-17750)
   Uploading batch 72/80 (points 17751-18000)
   Uploading batch 73/80 (points 18001-18250)
   Uploading batch 74/80 (points 18251-18500)
   Uploading batch 75/80 (points 18501-18750)
   Uploading batch 76/80 (points 18751-19000)
   Uploading batch 77/80 (points 19001-19250)
   Uploading batch 78/80 (points 19251-19500)
   Uploading batch 79/80 (points 19501-19750)
   Uploading batch 80/80 (points 19751-19844)

✅ All embeddings uploaded to Qdrant!

🔍 Retriever Step – Finding Relevant Chunks for a Query¶

In a Retrieval-Augmented Generation (RAG) pipeline, the retriever is the part that:

✅ Takes a user question (like “What happened in the 2020 Olympics?”),
✅ Encodes it into an embedding (using the same sentence-transformer model as the document chunks),
✅ Searches for the most similar text chunks in our Qdrant index.

🧩 Why It Matters¶

The retriever is the foundation of RAG:

🧠 It finds the most relevant evidence in the knowledge base.
📚 It ensures that the LLM has factual grounding for its final answer.
⚡️ Efficient retrievers (like Qdrant) make this process super fast, even with thousands of documents.

⚙️ How It Works in Our Notebook¶

1️⃣ Encode the user query to a dense embedding (same as we did for article chunks).
2️⃣ Send this embedding to Qdrant’s search API.
3️⃣ Get the top-k chunks ranked by cosine similarity.

✅ Let’s build this retriever function in Python!

In [16]:

Copied!





def retrieve_top_k_chunks(question, k=10):
    """
    Retrieve the top-k relevant chunks for a user question from Qdrant.
    """
    # 1️⃣ Encode the question
    question_embedding = embedding_model.encode(question).tolist()

    # 2️⃣ Query Qdrant
    search_results = qdrant.search(
        collection_name=collection_name,
        query_vector=question_embedding,
        limit=k
    )

    # 3️⃣ Extract the retrieved chunks
    retrieved_chunks = []
    for result in search_results:
        retrieved_chunks.append({
            "chunk": result.payload["chunk"],
            "score": result.score
        })

    return retrieved_chunks

# Test the retriever with a sample question
test_question = "What reasons did George Pataki give for stepping down from public office?"
top_chunks = retrieve_top_k_chunks(test_question, k=10)

print("\n📝 Top Retrieved Chunks:")
for idx, chunk_info in enumerate(top_chunks, 1):
    print(f"\nChunk {idx} (score: {chunk_info['score']:.4f}):\n{chunk_info['chunk']}")
def retrieve_top_k_chunks(question, k=10):
    """
    Retrieve the top-k relevant chunks for a user question from Qdrant.
    """
    # 1️⃣ Encode the question
    question_embedding = embedding_model.encode(question).tolist()

    # 2️⃣ Query Qdrant
    search_results = qdrant.search(
        collection_name=collection_name,
        query_vector=question_embedding,
        limit=k
    )

    # 3️⃣ Extract the retrieved chunks
    retrieved_chunks = []
    for result in search_results:
        retrieved_chunks.append({
            "chunk": result.payload["chunk"],
            "score": result.score
        })

    return retrieved_chunks

# Test the retriever with a sample question
test_question = "What reasons did George Pataki give for stepping down from public office?"
top_chunks = retrieve_top_k_chunks(test_question, k=10)

print("\n📝 Top Retrieved Chunks:")
for idx, chunk_info in enumerate(top_chunks, 1):
    print(f"\nChunk {idx} (score: {chunk_info['score']:.4f}):\n{chunk_info['chunk']}")

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

📝 Top Retrieved Chunks:

Chunk 1 (score: 0.6046):
(CNN) -- Choosing to step down from a top job can be an extraordinary decision, whether the person is a pontiff or a politician. But George Pataki, former governor of New York, says making the switch from public figure to John Q. Public wasn't difficult for him. "I made up my mind that I was never going to let my public title become my personal identity," he says. He embraced what he calls a sense of normalcy after he left office, going to movies and basketball games. A year or two after he left office, Pataki went to Madison Square Garden with a group of friends to see the Knicks play. And he wanted to stand in line to get himself a hot dog -- something elected officials tend not to do. "I loved it," he says. Even though fellow fans recognized him and offered to let him jump the queue, Pataki waited in line for his hot dog with mustard and sauerkraut. "I felt really good about the fact that it was just comfortable for me to be on line with the rest," he says. Pataki decided in the middle of his third term in office that he would

Chunk 2 (score: 0.5574):
not seek a fourth term. He left office in 2006, after 12 years as governor. Pope's resignation a new angle to a tough news beat . "I had no doubts that this was the right decision for me, for my family, for the team that had worked so hard with me, and for the state," he says. Pataki now practices law at Chadbourne & Parke in New York, where he focuses on energy and environmental issues. He has enjoyed his return to the private sector. "The transition for me was really not that difficult, to be perfectly honest," says Pataki. "But it was time." Deciding when it's time to leave can be tough, especially when top jobs are hard to find. In the corporate world, stepping down from a leadership role by choice is uncommon. "It's almost unheard of," says Patricia Cook, CEO of the executive search firm Cook & Company. "It's a very unusual event." Experts in the business world wonder why an executive would relinquish a leadership role. Cook says she can't think of an instance when a CEO stepped down because he or she didn't feel up to the job. But health concerns can change the game.

Chunk 3 (score: 0.4620):
the top of their game. "Effective leadership is all about being able to execute brilliantly," says Battley. Whether the person is in charge of a Fortune 500 company, or the world's 1.2 billion Catholics, leadership is exhausting. Since leaving office, Pataki and his wife spend as much time as they can at their farm in the Adirondack Mountains. He's still playing sports, like basketball, and watching them; he calls himself a long-suffering Jets fan. And he still appears on the political stage. He says: "I do want to try to help out with advancing policies that are right for the future of the country." As for the pope, a Vatican spokesman says he will likely retire to a monastery, where he will spend his time focused on prayer and reflection.

Chunk 4 (score: 0.4039):
of the video, Rice unloads a homophobic rant against one of his players, saying: 'You f***ing fairy! You're a f***ing f****t!' 'I commend President Barchi for his . decisive leadership in coming to an agreement with Mr. Pernetti to have . the Athletic Department of Rutgers University come under new . leadership,' he said. 'This entire incident was regrettable . and while it has damaged the reputation of our state University, we need . to move forward now on a number of fronts which provide great . opportunities for Rutgers' future.' Pernetti said in his resignation . letter to Barchi that he has 'spent a great deal of time reflecting on . the events which led to today. As you know, my first instincts when I . saw the videotape of Coach Rice's behavior was to fire him immediately.' 'However, Rutgers decided to follow a process involving university lawyers, human resources professionals, and outside counsel. 'Following review of the independent . investigative report, the consensus was that university policy would not . justify dismissal. I have admitted my role in, and regret for, that . decision, and wish that I had the opportunity to go back and override it .

Chunk 5 (score: 0.3963):
when Eliot Spitzer resigned in 2008 over a prostitution scandal, has abandoned his campaign for election to a full term, saying it was not the "latest distraction but an accumulation" of obstacles behind his decision. CNN's Mark J. Norman contributed to this report.

Chunk 6 (score: 0.3951):
appears to be safe. 'At the end of the day, he has to run . this place, day in and day out,' Ralph Izzo, chairman of the school's . board of governors, said. 'And I think he is the right person to run this place for many years to come. 'Dr. Barchi was brought on here eight . months ago with two primary objectives: No. 1 was to build a strategic . plan for this university for 10 years, going forward, to lead us to . academic success and academic greatness; and No. 2, an enormous . challenge of integrating a medical school with this university. 'Being on the job two months, hearing . from a general counsel and the athletic director that there was a . serious problem, I think he did the right thing by acquiescing to that . advice at the time.' Governor Chris Christie issued a . statement Friday calling Pernetti's resignation 'appropriate and . necessary given the events of the past six months. Axed: Former Rutgers coach Mike Rice was fired on April 3 after a video emerged of him shoving players and berating them with insults and gay slurs . Slurs: In one portion

Chunk 7 (score: 0.3891):
whose husband, Michael Patrick, was a bond trader who worked in the World Trade Center. "We'll never forget what happened here." "I'm proud today," she said. Former President George W. Bush, who was in office at the time of the 9/11 attacks, did not appear alongside his successor during Thursday's visit to New York. A spokesman for Bush said he turned down an invitation to attend, citing his desire to remain out of the public spotlight. The image of the former president beside firefighters at ground zero once became an emblem of America's resolve in the war on terror, a term the Obama administration has since tried to distance itself from. Meanwhile, Obama's visit to the hallowed site nearly 10 years later left many New Yorkers eager to witness the closing chapter of a man once considered the world's most wanted terrorist.

Chunk 8 (score: 0.3850):
(CNN) -- As New Jersey Gov. Chris Christie fights back against the biggest political controversy of his career, he's under fire, as expected, from opportunistic attacks from the left. Begala: Three reasons bridge scandal will stick . Cupp: Christie apology hits all the right notes . But there are plenty within his own party who may also be pleased to see the tough-talking Republican governor get a bit of a comeuppance. The party's conservative base has never warmed to Christie. And he angered other Republicans with his 2012 Republican National Convention speech that was more about him than the party's nominee Mitt Romney. And some will never forgive him for his public embrace of President Barack Obama who was surveying damage in New Jersey from Superstorm Sandy just days before the presidential election. Archives: Christie, Obama set to meet again on Jersey Shore . Christie took the first steps Thursday toward rehabilitation -- apologizing profusely and announcing that he had fired two close aides connected the closing of access lanes to the George Washington Bridge -- the nation's busiest -- to punish the mayor of Fort Lee, New Jersey, for not endorsing him in his re-election bid last year. The

Chunk 9 (score: 0.3804):
plurality of forces working together. I'm sure that the interest of the public, in general, all over America had something, a great deal to do with it." In fact, Atlanta Mayor William Hartsfield was working to negotiate King's release from incarceration, which began with his arrest during a protest eight days earlier, according to Taylor Branch's historical account in his book, "Parting the Waters." "Now, it is true that Sen. Kennedy did take a specific step," King said. "He was in contact with officials in Georgia during my arrest and he called my wife, made a personal call and expressed his concern and said to her that he was working and trying to do something to make my release possible." John F. Kennedy made the call to King's wife at the urging of his brother-in-law Sargent Shriver, but out of the presence of campaigns aides who were concerned it could cost him southern support in the election 13 days away, Branch wrote. A special class with MLK . Robert Kennedy, who was initially upset when he found out about the call, reversed himself later that day and placed his own call to the judge, Branch wrote. King told the interviewer

Chunk 10 (score: 0.3791):
its own. After nearly 7 years, our family should not be struggling to get through each day without this wonderful, caring, man that we love so much." Officials contacted by CNN on Thursday declined to comment on any alleged ties between Levinson and the U.S. government. "We have no comment on any purported affiliation between Mr. Levinson and the U.S. Government," CIA spokesman Chris White said. "The U.S. Government remains committed to bringing him home safely to his family." National Security Council spokeswoman Caitlin Hayden criticized the AP for publishing the story and said it "does nothing to further the cause of bringing him home." "Without commenting on any purported affiliation between Mr. Levinson and the U.S. government, the White House and others in the U.S. Government strongly urged the AP not to run this story out of concern for Mr. Levinson's life," she said. "We regret that the AP would choose to run a story that does nothing to further the cause of bringing him home. The investigation into Mr. Levinson's disappearance continues, and we all remain committed to finding him and bringing him home safely to his family." Other detained Americans . AP: 'One of the biggest scandals

📝 Comments on Retrieved Chunks¶

✅ For Question: George Pataki’s reasons for stepping down and post-office life The top 3 chunks (scores 0.60–0.46) directly address:

1️⃣ Pataki’s mindset for stepping down (“I was never going to let my public title become my personal identity,” embracing normalcy, etc.). 2️⃣ His decision not to seek a fourth term and smooth transition to private life. 3️⃣ Details about his new activities (farming, basketball, law practice).

🟢 Great retrieval! The top chunks are directly relevant, well-aligned with the question, and summarize the transition in his career.

In [17]:

Copied!





# Test the retriever with a sample question
test_question = "Why did Kaci Hickox challenge the quarantine order, and what are the modern alternatives to quarantine?"
top_chunks = retrieve_top_k_chunks(test_question, k=10)

print("\n📝 Top Retrieved Chunks:")
for idx, chunk_info in enumerate(top_chunks, 1):
    print(f"\nChunk {idx} (score: {chunk_info['score']:.4f}):\n{chunk_info['chunk']}")
# Test the retriever with a sample question
test_question = "Why did Kaci Hickox challenge the quarantine order, and what are the modern alternatives to quarantine?"
top_chunks = retrieve_top_k_chunks(test_question, k=10)

print("\n📝 Top Retrieved Chunks:")
for idx, chunk_info in enumerate(top_chunks, 1):
    print(f"\nChunk {idx} (score: {chunk_info['score']:.4f}):\n{chunk_info['chunk']}")

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

📝 Top Retrieved Chunks:

Chunk 1 (score: 0.6005):
which has killed some 5,000 people there. New rules: New York Governor Andrew Cuomo, left, listens as New Jersey Governor Chris Christie talks at a news conference on Friday. The governors announced a mandatory quarantine for 'high risk' people . She wrote: 'I had spent a month watching children die, alone. I had witnessed human tragedy unfold before my eyes. 'I had tried to help when much of the world has looked on and done nothing... I sat alone in the isolation tent and thought of many colleagues who will return home to America and face the same ordeal. Will they be made to feel like criminals and prisoners?' She continued: 'The epidemic continues to ravage West Africa... We need more health care workers to help fight the epidemic in West Africa. The U.S. must treat returning health care workers with dignity and humanity.' Hickox is the first person to fall under the new quarantine rules, which cover the states of New York and New Jersey, and were announced yesterday by New Jersey Governor Chirs Christie and his New York counterpart Andrew Cuomo. Any person traveling from the three West African nations who had contact with infected, or possibly infected,

Chunk 2 (score: 0.5927):
A Maine nurse who battled politicians over her quarantine after she returned from treating Ebola patients in West Africa said she will continue speaking out on behalf of public health workers. Monday marks the 21st day since Kaci Hickox's last exposure to an Ebola patient, a ten-year-old girl who suffered seizures before dying alone without family. On Tuesday, Hickox will no longer require daily monitoring for Ebola symptoms, and said she looks forward to stepping out her front door 'like normal people.' Free to go: On Tuesday, Kaci Hickox will no longer require daily monitoring for Ebola symptoms, and said she looks forward to stepping out her front door 'like normal people' But the Texas native said she won't back away from the debate over treatment of health care workers. 'In the past, a quarantine was something that was considered very extreme. I'm concerned about how lightly we're taking this concept today,' said Hickox, who defied state-ordered quarantine attempts in New Jersey and Maine. 'I'm concerned that the wrong people are leading the debate and making the decisions.' She said the U.S. needs a public education campaign to better explain the virus that has killed nearly 5,000 in Liberia, Sierra

Chunk 3 (score: 0.5631):
(CNN) -- Last week, Kaci Hickox, the nurse isolated and tested for Ebola in New Jersey when she returned from working with patients in Sierra Leone, came to a settlement in her legal battle with Maine over a forced quarantine in that state. A District Court judge ordered that she cooperate with "direct active monitoring," coordinate her movements with public health authorities and immediately report any symptoms but didn't otherwise restrict her movement. But was this the correct decision for the time -- based on medicine, scientific knowledge and best public health practices? The panic response in much of the United States, as health workers and others have returned from Ebola-stricken West Africa, has pitted public officials against doctors against whole communities over the issue of quarantine. Who is right? What is best? Medical quarantines are as old as "the plague." They were used in the 14th century when ships arriving in Venice from plague-infected ports were forced to lay at anchor for quaranta giorni -- Italian for 40 days -- before landing. Quarantines quickly became the default mechanism for controlling outbreaks of untreatable diseases. But are these measures still needed, effective and appropriate in the 21st century? More to

Chunk 4 (score: 0.5417):
Quarantined: Kaci Kickox was put in compulsory isolation when she came back to the U.S. from Sierra Leone . A nurse put under mandatory quarantine at a U.S. airport when she came back from fighting Ebola has criticized new measures to contain the disease, saying she has been treated like a criminal. Kaci Hickox, a nurse with Doctors Without Borders, arrived at Liberty Newark Airport in New Jersey yesterday after returning from an assignment in Sierra Leone, one of the West African nations hardest-hit by the outbreak. But she has said that when she arrived she met a scene of chaos and suspicion, where officials left her in the dark, falsely declared she was feverish then bundled her into an emergency motorcade to hospital. On arriving at University Hospital in Newark, Hickox tested negative for Ebola - but must remain under quarantine for the next three weeks under new rules. It follows widespread concern about how Dr Craig Spencer, another Doctors Without Borders worker, was allowed to go free in New York for a week before it emerged that he had Ebola. Hickox, writing for the Dallas News, said that she was held at immigration for three hours while a

Chunk 5 (score: 0.4922):
Leone and Guinea. However, Hickox said she wouldn't let her experience prevent her from returning to West Africa. Getting life back on track: Hickox said she plans to have dinner with her boyfriend Ted Wilbur to mark the end of the deadly disease's incubation period, but she's not sure what kind of reception she'll get . 'Something like quarantine is not going to scare me from doing the work that I love,' she told The Associated Press from her home in Fort Kent in northern Maine. 'I would return to Sierra Leone in a heartbeat.' Hickox said she plans to have dinner with her boyfriend to mark the end of the deadly disease's incubation period, but she's not sure what kind of reception she'll get. She has been hailed by some and vilified by others for refusing to be quarantined. Most people have been supportive, she said, but others have been hateful. She received a letter from one person who said he hoped she would catch Ebola and die. 'We're still thankful we've had a lot of great support in this community but I'd be lying if I said that it didn't make me a little bit nervous thinking about

Chunk 6 (score: 0.4828):
individuals and families, and of course the social and emotional costs of not instituting quarantine. Certainly, there are political costs for instituting or not instituting quarantine, and there is a chilling effect on volunteerism for hazardous duty in the hot zone if volunteers can expect to be quarantined on their return home. In addition, there are practical issues such as the nature of the quarantine (voluntary vs. involuntary), the length of quarantine, the locations of quarantine and alternatives to quarantine. But beyond all of these things, there are science and technology -- and this is critical. At a time when we have 21st-century tools and knowledge to decrease the world's -- and the U.S. -- risk from Ebola, to preserve public freedom as best as possible, and to act to reduce not only the health risk but also the economic and social toll from this outbreak, can't we come up with a better solution than quarantine? Yes, we can. Science, coupled with our on-the-ground experience in West Africa, demonstrates that a person who has the Ebola virus inside his or her body but who has not yet developed any symptoms does not have enough virus to share and is not

Chunk 7 (score: 0.4517):
the point, why has the U.S. Defense Department decided it will quarantine military personnel returning from Ebola-stricken areas of West Africa for 21 days, while the federal government has decided not to enforce quarantine, and multiple states have implemented various quarantine procedures? Legally, the federal government has the authority to quarantine. U.S. Code grants the Public Health Service quarantine and inspection authority, and similar laws exist in most state constitutions. However, these powers have rarely been used, in part because advances in medicine have made such measures superfluous. In 2014, we must evaluate laws, treatments and public health measures and adapt them to the times. We must ask if this ancient measure is applicable and reasonable during this Ebola public health emergency. There are multiple considerations; public safety is a big one. But then there are costs: economic costs of instituting quarantine to the individual and his or her family and employer, and to local, state and national government; economic costs of not instituting quarantine -- i.e., the cost of providing health care if additional cases of Ebola arise. There is a price to pay in individual freedom as well as for the social and emotional effects of quarantine to

Chunk 8 (score: 0.4317):
and her nephews must wait 21 days from when Duncan first showed symptoms before they can leave the apartment. That's because Ebola can be in a person for that long before it manifests itself, and someone starts to feel sick. Reflecting on it all, Louise said Thursday, "I'm just hanging in there, depending on God to save our lives." County official: Quarantined four should be relocated . If it were up to the Dallas County director of homeland security, the four people quarantined shouldn't be stuck in the apartment at all. Judge Clay Jenkins, also director of the county's Homeland Security and Emergency Management, said officials are working on that relocation after Duncan's partner told CNN of being forced to live with distressing living conditions. Jenkins acknowledged "some hygiene issues" in the apartment. "I would like to see those people moved to better living conditions," Jenkins told CNN's Jake Tapper on Thursday afternoon. "We are working on that. I would like to move them five minutes ago." Jenkins acknowledged problems with Louise's apartment but defended the overall government response. "We have some hygiene issues that we are addressing in that apartment," Jenkins said earlier in the day. "Those people in

Chunk 9 (score: 0.4205):
people will be automatically quarantined for 21 days. This includes doctors. It will be coordinated with local health departments. Governor Christie tweeted on Friday: 'Today, a healthcare worker arrived at Newark Airport, w/ a recent history of treating patients w/ Ebola in West Africa, but w/ no symptoms.' He said that the New Jersey Department of Health determined that a legal quarantine order should be issued. He added: 'This woman, while her home residence is outside of this area, her next stop was going to be here in NY.' However, the rules have not been met with universal support. New York City mayor Bill de Blasio sounded a note of caution over a 'chilling effect' on innocent doctors and nurses. A spokesman for the mayor said: 'The mayor wants to work closely with our state partners, but he wants to make sure that there will not be any sort of chilling effect on medical workers who might want to go over to help.' The New Jersey Department of Health was not available to comment. September 16: Dr Craig Spencer flew to Guinea to treat Ebola patients as a member of the French organization Doctors Without Borders (Medecins Sans Frontiers) October

Chunk 10 (score: 0.4202):
"That worries me now, yes." He's not alone. Dr. Irwin Redlener, a professor at Columbia University's school of public health, called the handling of the quarantine "hair raising." "If, in fact, there was a health worker there every day, what exactly were they doing other than just taking her temperature and leaving?" Redlener said. "It should have been a whole lot more that (Louse) should be expecting." Ebola only spreads through infected bodily fluids. And it's distinctly possible that no one around Duncan got exposed to his. (Louise said she didn't think she had.) Still, CNN's Dr. Sanjay Gupta said the continuing presence of the sheets, on which Duncan may have transmitted the virus through sweating, are disturbing. "We've talked about the fact that this virus can live outside the body, can live on surfaces. It's unlikely for it to be transmitted to someone else that way," Gupta said. "But why take a chance?" Gupta said that "it is hard to believe (the oversight) and there aren't good explanations here." One Ebola expert, Dr. Alexander van Tulleken, also said the federal response to the first Ebola case on U.S. soil seemed troubling. "So far, we don't seem to reacting as

✅ For Question: Kaci Hickox’s challenge to quarantine & modern alternatives The top 3 chunks (scores 0.60–0.56) highlight:

1️⃣ Kaci Hickox’s critique of her quarantine (felt treated like a criminal, wanted to be treated with dignity). 2️⃣ Her stance as a nurse and how she spoke out against quarantine for returning healthcare workers. 3️⃣ Broader discussion on modern quarantine alternatives (like electronic wrist devices for monitoring, the 21st-century science-based approach).

🟢 Excellent coverage: The retriever found chunks that directly mention her challenge to quarantine (social/political implications) and the modern science-based alternatives she supported.

🔍 Observations & Takeaways¶

✅ Retriever Performance:

Chunks 1–3 for both questions are highly relevant and factually aligned with the original article.
Scores drop sharply after chunk 3 (chunks 4–10 cover other articles, random topics, or less relevant details).
Shows that a top-k of 3–5 is ideal for RAG in this case — beyond that, noise increases.

✅ Use in RAG:

Top 3–5 chunks will give the LLM high-quality, focused evidence for grounded answers.
Less relevant chunks (e.g., 4–10) wouldn’t be included for RAG context!

✅ Next Steps:

Pass the top 3 chunks as context to the LLM for answer generation,
Compare to LLM-only generation for the same question (no retrieval).
See how adding retrieved evidence improves factual accuracy!

🧩 Building a RAG Prompt for LLM Answer Generation¶

✅ In a Retrieval-Augmented Generation (RAG) setup, the retrieved chunks serve as factual context for the LLM.

✅ The LLM is tasked with answering the user question based only on the retrieved evidence, ensuring more grounded, reliable responses.

🔍 How the Prompt is Structured¶

1️⃣ System message:

Clearly states the LLM’s role as a factual question-answering assistant.

2️⃣ User message:

Includes:
- The question
- The top retrieved chunks as context
- A clear instruction to generate an answer only from the provided context.

⚠️ Why It Matters?¶

✅ Prevents the LLM from hallucinating or adding information that isn’t in the retrieved documents.
✅ Forces the LLM to focus on the evidence, like a student citing notes!

Let’s write a Python function that builds this prompt and generates an answer.

In [20]:

Copied!





from litellm import completion
import textwrap

def generate_rag_answer(question, retrieved_chunks, model_name="gpt-4o-mini", temperature=0.2):
    """
    Generates an answer using an LLM, grounded in retrieved chunks.
    """
    # Combine retrieved chunks as context
    context = "\n\n".join([f"Chunk {i+1}:\n{chunk['chunk']}" for i, chunk in enumerate(retrieved_chunks)])

    # RAG-style prompt
    prompt = f"""
You are a helpful, factual question-answering assistant. Please answer the question below
*only* using the provided context. If the context does not contain enough information, say so explicitly.

Question: {question}

Context:
{context}

Answer:"""

    messages = [
        {"role": "system", "content": "You are a helpful, factual question-answering assistant."},
        {"role": "user", "content": prompt}
    ]

    response = completion(
        model=model_name,
        messages=messages,
        temperature=temperature
    )
    answer = response["choices"][0]["message"]["content"].strip()
    return answer

question =  "What reasons did George Pataki give for stepping down from public office?"
top_chunks = retrieve_top_k_chunks(question, k=3)
rag_answer_1 = generate_rag_answer(question, top_chunks)
print("💡 RAG-Generated Answer:")
print(textwrap.fill(rag_answer_1, width=80))
from litellm import completion
import textwrap

def generate_rag_answer(question, retrieved_chunks, model_name="gpt-4o-mini", temperature=0.2):
    """
    Generates an answer using an LLM, grounded in retrieved chunks.
    """
    # Combine retrieved chunks as context
    context = "\n\n".join([f"Chunk {i+1}:\n{chunk['chunk']}" for i, chunk in enumerate(retrieved_chunks)])

    # RAG-style prompt
    prompt = f"""
You are a helpful, factual question-answering assistant. Please answer the question below
*only* using the provided context. If the context does not contain enough information, say so explicitly.

Question: {question}

Context:
{context}

Answer:"""

    messages = [
        {"role": "system", "content": "You are a helpful, factual question-answering assistant."},
        {"role": "user", "content": prompt}
    ]

    response = completion(
        model=model_name,
        messages=messages,
        temperature=temperature
    )
    answer = response["choices"][0]["message"]["content"].strip()
    return answer

question =  "What reasons did George Pataki give for stepping down from public office?"
top_chunks = retrieve_top_k_chunks(question, k=3)
rag_answer_1 = generate_rag_answer(question, top_chunks)
print("💡 RAG-Generated Answer:")
print(textwrap.fill(rag_answer_1, width=80))

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

💡 RAG-Generated Answer:
George Pataki gave several reasons for stepping down from public office. He
stated that it was the right decision for him, his family, the team that had
worked with him, and for the state. He also mentioned that "it was time" for him
to leave office after 12 years as governor. Additionally, he expressed that the
transition to private life was not difficult for him and that he embraced a
sense of normalcy after leaving office.

In [21]:

Copied!





# Example: Generate an answer for Question 1 using the top 3 chunks
question =  "Why did Kaci Hickox challenge the quarantine order, and what are the modern alternatives to quarantine?"
top_chunks = retrieve_top_k_chunks(question, k=3)
rag_answer = generate_rag_answer(question, top_chunks)
print("💡 RAG-Generated Answer:")
print(textwrap.fill(rag_answer, width=80))
# Example: Generate an answer for Question 1 using the top 3 chunks
question =  "Why did Kaci Hickox challenge the quarantine order, and what are the modern alternatives to quarantine?"
top_chunks = retrieve_top_k_chunks(question, k=3)
rag_answer = generate_rag_answer(question, top_chunks)
print("💡 RAG-Generated Answer:")
print(textwrap.fill(rag_answer, width=80))

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

💡 RAG-Generated Answer:
Kaci Hickox challenged the quarantine order because she felt that the treatment
of returning health care workers was unjust and that they should be treated with
dignity and humanity. She expressed concern about the stigma and isolation faced
by those who had helped in the Ebola crisis, stating that she did not want to
feel like a criminal or prisoner after her service. Hickox also highlighted the
need for more health care workers to assist in fighting the epidemic in West
Africa and advocated for a public education campaign to better inform people
about the virus.  As for modern alternatives to quarantine, the context mentions
"direct active monitoring," which involves coordinating movements with public
health authorities and reporting any symptoms without imposing strict movement
restrictions. This approach allows for monitoring while enabling individuals to
maintain more normal activities compared to traditional quarantine measures.

🧠 Pure LLM Answering – Without Retrieval¶

✅ In this step, we ask the same questions as before –
but without providing any chunks or evidence from the articles.

✅ This is what we call a pure LLM approach:

The LLM answers based only on its training data.
It may rely on general world knowledge but not on the specific article facts.

⚖️ Why Compare to RAG?¶

By comparing answers from:

1️⃣ LLM-only (no retrieval), vs.
2️⃣ RAG (with top-3 chunks as evidence),

…we can see how much the retrieval grounding improves factual accuracy and specificity.

✅ Let’s implement the pure LLM prompt next!

In [23]:

Copied!





def generate_llm_only_answer(question, model_name="gpt-4o-mini", temperature=0.2):
    """
    Generates an answer using an LLM, without any retrieved context.
    """
    messages = [
        {"role": "system", "content": "You are a helpful, factual question-answering assistant."},
        {"role": "user", "content": question}
    ]

    response = completion(
        model=model_name,
        messages=messages,
        temperature=temperature
    )
    return response["choices"][0]["message"]["content"].strip()


question =  "What reasons did George Pataki give for stepping down from public office?"
llm_answer = generate_llm_only_answer(question)
print("\n💡 LLM-Only Answer:\n")
print(textwrap.fill(llm_answer, width=80))

# Generate LLM-only answer for Question 2 (Kaci Hickox quarantine)
question =  "Why did Kaci Hickox challenge the quarantine order, and what are the modern alternatives to quarantine?"
llm_answer = generate_llm_only_answer(question)
print("\n💡 LLM-Only Answer:\n")
print(textwrap.fill(llm_answer, width=80))
def generate_llm_only_answer(question, model_name="gpt-4o-mini", temperature=0.2):
    """
    Generates an answer using an LLM, without any retrieved context.
    """
    messages = [
        {"role": "system", "content": "You are a helpful, factual question-answering assistant."},
        {"role": "user", "content": question}
    ]

    response = completion(
        model=model_name,
        messages=messages,
        temperature=temperature
    )
    return response["choices"][0]["message"]["content"].strip()


question =  "What reasons did George Pataki give for stepping down from public office?"
llm_answer = generate_llm_only_answer(question)
print("\n💡 LLM-Only Answer:\n")
print(textwrap.fill(llm_answer, width=80))

# Generate LLM-only answer for Question 2 (Kaci Hickox quarantine)
question =  "Why did Kaci Hickox challenge the quarantine order, and what are the modern alternatives to quarantine?"
llm_answer = generate_llm_only_answer(question)
print("\n💡 LLM-Only Answer:\n")
print(textwrap.fill(llm_answer, width=80))

💡 LLM-Only Answer:

George Pataki, the former Governor of New York, cited several reasons for
stepping down from public office after his third term ended in 2006. He
mentioned the desire to pursue new opportunities and challenges outside of the
political arena. Pataki expressed a need for personal and professional growth,
as well as a wish to spend more time with his family. Additionally, he indicated
that he felt it was time for new leadership and perspectives in New York. His
decision was also influenced by the natural conclusion of his tenure, as he had
already served three terms as governor.

💡 LLM-Only Answer:

Kaci Hickox, a nurse who returned to the United States after treating Ebola
patients in West Africa in 2014, challenged the quarantine order imposed on her
by the state of New Jersey. Hickox argued that the mandatory quarantine was
unnecessary and not based on scientific evidence, as she had shown no symptoms
of Ebola and had tested negative for the virus. She believed that the quarantine
was more about public fear than actual risk, and she sought to protect her
rights and advocate for a more rational approach to public health measures.  In
terms of modern alternatives to quarantine, several strategies have been
developed to manage infectious diseases while minimizing disruption to
individuals' lives. These alternatives include:  1. **Self-Monitoring**:
Individuals are advised to monitor their health for symptoms of the disease and
report any changes to health authorities.  2. **Active Monitoring**: Health
officials regularly check in with individuals who may have been exposed to a
disease, ensuring they are symptom-free and providing guidance on what to do if
symptoms develop.  3. **Testing**: Widespread testing can help identify infected
individuals quickly, allowing for targeted isolation rather than broad
quarantine measures.  4. **Vaccination**: Vaccines can prevent the spread of
infectious diseases, reducing the need for quarantine by lowering the number of
susceptible individuals.  5. **Travel Restrictions**: Instead of quarantine,
authorities may implement travel restrictions or advisories for certain regions
to limit the spread of disease.  6. **Public Health Campaigns**: Educating the
public about symptoms, transmission, and prevention can help reduce fear and
promote responsible behavior without the need for strict quarantine measures.
These alternatives aim to balance public health safety with individual rights
and societal functioning.

🔍 Comparison of LLM-Only vs. RAG Answers¶

✏️ Observations¶

✅ LLM-Only Answers

Answers are well-structured and provide general, textbook-like responses.
They are factually plausible but not always tied directly to the specific article content.
They sometimes include extra details (like testing, vaccination, travel restrictions) that were not in the article but are part of the LLM’s general knowledge.

✅ RAG-Generated Answers

Answers are concise and closely aligned with the article’s actual text.
They mirror the language and points made in the retrieved chunks, making them more context-specific.
They avoid hallucination (inventing new information) and stay grounded in what was retrieved from Qdrant.

🟢 Key Differences¶

The LLM-only approach tends to generalize and expand on the question using its broader training data.
The RAG approach is more precise and factual—it directly reflects the source material.
RAG’s answers are shorter and sharper, focusing on the evidence provided, while LLM-only answers are more discursive.

⚠️ Implications¶

RAG provides grounded, specific answers – it’s great for factual QA when exact information is critical (e.g., journalism, scientific retrieval).
LLM-only is useful for general knowledge but can include hallucinated or extraneous content.
In a real-world system, verifiability matters—RAG helps to ensure that answers are backed by actual documents, which is critical for trustworthiness.

🔍 Reranking in RAG¶

While our retriever has done a great job pulling in relevant chunks of information from our corpus, as the question were pretty linked to a single articles we can imagine much more complex questions that necesitate documents from several sources. The next step is to enhance the quality of these retrieved results through reranking. Reranking is especially useful because:

✅ It helps us identify the most relevant chunks from the top retrieved ones, rather than relying solely on initial similarity scores. ✅ It improves the final context passed to the language model for answer generation, leading to more precise and comprehensive responses. ✅ It mitigates issues like partial matches or overly generic chunks dominating the retrieval phase.

Why do we need reranking?
Although vector similarity (like cosine distance) helps to narrow down chunks related to a query, these initial scores don't always reflect true relevance or semantic richness of the content for the given question. Reranking helps reorder the top-k chunks based on a more nuanced evaluation of their relevance.

Some common reranking techniques:

Cross-Encoder Models (like BERT-based re-rankers): Take a query and a candidate chunk and return a fine-grained relevance score by jointly processing them.
Hybrid Scoring (like dense+BM25 fusion): Combine dense embeddings with traditional keyword-based scores to balance precision and recall.
Contextualized Late Interaction over BERT: reranking using the colbert-ir/colbertv2.0 model. ColBERT computes late interaction scores by computing similarity matrix between all query and document token embeddings, then taking max similarity for each query token and summing across all query tokens

How Qdrant fits in:
Qdrant allows seamless integration with reranking steps because it provides efficient retrieval of top-k candidates. Once we have those, we can plug them into a reranker (like a cross-encoder or a separate re-ranking model) to get improved ordering.

🚀 Next, we'll explore implementing a reranking step with our Qdrant retriever and see how our retriever's performance changes with this additional layer of semantic evaluation!

We will create a couple of more complex questions to see the differences.

In [32]:

Copied!





questions = [
    "How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?",
    "How do breakthroughs in cancer treatments and technology show the promise of medical innovation for improving lives?"
]
questions = [
    "How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?",
    "How do breakthroughs in cancer treatments and technology show the promise of medical innovation for improving lives?"
]

1️⃣ Cross-Encoder Reranking with a transformer model¶

We'll use a popular model from Hugging Face (cross-encoder/ms-marco-MiniLM-L-6-v2) to rerank the initial top-k retrieved chunks by re-scoring them with joint encoding.

In [69]:

Copied!





from sentence_transformers import CrossEncoder
import time
from IPython.display import Markdown, display

# Initialize the cross-encoder model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_rerank(question, retrieved_chunks):
    # Create pairs of (question, chunk) for scoring
    pairs = [[question, chunk["chunk"]] for chunk in retrieved_chunks]
    scores = cross_encoder.predict(pairs)

    # Add scores to chunks and sort
    for idx, chunk in enumerate(retrieved_chunks):
        chunk["rerank_score"] = scores[idx]

    return sorted(retrieved_chunks, key=lambda x: x["rerank_score"], reverse=True)

# Example usage for our first question
top_chunks = retrieve_top_k_chunks(questions[0], k=20)
start_time = time.time()
reranked_chunks = cross_encoder_rerank(questions[0], top_chunks)
end_time = time.time()
processing_time = end_time - start_time

# Create a dataframe to tabulate the info
df_results = pd.DataFrame({
    "Chunk": [c["chunk"][:80] + "..." for c in reranked_chunks],
    "Cosine Similarity Score": [c["score"] for c in reranked_chunks],
    "Reranked Cross Encoder Score": [c["rerank_score"] for c in reranked_chunks],
})

# Sort by the reranked score (descending)
df_results.sort_values(by="Cosine Similarity Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Original Rank (Cosine)"] = range(1, 21)
df_results.sort_values(by="Reranked Cross Encoder Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Final Rank (Reranked)"] = range(1, 21)

markdown_table = df_results[["Final Rank (Reranked)", "Original Rank (Cosine)", "Cosine Similarity Score", "Reranked Cross Encoder Score", "Chunk"]].to_markdown(index=False)
display(Markdown(f"### 🔥 Reranked Chunks Table for: '{questions[0]}'\n\n{markdown_table}\n\n⏱️ **Time to process reranking:** {processing_time:.2f} seconds"))
from sentence_transformers import CrossEncoder
import time
from IPython.display import Markdown, display

# Initialize the cross-encoder model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_rerank(question, retrieved_chunks):
    # Create pairs of (question, chunk) for scoring
    pairs = [[question, chunk["chunk"]] for chunk in retrieved_chunks]
    scores = cross_encoder.predict(pairs)

    # Add scores to chunks and sort
    for idx, chunk in enumerate(retrieved_chunks):
        chunk["rerank_score"] = scores[idx]

    return sorted(retrieved_chunks, key=lambda x: x["rerank_score"], reverse=True)

# Example usage for our first question
top_chunks = retrieve_top_k_chunks(questions[0], k=20)
start_time = time.time()
reranked_chunks = cross_encoder_rerank(questions[0], top_chunks)
end_time = time.time()
processing_time = end_time - start_time

# Create a dataframe to tabulate the info
df_results = pd.DataFrame({
    "Chunk": [c["chunk"][:80] + "..." for c in reranked_chunks],
    "Cosine Similarity Score": [c["score"] for c in reranked_chunks],
    "Reranked Cross Encoder Score": [c["rerank_score"] for c in reranked_chunks],
})

# Sort by the reranked score (descending)
df_results.sort_values(by="Cosine Similarity Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Original Rank (Cosine)"] = range(1, 21)
df_results.sort_values(by="Reranked Cross Encoder Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Final Rank (Reranked)"] = range(1, 21)

markdown_table = df_results[["Final Rank (Reranked)", "Original Rank (Cosine)", "Cosine Similarity Score", "Reranked Cross Encoder Score", "Chunk"]].to_markdown(index=False)
display(Markdown(f"### 🔥 Reranked Chunks Table for: '{questions[0]}'\n\n{markdown_table}\n\n⏱️ **Time to process reranking:** {processing_time:.2f} seconds"))

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

🔥 Reranked Chunks Table for: 'How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?'¶

| Final Rank (Reranked) | |------------------------:| | 1 | | 2 | | 3 | | 4 | | 5 | | 6 | | 7 | | 8 | | 9 | | 10 | | 11 | | 12 | | 13 | | 14 | | 15 | | 16 | | 17 | | 18 | | 19 | | 20 | Original Rank (Cosine) | Cosine Similarity Score | Reranked Cross Encoder Score | Chunk | -------------------------:|--------------------------:|-------------------------------:|:------------------------------------------------------------------------------------| 5 | 0.444744 | -7.19859 | president Guy L. Smith IV. 'This is about Brown-Forman trying to stifle competit... | 7 | 0.434182 | -8.76989 | that makes George Dickel, another famed Tennessee brand . Jack Daniel's stores i... | 2 | 0.514916 | -8.78003 | Brier Distillery in Nashville, said he supports tighter regulation. 'Holding our... | 3 | 0.482396 | -9.63583 | By . Associated Press . PUBLISHED: . 12:53 EST, 17 March 2014 . | . UPDATED: . 1... | 1 | 0.574478 | -9.8132 | and be aged in new, charred white oak barrels. Spirits that don't follow those g... | 11 | 0.385961 | -9.96246 | It's a fashionable way to shed the pounds. But following the Paleo or caveman di... | 6 | 0.440242 | -10.1015 | (CNN) -- In these days of austerity, thin profit margins, low competitiveness an... | 4 | 0.466174 | -10.3925 | to weaken a title on a label that we've worked very hard for,' said Jeff Arnett,... | 16 | 0.368845 | -10.8301 | risk factors for heart disease and diabetes over time. ‘While the reasons underl... | 13 | 0.375881 | -11.0408 | to have healing properties), and cream of tartar (a byproduct of winemaking that... | 9 | 0.396736 | -11.0529 | Martin . said: 'The company has always been innovative and this is an exciting .... | 10 | 0.390561 | -11.0929 | ranges in severity from benign tumors that need little or no treatment to very a... | 14 | 0.375721 | -11.153 | the odds of heart disease and strokes. Warning signs include high blood pressure... | 17 | 0.36777 | -11.2918 | cans and description of the father of the nation by the brewery is highly condem... | 19 | 0.363044 | -11.3069 | . campaign, the restaurant announced that it would pick up the bill for . any pa... | 12 | 0.383915 | -11.3104 | With the festive season in full swing many of us will be fretting about our over... | 8 | 0.41942 | -11.3293 | premium brand beers and spirits and wine from Provence. After that, I padded bar... | 18 | 0.367584 | -11.3558 | out with a load of other stuff you hadn’t planned on getting because you see all... | 15 | 0.375596 | -11.3678 | such as Marks & Spencer. And two weeks ago, Harvester opened . its first motorwa... | 20 | 0.362765 | -11.3787 | my sleep and I thought it was funny. 'After we had the banner made people starte... |

⏱️ Time to process reranking: 0.54 seconds

2️⃣ Hybrid Scoring (Dense + BM25 Fusion)¶

We'll simulate a hybrid approach by combining our dense similarity (cosine) with a BM25 score from a TfidfVectorizer. This balances precision and recall.

In [71]:

Copied!





from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_rerank(query, top_chunks, alpha=0.5):
    # Prepare texts for BM25
    texts = [chunk['chunk'] for chunk in top_chunks]
    tokenized_texts = [text.split() for text in texts]
    
    # BM25 scoring
    bm25 = BM25Okapi(tokenized_texts)
    bm25_scores = bm25.get_scores(query.split())
    
    # Get original semantic scores
    semantic_scores = np.array([chunk['score'] for chunk in top_chunks])
    
    # Normalize scores to 0-1 range
    bm25_scores = np.array(bm25_scores)
    if bm25_scores.max() > bm25_scores.min():
        bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    else:
        bm25_norm = np.ones_like(bm25_scores) * 0.5
    
    if semantic_scores.max() > semantic_scores.min():
        semantic_norm = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min())
    else:
        semantic_norm = np.ones_like(semantic_scores) * 0.5
    
    # Combine scores
    hybrid_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm
    
    # Add hybrid scores to chunks
    reranked_chunks = []
    for i, chunk in enumerate(top_chunks):
        new_chunk = chunk.copy()
        new_chunk['bm25_score'] = float(bm25_scores[i])
        new_chunk['hybrid_score'] = float(hybrid_scores[i])
        reranked_chunks.append(new_chunk)

    return reranked_chunks

# Example usage for our second question
top_chunks = retrieve_top_k_chunks(questions[0], k=20)
start_time = time.time()
reranked_chunks = hybrid_rerank(questions[0], top_chunks)
end_time = time.time()
processing_time = end_time - start_time

# Create a dataframe to tabulate the info
df_results = pd.DataFrame({
    "Chunk": [c["chunk"][:80] + "..." for c in reranked_chunks],
    "Cosine Similarity Score": [c["score"] for c in reranked_chunks],
    "Reranked Hybrid Score": [c["hybrid_score"] for c in reranked_chunks],
})

# Sort by the reranked score (descending)
df_results.sort_values(by="Cosine Similarity Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Original Rank (Cosine)"] = range(1, 21)
df_results.sort_values(by="Reranked Hybrid Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Final Rank (Reranked)"] = range(1, 21)

# Display as Markdown
from IPython.display import Markdown, display

markdown_table = df_results[["Final Rank (Reranked)", "Original Rank (Cosine)", "Cosine Similarity Score", "Reranked Hybrid Score", "Chunk"]].to_markdown(index=False)
display(Markdown(f"### 🔥 Reranked Chunks Table for: '{questions[0]}'\n\n{markdown_table}\n\n⏱️ **Time to process reranking:** {processing_time:.2f} seconds"))
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_rerank(query, top_chunks, alpha=0.5):
    # Prepare texts for BM25
    texts = [chunk['chunk'] for chunk in top_chunks]
    tokenized_texts = [text.split() for text in texts]
    
    # BM25 scoring
    bm25 = BM25Okapi(tokenized_texts)
    bm25_scores = bm25.get_scores(query.split())
    
    # Get original semantic scores
    semantic_scores = np.array([chunk['score'] for chunk in top_chunks])
    
    # Normalize scores to 0-1 range
    bm25_scores = np.array(bm25_scores)
    if bm25_scores.max() > bm25_scores.min():
        bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    else:
        bm25_norm = np.ones_like(bm25_scores) * 0.5
    
    if semantic_scores.max() > semantic_scores.min():
        semantic_norm = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min())
    else:
        semantic_norm = np.ones_like(semantic_scores) * 0.5
    
    # Combine scores
    hybrid_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm
    
    # Add hybrid scores to chunks
    reranked_chunks = []
    for i, chunk in enumerate(top_chunks):
        new_chunk = chunk.copy()
        new_chunk['bm25_score'] = float(bm25_scores[i])
        new_chunk['hybrid_score'] = float(hybrid_scores[i])
        reranked_chunks.append(new_chunk)

    return reranked_chunks

# Example usage for our second question
top_chunks = retrieve_top_k_chunks(questions[0], k=20)
start_time = time.time()
reranked_chunks = hybrid_rerank(questions[0], top_chunks)
end_time = time.time()
processing_time = end_time - start_time

# Create a dataframe to tabulate the info
df_results = pd.DataFrame({
    "Chunk": [c["chunk"][:80] + "..." for c in reranked_chunks],
    "Cosine Similarity Score": [c["score"] for c in reranked_chunks],
    "Reranked Hybrid Score": [c["hybrid_score"] for c in reranked_chunks],
})

# Sort by the reranked score (descending)
df_results.sort_values(by="Cosine Similarity Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Original Rank (Cosine)"] = range(1, 21)
df_results.sort_values(by="Reranked Hybrid Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Final Rank (Reranked)"] = range(1, 21)

# Display as Markdown
from IPython.display import Markdown, display

markdown_table = df_results[["Final Rank (Reranked)", "Original Rank (Cosine)", "Cosine Similarity Score", "Reranked Hybrid Score", "Chunk"]].to_markdown(index=False)
display(Markdown(f"### 🔥 Reranked Chunks Table for: '{questions[0]}'\n\n{markdown_table}\n\n⏱️ **Time to process reranking:** {processing_time:.2f} seconds"))

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

🔥 Reranked Chunks Table for: 'How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?'¶

| Final Rank (Reranked) | Original Rank (Cosine) | Cosine Similarity Score | Reranked Hybrid Score | Chunk | |------------------------:|-------------------------:|--------------------------:|------------------------:|:------------------------------------------------------------------------------------| | 1 | 1 | 0.574478 | 0.631693 | and be aged in new, charred white oak barrels. Spirits that don't follow those g... | | 2 | 5 | 0.444744 | 0.621462 | president Guy L. Smith IV. 'This is about Brown-Forman trying to stifle competit... | | 3 | 2 | 0.514916 | 0.610322 | Brier Distillery in Nashville, said he supports tighter regulation. 'Holding our... | | 4 | 4 | 0.466174 | 0.59529 | to weaken a title on a label that we've worked very hard for,' said Jeff Arnett,... | | 5 | 6 | 0.440242 | 0.59184 | (CNN) -- In these days of austerity, thin profit margins, low competitiveness an... | | 6 | 13 | 0.375881 | 0.530976 | to have healing properties), and cream of tartar (a byproduct of winemaking that... | | 7 | 16 | 0.368845 | 0.4791 | risk factors for heart disease and diabetes over time. ‘While the reasons underl... | | 8 | 3 | 0.482396 | 0.449895 | By . Associated Press . PUBLISHED: . 12:53 EST, 17 March 2014 . | . UPDATED: . 1... | | 9 | 10 | 0.390561 | 0.418354 | ranges in severity from benign tumors that need little or no treatment to very a... | | 10 | 11 | 0.385961 | 0.410612 | It's a fashionable way to shed the pounds. But following the Paleo or caveman di... | | 11 | 9 | 0.396736 | 0.360107 | Martin . said: 'The company has always been innovative and this is an exciting .... | | 12 | 7 | 0.434182 | 0.346554 | that makes George Dickel, another famed Tennessee brand . Jack Daniel's stores i... | | 13 | 8 | 0.41942 | 0.239148 | premium brand beers and spirits and wine from Provence. After that, I padded bar... | | 14 | 12 | 0.383915 | 0.117205 | With the festive season in full swing many of us will be fretting about our over... | | 15 | 18 | 0.367584 | 0.102468 | out with a load of other stuff you hadn’t planned on getting because you see all... | | 16 | 14 | 0.375721 | 0.101787 | the odds of heart disease and strokes. Warning signs include high blood pressure... | | 17 | 20 | 0.362765 | 0.0965417 | my sleep and I thought it was funny. 'After we had the banner made people starte... | | 18 | 17 | 0.36777 | 0.0930686 | cans and description of the father of the nation by the brewery is highly condem... | | 19 | 15 | 0.375596 | 0.0925927 | such as Marks & Spencer. And two weeks ago, Harvester opened . its first motorwa... | | 20 | 19 | 0.363044 | 0.000660256 | . campaign, the restaurant announced that it would pick up the bill for . any pa... |

⏱️ Time to process reranking: 0.00 seconds

3️⃣ ColBERT Late Interaction Reranker¶

We'll implement ColBERT (Contextualized Late Interaction over BERT) reranking using the colbert-ir/colbertv2.0 model. ColBERT computes late interaction scores by:

Query Processing: Tokenizing the query with special ColBERT format [CLS] [Q] {query} [MASK] repeated 32 times
Document Processing: Tokenizing each document with [CLS] [D] {document} format
Late Interaction Scoring: Computing similarity matrix between all query and document token embeddings, then taking max similarity for each query token and summing across all query tokens

This approach provides more fine-grained token-level interactions compared to single-vector similarity, leading to better relevance scoring while maintaining efficiency.

In [72]:

Copied!





import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
import pandas as pd
import time
from IPython.display import Markdown, display

def colbert_rerank(query, top_chunks, model_name="colbert-ir/colbertv2.0"):
    """
    Rerank chunks using ColBERT late interaction scoring
    """
    # Load model and tokenizer using transformers
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model = model.to(device)
    model.eval()
    
    # Tokenize query with special ColBERT tokens
    query_text = f"[CLS] [Q] {query} [MASK]" * 32  # ColBERT query format
    query_tokens = tokenizer(
        query_text, 
        return_tensors="pt", 
        padding=True, 
        truncation=True, 
        max_length=512
    ).to(device)
    
    # Get query embeddings
    with torch.no_grad():
        query_outputs = model(**query_tokens)
        query_embeddings = query_outputs.last_hidden_state.squeeze(0)  # Remove batch dim
        # Normalize embeddings
        query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=-1)
    
    colbert_scores = []
    
    # Score each chunk
    for chunk in top_chunks:
        # Tokenize document with ColBERT format
        doc_text = f"[CLS] [D] {chunk['chunk']}"
        doc_tokens = tokenizer(
            doc_text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)
        
        # Get document embeddings
        with torch.no_grad():
            doc_outputs = model(**doc_tokens)
            doc_embeddings = doc_outputs.last_hidden_state.squeeze(0)  # Remove batch dim
            # Normalize embeddings
            doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=-1)
        
        # Compute late interaction score
        # ColBERT score: sum over query tokens of max similarity with doc tokens
        scores_matrix = torch.matmul(query_embeddings, doc_embeddings.transpose(0, 1))
        max_scores = torch.max(scores_matrix, dim=1)[0]  # Max over document tokens
        colbert_score = torch.sum(max_scores).item()  # Sum over query tokens
        
        colbert_scores.append(colbert_score)
    
    # Add ColBERT scores to chunks
    reranked_chunks = []
    for i, chunk in enumerate(top_chunks):
        new_chunk = chunk.copy()
        new_chunk['colbert_score'] = float(colbert_scores[i])
        reranked_chunks.append(new_chunk)
    
    # Sort by ColBERT score (descending)
    reranked_chunks.sort(key=lambda x: x['colbert_score'], reverse=True)
    
    return reranked_chunks

# Example usage for your first question
top_chunks = retrieve_top_k_chunks(questions[0], k=20)
start_time = time.time()
reranked_chunks = colbert_rerank(questions[0], top_chunks)
end_time = time.time()
processing_time = end_time - start_time

# Create a dataframe to tabulate the info
df_results = pd.DataFrame({
    "Chunk": [c["chunk"][:80] + "..." for c in reranked_chunks],
    "Cosine Similarity Score": [c["score"] for c in reranked_chunks],
    "Reranked Colbert Score": [c["colbert_score"] for c in reranked_chunks],
})

# Sort by the reranked score (descending)
df_results.sort_values(by="Cosine Similarity Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Original Rank (Cosine)"] = range(1, 21)
df_results.sort_values(by="Reranked Colbert Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Final Rank (Reranked)"] = range(1, 21)


markdown_table = df_results[["Final Rank (Reranked)", "Original Rank (Cosine)", "Cosine Similarity Score", "Reranked Colbert Score", "Chunk"]].to_markdown(index=False)
display(Markdown(f"### 🔥 Reranked Chunks Table for: '{questions[0]}'\n\n{markdown_table}\n\n⏱️ **Time to process reranking:** {processing_time:.2f} seconds"))
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
import pandas as pd
import time
from IPython.display import Markdown, display

def colbert_rerank(query, top_chunks, model_name="colbert-ir/colbertv2.0"):
    """
    Rerank chunks using ColBERT late interaction scoring
    """
    # Load model and tokenizer using transformers
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model = model.to(device)
    model.eval()
    
    # Tokenize query with special ColBERT tokens
    query_text = f"[CLS] [Q] {query} [MASK]" * 32  # ColBERT query format
    query_tokens = tokenizer(
        query_text, 
        return_tensors="pt", 
        padding=True, 
        truncation=True, 
        max_length=512
    ).to(device)
    
    # Get query embeddings
    with torch.no_grad():
        query_outputs = model(**query_tokens)
        query_embeddings = query_outputs.last_hidden_state.squeeze(0)  # Remove batch dim
        # Normalize embeddings
        query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=-1)
    
    colbert_scores = []
    
    # Score each chunk
    for chunk in top_chunks:
        # Tokenize document with ColBERT format
        doc_text = f"[CLS] [D] {chunk['chunk']}"
        doc_tokens = tokenizer(
            doc_text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)
        
        # Get document embeddings
        with torch.no_grad():
            doc_outputs = model(**doc_tokens)
            doc_embeddings = doc_outputs.last_hidden_state.squeeze(0)  # Remove batch dim
            # Normalize embeddings
            doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=-1)
        
        # Compute late interaction score
        # ColBERT score: sum over query tokens of max similarity with doc tokens
        scores_matrix = torch.matmul(query_embeddings, doc_embeddings.transpose(0, 1))
        max_scores = torch.max(scores_matrix, dim=1)[0]  # Max over document tokens
        colbert_score = torch.sum(max_scores).item()  # Sum over query tokens
        
        colbert_scores.append(colbert_score)
    
    # Add ColBERT scores to chunks
    reranked_chunks = []
    for i, chunk in enumerate(top_chunks):
        new_chunk = chunk.copy()
        new_chunk['colbert_score'] = float(colbert_scores[i])
        reranked_chunks.append(new_chunk)
    
    # Sort by ColBERT score (descending)
    reranked_chunks.sort(key=lambda x: x['colbert_score'], reverse=True)
    
    return reranked_chunks

# Example usage for your first question
top_chunks = retrieve_top_k_chunks(questions[0], k=20)
start_time = time.time()
reranked_chunks = colbert_rerank(questions[0], top_chunks)
end_time = time.time()
processing_time = end_time - start_time

# Create a dataframe to tabulate the info
df_results = pd.DataFrame({
    "Chunk": [c["chunk"][:80] + "..." for c in reranked_chunks],
    "Cosine Similarity Score": [c["score"] for c in reranked_chunks],
    "Reranked Colbert Score": [c["colbert_score"] for c in reranked_chunks],
})

# Sort by the reranked score (descending)
df_results.sort_values(by="Cosine Similarity Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Original Rank (Cosine)"] = range(1, 21)
df_results.sort_values(by="Reranked Colbert Score", ascending=False, inplace=True)
df_results.reset_index(drop=True, inplace=True)
df_results["Final Rank (Reranked)"] = range(1, 21)


markdown_table = df_results[["Final Rank (Reranked)", "Original Rank (Cosine)", "Cosine Similarity Score", "Reranked Colbert Score", "Chunk"]].to_markdown(index=False)
display(Markdown(f"### 🔥 Reranked Chunks Table for: '{questions[0]}'\n\n{markdown_table}\n\n⏱️ **Time to process reranking:** {processing_time:.2f} seconds"))

/var/folders/2z/g737jg9d2jj206wkf56g7gyc0000gn/T/ipykernel_40961/2354009276.py:9: DeprecationWarning: `search` method is deprecated and will be removed in the future. Use `query_points` instead.
  search_results = qdrant.search(

🔥 Reranked Chunks Table for: 'How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?'¶

| Final Rank (Reranked) | Original Rank (Cosine) | Cosine Similarity Score | Reranked Colbert Score | Chunk | |------------------------:|-------------------------:|--------------------------:|-------------------------:|:------------------------------------------------------------------------------------| | 1 | 7 | 0.434182 | 290.547 | that makes George Dickel, another famed Tennessee brand . Jack Daniel's stores i... | | 2 | 2 | 0.514916 | 287.272 | Brier Distillery in Nashville, said he supports tighter regulation. 'Holding our... | | 3 | 1 | 0.574478 | 283.657 | and be aged in new, charred white oak barrels. Spirits that don't follow those g... | | 4 | 5 | 0.444744 | 283.302 | president Guy L. Smith IV. 'This is about Brown-Forman trying to stifle competit... | | 5 | 6 | 0.440242 | 282.099 | (CNN) -- In these days of austerity, thin profit margins, low competitiveness an... | | 6 | 4 | 0.466174 | 281.626 | to weaken a title on a label that we've worked very hard for,' said Jeff Arnett,... | | 7 | 12 | 0.383915 | 280.421 | With the festive season in full swing many of us will be fretting about our over... | | 8 | 16 | 0.368845 | 279.302 | risk factors for heart disease and diabetes over time. ‘While the reasons underl... | | 9 | 11 | 0.385961 | 275.225 | It's a fashionable way to shed the pounds. But following the Paleo or caveman di... | | 10 | 10 | 0.390561 | 273.49 | ranges in severity from benign tumors that need little or no treatment to very a... | | 11 | 9 | 0.396736 | 272.578 | Martin . said: 'The company has always been innovative and this is an exciting .... | | 12 | 8 | 0.41942 | 272.573 | premium brand beers and spirits and wine from Provence. After that, I padded bar... | | 13 | 20 | 0.362765 | 272.413 | my sleep and I thought it was funny. 'After we had the banner made people starte... | | 14 | 15 | 0.375596 | 269.677 | such as Marks & Spencer. And two weeks ago, Harvester opened . its first motorwa... | | 15 | 18 | 0.367584 | 267.467 | out with a load of other stuff you hadn’t planned on getting because you see all... | | 16 | 3 | 0.482396 | 266.592 | By . Associated Press . PUBLISHED: . 12:53 EST, 17 March 2014 . | . UPDATED: . 1... | | 17 | 19 | 0.363044 | 266.139 | . campaign, the restaurant announced that it would pick up the bill for . any pa... | | 18 | 13 | 0.375881 | 262.35 | to have healing properties), and cream of tartar (a byproduct of winemaking that... | | 19 | 17 | 0.36777 | 257.641 | cans and description of the father of the nation by the brewery is highly condem... | | 20 | 14 | 0.375721 | 253.594 | the odds of heart disease and strokes. Warning signs include high blood pressure... |

⏱️ Time to process reranking: 3.66 seconds

📊 Reranking Insights & Observations¶

✅ Cosine Similarity (Baseline):

Fast (no reranking step).
Results purely depend on vector distance—no nuance for subtle meaning or semantic shifts.
Top chunks are ordered based on vector space proximity, sometimes capturing peripheral mentions more than core relevance.

✅ Cross-Encoder Reranking:

~0.54 seconds for reranking.
Major reshuffling of ranks! It prioritizes chunks that directly address nuanced aspects of the question (like competitive market pressures vs. health in dieting).
Cross-encoders deeply match query and chunk context, showing powerful semantic understanding.

✅ Hybrid Scoring (Cosine + Cross-Encoder):

Near-instant reranking (0 seconds).
Smooths out differences between fast vector-based ranks and slower cross-encoder ones.
Often preserves some high-scoring cosine entries (like Top 1) while improving middle-range chunks.

✅ ColBERT Reranking:

~3.66 seconds—slower because of fine-grained token-level interactions.
Significantly reorders compared to cosine (even more dramatic in mid/lower ranks!).
Useful when precision in nuance is critical (e.g., highly specialized domains).

💡 How Each Method Alters the Landscape¶

Method	Top 1 Match?	Big Shifts?	Highlights
Cosine	✔️	Minimal	Fast baseline but less nuance; relevance = pure proximity.
Cross-Encoder	❌ (reshuffle)	High (semantic)	Prioritizes interpretation of question and document text.
Hybrid	✔️	Moderate	Balances vector + semantic power, bridging both for speed & nuance.
ColBERT	❌	Large shifts	Deep contextual word-level understanding—best for subtle context.

🛠️ Other Techniques to Enhance the Retriever¶

Multi-Vector Retrieval: Using multiple models to average out biases in individual vector spaces—especially helpful when dealing with broad topics (like whiskey + dieting!)
Query Expansion: Reformulating the question to capture synonyms or related phrases—can boost recall, especially in complex debates (like tradition vs. modern health trends).
HyDE (Hypothetical Document Embeddings): Generating “ideal answer” text to improve retrieval when queries are too short or ambiguous.
Multi-hop: Iterative retrieval with follow-ups for multi-part or complex queries (like historical vs. modern impacts).
Adaptive: Dynamically choosing which technique based on query complexity.

🔍 Evaluating Retrieval Quality¶

While these reranking methods improve the semantic fit of retrieved chunks, how do we know which one is “best”?

📊 Key Metrics to Evaluate Retrieval Quality¶

1️⃣ Recall@k

What it measures: Proportion of relevant chunks retrieved in the top k results.
Why it’s important: Measures completeness — are we finding all relevant answers?
Formula:

$$ \text{Recall@k} = \frac{\text{Number of relevant retrieved chunks in top k}}{\text{Total relevant chunks}} $$

2️⃣ Precision@k

What it measures: Proportion of retrieved chunks in the top k that are actually relevant.
Why it’s important: Measures accuracy of what’s being shown to the user.
Formula:

$$ \text{Precision@k} = \frac{\text{Number of relevant retrieved chunks in top k}}{k} $$

3️⃣ Mean Reciprocal Rank (MRR)

What it measures: Focuses on the rank of the first relevant result.
Why it’s important: Shows how fast users find a relevant chunk.
Formula:

$$ \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} $$

where $\text{rank}_i$ is the rank of the first relevant result for the i-th query.

4️⃣ Normalized Discounted Cumulative Gain (nDCG)

What it measures: Accounts for relevance and position (discounting lower-ranked results).
Why it’s important: It’s nuanced—ranking relevant content earlier is better.
Formula (simplified):

$$ \text{nDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}} $$

where DCG@k = $\sum_{i=1}^k \frac{2^{\text{relevance}_i} - 1}{\log_2(i+1)}$.

5️⃣ Average Precision (AP)

What it measures: Average of precision values whenever a relevant chunk is retrieved.
Why it’s important: Balances precision and recall across all ranks.
Formula (for one query):

$$ \text{AP} = \frac{1}{\text{Total relevant}} \sum_{\text{rank where relevant}} \text{Precision@rank} $$

⚠️ What if we don’t have ground truth data?¶

If no human-annotated relevance data exists, you can still get proxy indicators:

✅ 1️⃣ Heuristics & Overlap

Compare the overlap between different retrievers’ top-k results. More agreement may hint at quality.
Example: Jaccard similarity of top-10 results.

✅ 2️⃣ LLM-based Judgments

Use an LLM (like GPT-4) to generate simulated relevance labels for queries and retrieved chunks.
Example prompt: “For this query and chunk, rate relevance on a 0–2 scale.”

✅ 3️⃣ User Feedback

Run live A/B tests: show different ranked lists to real users and measure clicks, time spent, etc.

✅ 4️⃣ Proxy Features

Average embedding similarity of top-10 chunks (closer to query = potentially better).
Distribution of scores (less steep drop-off may indicate more diverse and meaningful results).

📰 Retrieval-Augmented Generation (RAG) – Grounding with CNN/DailyMail¶

🎯 Goals¶

🧩 Why CNN/DailyMail?¶

🔬 Key Steps¶

🌍 Infrastructure & Tools¶

📊 Article Length Insights & Chunking Strategy¶

🔍 Why This Matters?¶

💡 Next Steps¶

🏗️ Step-by-Step: Set Up a Qdrant Cluster¶

1️⃣ Create a Free Qdrant Account¶

2️⃣ Create a New Cluster¶

3️⃣ Keep Your Cluster Info¶

4️⃣ Save Credentials in .env File¶

5️⃣ Use Credentials in Python¶

🏗️ Step-by-Step: Embedding and Indexing in Qdrant¶

1️⃣ Compute Chunk Embeddings¶

2️⃣ Create a Qdrant Collection¶

3️⃣ Prepare Data for Upsert (Insert)¶

4️⃣ Upload Data to Qdrant¶

🔍 Retriever Step – Finding Relevant Chunks for a Query¶

🧩 Why It Matters¶

⚙️ How It Works in Our Notebook¶

📝 Comments on Retrieved Chunks¶

🔍 Observations & Takeaways¶

🧩 Building a RAG Prompt for LLM Answer Generation¶

🔍 How the Prompt is Structured¶

⚠️ Why It Matters?¶

🧠 Pure LLM Answering – Without Retrieval¶

⚖️ Why Compare to RAG?¶

🔍 Comparison of LLM-Only vs. RAG Answers¶

✏️ Observations¶

🟢 Key Differences¶

⚠️ Implications¶

🔍 Reranking in RAG¶

1️⃣ Cross-Encoder Reranking with a transformer model¶

🔥 Reranked Chunks Table for: 'How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?'¶

2️⃣ Hybrid Scoring (Dense + BM25 Fusion)¶

🔥 Reranked Chunks Table for: 'How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?'¶

3️⃣ ColBERT Late Interaction Reranker¶

🔥 Reranked Chunks Table for: 'How do the debates around whiskey production and healthy dieting highlight tensions between tradition and modern practices?'¶

📊 Reranking Insights & Observations¶

💡 How Each Method Alters the Landscape¶

🛠️ Other Techniques to Enhance the Retriever¶

🔍 Evaluating Retrieval Quality¶

📊 Key Metrics to Evaluate Retrieval Quality¶

⚠️ What if we don’t have ground truth data?¶

4️⃣ Save Credentials in `.env` File¶