import os
from bertopic import BERTopic
from bertopic.representation import LiteLLM
# Create a custom prompt for title and summary
summary_prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the above information, please provide a two-sentence summary of the main points
Format your response as:
summary: <two sentences>
"""
title_prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the above information, please provide a concise title for this topic
Format your response as:
title: <title>
"""
# Create one representation model per aspect, each with its own custom prompt
representation_model = {
    "title": LiteLLM(model="gpt-4o-mini", prompt=title_prompt),
    "summary": LiteLLM(model="gpt-4o-mini", prompt=summary_prompt),
}
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="fit-bertopic-and-explore-topics">💬 Fit BERTopic and Explore Topics<a class="anchor-link" href="#fit-bertopic-and-explore-topics">¶</a></h2><p>Now let’s apply BERTopic to our dataset.</p>
<h3 id="how-it-works">🧩 How it Works:<a class="anchor-link" href="#how-it-works">¶</a></h3><ol>
<li><p><strong>Embeddings</strong>:</p>
<ul>
<li>With its default English settings, BERTopic uses a pre-trained sentence-transformers model (<code>all-MiniLM-L6-v2</code>) to generate a vector representation of each document.</li>
<li>This model was fine-tuned on sentence similarity tasks, so it’s well-suited to group semantically similar texts.</li>
</ul>
</li>
<li><p><strong>Dimensionality Reduction</strong>:</p>
<ul>
<li>UMAP reduces the high-dimensional embeddings to a much lower-dimensional space (5 dimensions with BERTopic’s default UMAP settings) for easier clustering.</li>
</ul>
</li>
<li><p><strong>Clustering</strong>:</p>
<ul>
<li>HDBSCAN identifies dense clusters in the reduced space, each representing a distinct topic. Documents that fit no cluster are assigned to the outlier topic <code>-1</code>.</li>
</ul>
</li>
<li><p><strong>Topic Representation</strong>:</p>
<ul>
<li>We use GPT-4o mini (<code>gpt-4o-mini</code>, called through LiteLLM) to generate a title and a summary for each topic from its most representative documents and keywords.</li>
</ul>
</li>
</ol>
<p>This means we can cluster news articles and automatically generate titles and summaries for each topic. Let’s go!</p>
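<p>To make step 4 more concrete, the sketch below shows how a prompt template with <code>[DOCUMENTS]</code> and <code>[KEYWORDS]</code> placeholders can be filled in before it is sent to the LLM. <code>fill_prompt</code> and the toy documents are hypothetical and only illustrate the idea; BERTopic performs its own substitution internally, and its exact formatting may differ.</p>

```python
# Illustrative only: `fill_prompt` is a hypothetical helper, not part of BERTopic.
def fill_prompt(template: str, documents: list[str], keywords: list[str]) -> str:
    """Substitute the [DOCUMENTS] and [KEYWORDS] placeholders in a prompt template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    prompt = template.replace("[DOCUMENTS]", doc_block)
    return prompt.replace("[KEYWORDS]", ", ".join(keywords))


template = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
    "Based on the above information, please provide a concise title for this topic"
)

prompt = fill_prompt(
    template,
    # Toy documents, invented for this example.
    documents=["toy document about a rugby match", "toy document about the six nations"],
    keywords=["england", "wales", "ireland", "rugby"],
)
print(prompt)
```

<p>Each topic gets one such filled-in prompt per representation, so the <code>title</code> and <code>summary</code> models each make their own LLM call per topic.</p>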
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [3]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="c1"># Extract the texts (BBC news articles) from the dataset</span>
<span class="n">texts</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">document</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="c1"># Fit the BERTopic model on the texts</span>
<span class="n">topics</span><span class="p">,</span> <span class="n">probs</span> <span class="o">=</span> <span class="n">topic_model</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="application/vnd.jupyter.stderr">
<pre>2025-05-12 15:07:07,574 - BERTopic - Embedding - Transforming documents to embeddings.
</pre>
</div>
</div>
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output " data-mime-type="text/plain">
<pre>Batches: 0%| | 0/70 [00:00<?, ?it/s]</pre>
</div>
</div>
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="application/vnd.jupyter.stderr">
<pre>2025-05-12 15:07:29,486 - BERTopic - Embedding - Completed ✓
2025-05-12 15:07:29,486 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 15:07:36,494 - BERTopic - Dimensionality - Completed ✓
2025-05-12 15:07:36,495 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 15:07:36,529 - BERTopic - Cluster - Completed ✓
2025-05-12 15:07:36,531 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-05-12 15:11:09,392 - BERTopic - Representation - Completed ✓
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="visualize-topic-clusters-with-llm-generated-titles">📊 Visualize Topic Clusters with LLM-Generated Titles<a class="anchor-link" href="#visualize-topic-clusters-with-llm-generated-titles">¶</a></h2><p>After fitting our BERTopic model on the corpus, the next step is to explore the <strong>semantic structure</strong> of our dataset by visualizing:</p>
<ul>
<li>How topics are distributed across the corpus.</li>
<li>What each cluster likely represents (based on LLM-generated <strong>titles</strong> and <strong>summaries</strong>).</li>
<li>How topics are related or nested hierarchically.</li>
</ul>
<hr />
<h3 id="why-custom-titles">🎯 Why Custom Titles?<a class="anchor-link" href="#why-custom-titles">¶</a></h3><p>By default, BERTopic assigns a list of top keywords to represent each topic.<br />
But this can be hard to interpret at a glance.</p>
<p>We’ve used <strong>GPT-4o mini via LiteLLM</strong> to generate:</p>
<ul>
<li>A clear <strong>title</strong> summarizing what the topic is about.</li>
<li>A brief <strong>summary</strong> of the documents within it.</li>
</ul>
<p>We now <strong>inject these titles</strong> into BERTopic using <code>set_topic_labels()</code> to replace raw keywords with human-readable, descriptive topic names in all visualizations.</p>
<hr />
<h3 id="visualizations-provided-by-bertopic">🖼️ Visualizations Provided by BERTopic:<a class="anchor-link" href="#visualizations-provided-by-bertopic">¶</a></h3><ol>
<li><p><strong><code>visualize_topics()</code></strong>:</p>
<ul>
<li>A 2D interactive scatter plot (via UMAP).</li>
<li>Each circle represents a topic; size = number of documents.</li>
<li>Hovering shows the topic <strong>title</strong> and top keywords.</li>
<li>Great for understanding the overall distribution of themes.</li>
</ul>
</li>
<li><p><strong><code>visualize_hierarchy()</code></strong>:</p>
<ul>
<li>Displays a <strong>tree diagram</strong> showing how topics relate to each other.</li>
<li>Useful to spot <strong>nested structures</strong>, <strong>overlapping themes</strong>, or <strong>potential subclusters</strong>.</li>
</ul>
</li>
</ol>
<hr />
<h3 id="how-it-works-under-the-hood">🛠️ How It Works (Under the Hood):<a class="anchor-link" href="#how-it-works-under-the-hood">¶</a></h3><ul>
<li><strong>Step 1</strong>: We extract topic titles using <code>topic_model.get_topic_info()</code>.</li>
<li><strong>Step 2</strong>: We clean them up by stripping the leading “title:” prefix from each LLM response.</li>
<li><strong>Step 3</strong>: We collect the cleaned titles in a <code>custom_labels</code> dictionary, register it with <code>set_topic_labels()</code>, and pass it to the visualizers.</li>
</ul>
<p>These interactive plots are excellent tools for <strong>exploration</strong>, <strong>storytelling</strong>, and <strong>decision-making</strong> based on topic structure.</p>
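<p>One detail of Step 2 is worth spelling out. Splitting a response on every colon and taking element <code>[1]</code> keeps only the text between the first two colons, so a title that itself contains a colon is shortened, and a response without the prefix raises an <code>IndexError</code>. The sketch below shows a more forgiving alternative; <code>strip_label_prefix</code> is a hypothetical helper and the example string is made up, not real model output.</p>

```python
def strip_label_prefix(raw: str, prefix: str) -> str:
    """Remove a leading '<prefix>:' (case-insensitive) and surrounding whitespace."""
    text = raw.strip()
    marker = f"{prefix}:"
    if text.lower().startswith(marker.lower()):
        text = text[len(marker):]
    return text.strip()


raw_title = "title: Rugby Rivalries: A Made-Up Subtitle"  # toy string for illustration

# Splitting on every colon drops everything after the second colon.
print(raw_title.split(":")[1].strip())

# Stripping only the prefix keeps the full title.
print(strip_label_prefix(raw_title, "title"))

# A response that omits the prefix is returned unchanged instead of failing.
print(strip_label_prefix("Celebrating Music Legends and Awards", "title"))
```

<p>Whether the shortened or the full title is preferable depends on the plot: shorter labels are easier to read in a crowded visualization, while full titles carry more context.</p>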
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [18]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="c1"># Get the custom titles generated by LiteLLM</span>
<span class="n">titles</span> <span class="o">=</span> <span class="n">topic_model</span><span class="o">.</span><span class="n">get_topic_info</span><span class="p">()</span>
<span class="c1"># Build a mapping from topic number to LLM title</span>
<span class="n">custom_labels</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">row</span><span class="p">[</span><span class="s2">"Topic"</span><span class="p">]:</span> <span class="n">row</span><span class="p">[</span><span class="s2">"title"</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">":"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">titles</span><span class="o">.</span><span class="n">iterrows</span><span class="p">()</span>
<span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="s2">"Topic"</span><span class="p">]</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># exclude outliers</span>
<span class="p">}</span>
<span class="n">topic_model</span><span class="o">.</span><span class="n">set_topic_labels</span><span class="p">(</span><span class="n">custom_labels</span><span class="p">)</span>
<span class="n">topic_model</span><span class="o">.</span><span class="n">visualize_topics</span><span class="p">(</span><span class="n">custom_labels</span><span class="o">=</span><span class="n">custom_labels</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [19]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="n">hierarchical_topics</span> <span class="o">=</span> <span class="n">topic_model</span><span class="o">.</span><span class="n">hierarchical_topics</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span>
<span class="n">topic_model</span><span class="o">.</span><span class="n">visualize_hierarchy</span><span class="p">(</span><span class="n">hierarchical_topics</span><span class="o">=</span><span class="n">hierarchical_topics</span><span class="p">,</span> <span class="n">custom_labels</span><span class="o">=</span><span class="n">custom_labels</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="application/vnd.jupyter.stderr">
<pre>100%|██████████| 39/39 [00:00<00:00, 543.00it/s]
</pre>
</div>
</div>
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="topic-summary-table-exploration">📋 Topic Summary Table & Exploration<a class="anchor-link" href="#topic-summary-table-exploration">¶</a></h2><p>Let’s now inspect what each topic is about.</p>
<h3 id="what-well-do">🔍 What We’ll Do:<a class="anchor-link" href="#what-well-do">¶</a></h3><ul>
<li>Print a concise <strong>summary of all topics</strong>.</li>
<li>Show the <strong>top words</strong> per topic.</li>
<li>Display the <strong>LLM-generated labels or summaries</strong>.</li>
<li>Use a utility function to <strong>explore each topic</strong> interactively.</li>
</ul>
<p>This helps us understand what themes dominate our dataset — and where we might want to dive deeper.</p>
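<p>As a rough way to see which themes dominate, the sketch below computes each topic’s share of the documents that were assigned to a topic, leaving out the outlier topic <code>-1</code>. The counts are copied from the first ten rows of the <code>get_topic_info()</code> output in this notebook, so the shares are relative to those rows only and overstate each topic’s share of the full corpus. <code>topic_shares</code> is a hypothetical helper, not a BERTopic function.</p>

```python
# Counts copied from the first ten rows of get_topic_info() in this notebook.
# Topic -1 is the outlier topic.
topic_counts = {-1: 478, 0: 186, 1: 168, 2: 147, 3: 129, 4: 86, 5: 86, 6: 64, 7: 63, 8: 59}


def topic_shares(counts: dict[int, int], outlier_id: int = -1) -> dict[int, float]:
    """Return each topic's share of the documents assigned to a non-outlier topic."""
    assigned = {topic: n for topic, n in counts.items() if topic != outlier_id}
    total = sum(assigned.values())
    if total == 0:
        return {}
    return {topic: n / total for topic, n in assigned.items()}


shares = topic_shares(topic_counts)
for topic_id, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"Topic {topic_id}: {share:.1%} of the listed non-outlier documents")
```

<p>With a fitted model, the same dictionary could be built from the <code>Topic</code> and <code>Count</code> columns of the full topic table rather than typed in by hand.</p>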
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [6]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="c1"># View the first 10 rows of the topic table (row 0 is the outlier topic -1)</span>
<span class="n">topic_info</span> <span class="o">=</span> <span class="n">topic_model</span><span class="o">.</span><span class="n">get_topic_info</span><span class="p">()</span>
<span class="n">topic_info</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child jp-OutputArea-executeResult">
<div class="jp-OutputPrompt jp-OutputArea-prompt">Out[6]:</div>
<div class="jp-RenderedHTMLCommon jp-RenderedHTML jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/html">
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Topic</th>
<th>Count</th>
<th>Name</th>
<th>Representation</th>
<th>title</th>
<th>summary</th>
<th>Representative_Docs</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1</td>
<td>478</td>
<td>-1_the_to_of_and</td>
<td>[the, to, of, and, in, said, that, it, is, for]</td>
<td>[title: Political Landscapes: Reforms, Electio...</td>
<td>[summary: Labour plans to pursue further refor...</td>
<td>[fresh hope after argentine crisis three years...</td>
</tr>
<tr>
<th>1</th>
<td>0</td>
<td>186</td>
<td>0_he_we_the_to</td>
<td>[he, we, the, to, club, but, and, chelsea, his...</td>
<td>[title: Gerrard's Future Amid Chelsea Interest...</td>
<td>[summary: Liverpool's chief executive Rick Par...</td>
<td>[mourinho plots impressive course chelsea s wi...</td>
</tr>
<tr>
<th>2</th>
<td>1</td>
<td>168</td>
<td>1_film_the_best_in</td>
<td>[film, the, best, in, and, of, for, actor, fil...</td>
<td>[title: Oscar and BAFTA Highlights: Celebratin...</td>
<td>[summary: The film "Vera Drake" is gaining rec...</td>
<td>[bafta to hand out movie honours movie stars f...</td>
</tr>
<tr>
<th>3</th>
<td>2</td>
<td>147</td>
<td>2_england_wales_ireland_the</td>
<td>[england, wales, ireland, the, rugby, and, in,...</td>
<td>[title: Rugby Rivalries: England, Ireland, and...</td>
<td>[summary: The recent events in rugby highlight...</td>
<td>[o gara revels in ireland victory ireland fly-...</td>
</tr>
<tr>
<th>4</th>
<td>3</td>
<td>129</td>
<td>3_music_band_the_album</td>
<td>[music, band, the, album, and, in, of, song, t...</td>
<td>[title: Celebrating Music Legends and Awards]</td>
<td>[summary: The music industry recently celebrat...</td>
<td>[brits debate over urban music joss stone a...</td>
</tr>
<tr>
<th>5</th>
<td>4</td>
<td>86</td>
<td>4_that_to_security_the</td>
<td>[that, to, security, the, of, and, virus, spam...</td>
<td>[title: Trends in Cybersecurity: Rising Threat...</td>
<td>[summary: The increasing concern over cybersec...</td>
<td>[rings of steel combat net attacks gambling is...</td>
</tr>
<tr>
<th>6</th>
<td>5</td>
<td>86</td>
<td>5_roddick_open_in_the</td>
<td>[roddick, open, in, the, seed, to, was, austra...</td>
<td>[title: Australian Open 2023: Highlights and K...</td>
<td>[summary: The documents highlight the performa...</td>
<td>[davenport dismantles young rival top seed lin...</td>
</tr>
<tr>
<th>7</th>
<td>6</td>
<td>64</td>
<td>6_the_to_of_said</td>
<td>[the, to, of, said, be, he, police, and, in, t...</td>
<td>[title: House Arrest for Terror Suspects: Gove...</td>
<td>[summary: The UK government is facing signific...</td>
<td>[terror suspects face house arrest uk citizens...</td>
</tr>
<tr>
<th>8</th>
<td>7</td>
<td>63</td>
<td>7_mr_labour_blair_he</td>
<td>[mr, labour, blair, he, election, brown, the, ...</td>
<td>[title: Tensions Between Blair and Brown Amid ...</td>
<td>[summary: Tensions between Tony Blair and Gord...</td>
<td>[blair and brown criticised by mps labour mps ...</td>
</tr>
<tr>
<th>9</th>
<td>8</td>
<td>59</td>
<td>8_broadband_bt_to_net</td>
<td>[broadband, bt, to, net, the, of, and, is, tv,...</td>
<td>[title: The Rise of Broadband in the UK: Growt...</td>
<td>[summary: The popularity of broadband in the U...</td>
<td>[broadband in the uk growing fast high-speed n...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [7]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">textwrap</span>
<span class="c1"># Show the summary representation of each topic</span>
<span class="k">for</span> <span class="n">topic_id</span> <span class="ow">in</span> <span class="n">topic_info</span><span class="p">[</span><span class="s1">'Topic'</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">():</span>
<span class="k">if</span> <span class="n">topic_id</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="k">continue</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="se">\n</span><span class="s2">📌 Topic </span><span class="si">{</span><span class="n">topic_id</span><span class="si">}</span><span class="s2">:"</span><span class="p">)</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">topic_model</span><span class="o">.</span><span class="n">get_topic_info</span><span class="p">(</span><span class="n">topic_id</span><span class="p">)[</span><span class="s1">'title'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">":"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="n">summary</span> <span class="o">=</span> <span class="n">topic_model</span><span class="o">.</span><span class="n">get_topic_info</span><span class="p">(</span><span class="n">topic_id</span><span class="p">)[</span><span class="s1">'summary'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">":"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Title:"</span><span class="p">,</span> <span class="n">title</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Summary:"</span><span class="p">)</span>
<span class="c1"># Wrap the text to 80 characters</span>
<span class="n">wrapped_summary</span> <span class="o">=</span> <span class="n">textwrap</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">80</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">wrapped_summary</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'-'</span> <span class="o">*</span> <span class="mi">80</span><span class="p">)</span> <span class="c1"># Shorter separator line</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>
📌 Topic 0:
Title: Gerrard's Future Amid Chelsea Interest and Liverpool's Commitment
Summary:
Liverpool's chief executive Rick Parry has vowed that the club will not sell
star player Steven Gerrard, despite ongoing interest from Chelsea. Gerrard is
committed to winning trophies with Liverpool, while Parry acknowledges that
ultimately, the decision regarding his future rests with Gerrard himself.
--------------------------------------------------------------------------------
📌 Topic 1:
Title: Oscar and BAFTA Highlights
Summary:
The film "Vera Drake" is gaining recognition with three Academy Award
nominations, including Best Actress for Imelda Staunton, while Clint Eastwood's
"Million Dollar Baby" won top honors at the Oscars, edging out Martin Scorsese's
"The Aviator." At the 2005 BAFTA Film Awards, both "The Aviator" and "Vera
Drake" received significant accolades, with "The Aviator" being named Best Film.
--------------------------------------------------------------------------------
📌 Topic 2:
Title: Rugby Rivalries
Summary:
The recent events in rugby highlight the shifting landscape of Northern
Hemisphere rugby, with Ireland achieving a significant victory over England that
keeps their Grand Slam hopes alive. As the Six Nations tournament progresses,
Wales and Ireland emerge as strong contenders, challenging the dominance of
traditional powerhouses like France and England.
--------------------------------------------------------------------------------
📌 Topic 3:
Title: Celebrating Music Legends and Awards
Summary:
The music industry recently celebrated notable events, including posthumous
Grammy awards for Ray Charles and U2's ongoing success in maintaining their
status as a leading rock band. The Brit Awards saw Scissor Sisters take home
three international category prizes, while Joss Stone's win as best British
urban act sparked discussions about the definition of urban music.
--------------------------------------------------------------------------------
📌 Topic 4:
Title: Trends in Cybersecurity
Summary:
The increasing concern over cybersecurity has led to more women and older
individuals taking proactive measures to protect their home computers from
various online threats such as viruses and malware. As cyber crime escalates,
particularly targeting online gambling sites, the number of known viruses and
phishing attempts continues to grow significantly, underscoring the urgent need
for enhanced security measures.
--------------------------------------------------------------------------------
📌 Topic 5:
Title: Australian Open 2023
Summary:
The documents highlight the performances of several prominent tennis players
during the Australian Open, with Marat Safin expressing doubts about his future
at Wimbledon and Lindsay Davenport advancing confidently in the tournament.
Additionally, the pieces discuss Roger Federer's remarkable achievement of
winning three Grand Slams in a season and Tim Henman's upcoming match against
Cyril Saulnier in the first round of the Australian Open.
--------------------------------------------------------------------------------
📌 Topic 6:
Title: House Arrest for Terror Suspects
Summary:
The UK government is facing significant opposition to its plans to impose house
arrest on terrorism suspects, with various political parties, including the
Conservatives and Liberal Democrats, expressing serious concerns. Critics argue
that these control orders could undermine legal rights and represent a form of
tyranny, especially since such measures would be applied without trial due to
insufficient evidence.
--------------------------------------------------------------------------------
📌 Topic 7:
Title: Tensions Between Blair and Brown Amid Labour Unity Concerns
Summary:
Tensions between Tony Blair and Gordon Brown are highlighted as both politicians
face criticism from Labour MPs regarding a perceived rift, with Blair dismissing
reports of plans to resign before the next general election. In the midst of
these challenges, both leaders are making appeals for unity within the party to
secure a successful campaign for a third term in power.
--------------------------------------------------------------------------------
📌 Topic 8:
Title: The Rise of Broadband in the UK
Summary:
The popularity of broadband in the UK is increasing rapidly, with BT reporting a
significant rise in new connections, marking a record quarter. Additionally, BT
is expanding its services by venturing into television content distribution over
broadband, indicating a shift towards a more integrated digital media
experience.
--------------------------------------------------------------------------------
📌 Topic 9:
Title: Highlights of Recent British Television Shows and Awards
Summary:
Germaine Greer criticized the bullying tactics in Celebrity Big Brother after
leaving the show, highlighting concerns about the psychological impact on
participants. Meanwhile, the BBC's Little Britain won accolades at the British
Comedy Awards, while ITV's I'm a Celebrity experienced a significant drop in
viewership for its finale compared to previous seasons.
--------------------------------------------------------------------------------
📌 Topic 10:
Title: London Stock Exchange Takeover Talks
Summary:
The London Stock Exchange (LSE) is currently the focus of potential takeover
bids, with Deutsche Boerse reportedly planning to increase its offer to £1.5
billion and Euronext having held discussions to possibly launch a cash bid.
Meetings between LSE executives and both Deutsche Boerse and Euronext have
indicated ongoing interest in acquiring the exchange.
--------------------------------------------------------------------------------
📌 Topic 11:
Title: Worldcom Fraud Trial
Summary:
Former WorldCom chief financial officer Scott Sullivan testified in court,
admitting to lying to the board and willingness to commit fraud to meet Wall
Street expectations, implicating his ex-boss Bernie Ebbers in the $11 billion
financial fraud. Ebbers, however, denied knowledge of the manipulated financial
statements and the pressure to engage in fraudulent activities.
--------------------------------------------------------------------------------
📌 Topic 12:
Title: Yukos Oil Company
Summary:
Yukos, the Russian oil company, is embroiled in legal battles in the U.S. as it
fights against the Russian government's moves to break it up and auction off its
key production unit, Yugansk, to settle a massive tax bill. Amid accusations of
dishonesty in court, Yukos is also suing multiple firms for $20 billion,
claiming they played a role in the state auction of its assets.
--------------------------------------------------------------------------------
📌 Topic 13:
Title: European Indoor Athletics Championships
Summary:
The recent European indoor trials showcased strong prospects for European
championship medals, particularly highlighting athletes like Jason Gardner, who
secured titles in the men's 60m events. Jade Johnson also demonstrated her
competitive edge by narrowly defeating Kelly Sotherton in the long jump, as
athletes continue to make strides in preparation for upcoming championships.
--------------------------------------------------------------------------------
📌 Topic 14:
Title: The Future of Mobile Phones
Summary:
Mobile phones continue to see significant sales growth, with over 674 million
units sold globally last year, indicating strong demand for communication
technology. However, despite efforts to integrate music download services,
mobiles are not yet equipped to fully replace portable media players, as
consumers remain hesitant to forego their dedicated devices for multimedia
functions.
--------------------------------------------------------------------------------
📌 Topic 15:
Title: The Future of Peer-to-Peer Networks and Digital Media Piracy
Summary:
Peer-to-peer (P2P) networks are anticipated to persist and potentially be
leveraged by commercial media companies, especially following the resolution of
significant legal challenges against file-sharing individuals. Additionally,
major technology firms are collaborating to create a standardized system to
prevent piracy of digital music and video, which could lead to confusion among
users due to competing formats.
--------------------------------------------------------------------------------
📌 Topic 16:
Title: The Future of Video Gaming
Summary:
Electronic Arts (EA) aims to become the largest entertainment firm globally,
seeking to compete with industry giants like Disney through innovative gaming.
Meanwhile, the video gaming industry saw significant growth in 2004, and the
development of next-generation consoles presents both opportunities and
challenges for game developers.
--------------------------------------------------------------------------------
📌 Topic 17:
Title: UK Election Budget Strategies and Tax Policies
Summary:
Tony Blair has expressed support for Chancellor Gordon Brown's pre-budget report
amid criticism over its optimistic outlook on the UK economy, while Brown has
introduced measures like increasing the threshold for stamp duty and a one-off
council tax refund to appeal to voters. Meanwhile, Conservative leader Michael
Howard has announced a £4 billion tax cut plan should his party win the upcoming
election, although details regarding the specifics of the cuts remain unclear.
--------------------------------------------------------------------------------
📌 Topic 18:
Title: Trends and Challenges in Internet Search Habits and Technologies
Summary:
Recent reports highlight the complexities of user behavior in internet searches,
revealing that while the majority of searchers successfully find what they seek,
there remains a blend of naivety and sophistication among them. Additionally,
competition between tech giants like Microsoft and Google continues to shape the
evolution of search technology, leading to more personalized and efficient user
experiences.
--------------------------------------------------------------------------------
📌 Topic 19:
Title: Kenteris and Thanou
Summary:
Greek sprinters Kostas Kenteris and Katerina Thanou faced doping allegations
from the International Association of Athletics Federations (IAAF) for missing
drug tests, leading to provisional suspensions. However, they were ultimately
cleared of the charges by an independent tribunal, which ruled on their case
amidst the ongoing controversy surrounding doping in athletics.
--------------------------------------------------------------------------------
📌 Topic 20:
Title: Pension Reforms and Minimum Wage Increases
Summary:
The government is in talks with public sector unions to avert a series of
strikes over proposed pension reforms that could affect millions of workers.
Additionally, the minimum wage is set to rise to £5.05 in October, benefiting
over 1 million people.
--------------------------------------------------------------------------------
📌 Topic 21:
Title: Fiat-GM Negotiations and Fiat's Automotive Struggles
Summary:
Fiat and General Motors (GM) faced a critical deadline on February 1 to resolve
a disagreement regarding GM's obligation to purchase a majority stake in Fiat's
struggling car division. Rather than proceeding with the buyout, GM opted to pay
Fiat €1.55 billion ($2 billion) to exit the deal, while Fiat took proactive
measures by appointing a new CEO to revitalize its auto business.
--------------------------------------------------------------------------------
📌 Topic 22:
Title: US Economic Growth and Jobs Trends Amid Rising Interest Rates
Summary:
The U.S. economy experienced growth in the third quarter, driven by strong
consumer spending and adding 157,000 jobs despite a slight dip in hiring.
Additionally, the Federal Reserve raised interest rates to 2% for the fourth
time in five months in response to economic conditions.
--------------------------------------------------------------------------------
📌 Topic 23:
Title: US Dollar and Trade Deficit
Summary:
The US trade deficit has surged to over $60 billion, driven by a decline in
exports and an increase in imports, raising concerns about the economic outlook.
Concurrently, the US dollar has experienced significant fluctuations against
major currencies like the euro and yen, affected by central bank comments and
economic data, despite attempts by officials to stabilize its value.
--------------------------------------------------------------------------------
📌 Topic 24:
Title: UK Commitment to Global Aid and Debt Relief Initiatives
Summary:
UK Prime Minister Tony Blair has indicated that the government will
significantly increase aid to tsunami-affected countries and is facing calls to
double overseas aid to combat poverty and debt in developing nations.
Additionally, G7 finance ministers have supported a plan to relieve debt for the
world's poorest countries, as Chancellor Gordon Brown sets ambitious goals for
the UK's G8 presidency.
--------------------------------------------------------------------------------
📌 Topic 25:
Title: The Evolution of Gadgets
Summary:
The documents highlight the significance of the Cell processor and its backing
by major industry players like IBM, while also discussing the popularity of
gadgets like the iPod and the acclaim of the Apple PowerBook 100 as a
revolutionary portable computer. As the holiday season approaches, there are
predictions of a gadget shortage, with consumers eager to purchase sought-after
tech devices.
--------------------------------------------------------------------------------
📌 Topic 26:
Title: Economic Impact and Recovery in Southeast Asia Post-Disaster
Summary:
Following a devastating earthquake and tsunami that resulted in significant loss
of life and economic impact in Southern Asia, major reinsurers like Munich Re
and Swiss Re have sought to reassure the market regarding potential claims.
Despite the tragedy, stock markets in Indonesia, India, and Hong Kong reached
record highs, indicating investor confidence, while Thailand cut its economic
growth forecast as the region reevaluates the cost of damages and the ongoing
economic impact on tourism and reconstruction efforts.
--------------------------------------------------------------------------------
📌 Topic 27:
Title: Kelly Holmes
Summary:
The documents discuss Kelly Holmes's journey and achievements in athletics,
particularly her remarkable performances during the Olympic year in Athens,
where she won gold medals in the 800m and 1500m events. They also reflect on her
return to form and the mixed emotions experienced by athletics fans in 2004,
emphasizing her struggle with loneliness and competitive pressures.
--------------------------------------------------------------------------------
📌 Topic 28:
Title: December Retail Sales Trends and Holiday Shopping Insights
Summary:
UK retail sales saw a positive increase in November as Christmas shopping picked
up, while US retail figures showed mixed results, with a decline in car sales
impacting overall performance. December brought a surge in US retail sales,
largely driven by strong car sales, despite some retailers struggling to boost
sales and needing to cut prices.
--------------------------------------------------------------------------------
📌 Topic 29:
Title: Immigration and Asylum Reform Plans in the UK
Summary:
Home Secretary Charles Clarke is advocating for a points-based system for
economic migrants to the UK, which includes tighter border controls and a test
to assess contributions to the country. Meanwhile, Tory leader Michael Howard
has criticized the current asylum system's costs and proposed reforms he claims
will ensure fairness for legitimate refugees.
--------------------------------------------------------------------------------
📌 Topic 30:
Title: Kilroy-Silk's Political Shift
Summary:
Robert Kilroy-Silk, the former UK Independence Party MEP and chat show host, has
resigned from UKIP, expressing his shame and labeling the party as a "joke." He
has since launched his new political party, Veritas, with the aim of
transforming British politics and has gained a defector from UKIP, Damian
Hockney, as he embarks on this new venture.
--------------------------------------------------------------------------------
📌 Topic 31:
Title: UK Housing Market Trends and Interest Rate Effects
Summary:
The Bank of England has maintained interest rates at 4.75% amidst ongoing
fluctuations in the housing market, reflecting efforts to manage consumer debt
and the housing market's stability. Recent data reveals a slight dip in UK house
prices in November, followed by marginal increases in subsequent months,
indicating volatility in property values amid changing economic conditions.
--------------------------------------------------------------------------------
📌 Topic 32:
Title: Growth and Trends in the Gadget Market at CES 2005
Summary:
The gadget market is expected to experience an 11% growth in 2005, driven by the
ongoing explosion in consumer technology, as highlighted at the Consumer
Electronics Show (CES) in Las Vegas. Thousands of technology enthusiasts and
industry experts are attending the event to explore the latest innovations and
devices that will soon be available in stores.
--------------------------------------------------------------------------------
📌 Topic 33:
Title: Launch and Impact of the Nintendo DS Handheld Console in Europe
Summary:
Nintendo's new handheld console, the DS, is set to launch in Europe on March 11
for £99 (149 euros) and features touch-screen controls, targeting the growing
mobile gaming market. With its official debut, many UK stores opened at midnight
to allow eager gamers to purchase the device, as Nintendo aims to maintain its
leadership in the gaming industry.
--------------------------------------------------------------------------------
📌 Topic 34:
Title: Eurozone Economic Growth Trends and Challenges in 2004-2005
Summary:
The economic landscape in Europe has shown mixed signals, with the European
Central Bank maintaining interest rates amid growth concerns, while consumer
spending has driven growth in France. However, Germany's economy faced a
contraction, highlighting challenges in achieving a coordinated recovery across
the Eurozone.
--------------------------------------------------------------------------------
📌 Topic 35:
Title: Reforming the House of Lords
Summary:
Former Labour leader Neil Kinnock has been made a life peer in the House of
Lords, taking the title Baron Kinnock of Bedwellty. Meanwhile, Betty Boothroyd
has advocated for the establishment of a dedicated speaker for the House of
Lords, emphasizing the need for peers to lead reforms in the upper chamber.
--------------------------------------------------------------------------------
📌 Topic 36:
Title: Growth of India's Aviation Industry
Summary:
Air Deccan, India's first low-cost airline, is rapidly expanding its fleet with
significant deals, including the acquisition of 36 planes from ATR and an $1.8
billion order for 30 Airbus A320 aircraft. Meanwhile, Boeing is striving to
reclaim its industry leadership by unveiling its new long-distance 777-200LR,
capable of flying nearly 11,000 miles non-stop.
--------------------------------------------------------------------------------
📌 Topic 37:
Title: Corporate Changes and Mergers
Summary:
Laura Ashley's chief executive Ainul Mohd-Saaid is resigning for personal
reasons, effective February 1. In the telecommunications sector, MCI's shares
have risen amidst takeover speculation, with Verizon making a $6.8 billion bid
while Qwest is also pursuing acquisition talks.
--------------------------------------------------------------------------------
📌 Topic 38:
Title: Chinese Financial and Environmental Turmoil
Summary:
South Korea's LG Card faces potential liquidation by creditors if its former
parent company does not agree to a bailout, while in China, major developments
include the Three Gorges Project refusing to halt construction despite
governmental orders and the suspension of 26 power projects over environmental
concerns. Additionally, a significant financial scandal has emerged as two
senior officials at the Bank of China allegedly disappeared amid the loss of
$120 million in funds.
--------------------------------------------------------------------------------
📌 Topic 39:
Title: Economic Performance in Singapore and Japan (2004)
Summary:
In 2004, Singapore experienced significant economic growth of 8.1%, driven by a
strong manufacturing sector, while Japan faced economic challenges, slipping
into recession due to declining industrial output and weak exports. Business
confidence in Japan also deteriorated, reflecting concerns over rising oil
prices and a strong yen impacting the economy.
--------------------------------------------------------------------------------
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="conclusion-what-we-learned-from-bertopic-llms">✅ Conclusion: What We Learned from BERTopic + LLMs<a class="anchor-link" href="#conclusion-what-we-learned-from-bertopic-llms">¶</a></h2><p>With just a few lines of code and a strong pre-trained model, we’ve:</p>
<ul>
<li>Embedded our <strong>text corpus</strong> into a rich semantic space.</li>
<li><strong>Clustered</strong> a corpus of 2,225 news articles without using any labels.</li>
<li>Used <strong>LLMs (via LiteLLM)</strong> to generate meaningful <strong>titles and summaries</strong> for each topic.</li>
<li>Built <strong>interactive visualizations</strong> and gained deep insight into the structure of our dataset.</li>
</ul>
<hr />
<h3 id="why-bertopic-is-powerful">🛠️ Why BERTopic is Powerful<a class="anchor-link" href="#why-bertopic-is-powerful">¶</a></h3><ul>
<li>🧠 <strong>Leverages BERT Knowledge</strong>: You don’t need to train anything from scratch; we reused the semantic knowledge already captured by a pre-trained sentence transformer (MiniLM).</li>
<li>⚡ <strong>Plug-and-Play LLMs</strong>: We added GPT-4o mini through LiteLLM to auto-generate human-readable titles and summaries.</li>
<li>📊 <strong>Interactive Visuals</strong>: Built-in UMAP plots and dendrograms help you <strong>explore, interpret, and explain</strong> your findings.</li>
<li>💻 <strong>Minimal Code, Maximum Insight</strong>: No need for manual preprocessing, labeling, or advanced clustering setup.</li>
</ul>
<hr />
<h3 id="limitations-and-considerations">⚠️ Limitations and Considerations<a class="anchor-link" href="#limitations-and-considerations">¶</a></h3><p>While BERTopic is great for <strong>exploratory analysis</strong>, keep in mind:</p>
<ul>
<li>🔁 <strong>Tracking Topics Over Time</strong>: Topic IDs may shift when you re-run or update the model, partly because UMAP is stochastic unless you fix its <code>random_state</code>. This makes a plain BERTopic run a poor fit for longitudinal studies unless you use techniques like dynamic topic modeling.</li>
<li>🧪 <strong>Evaluation is Hard</strong>: Topic modeling is unsupervised, so judging quality is often subjective. Titles and summaries may <strong>seem good but misrepresent</strong> the content if the LLM hallucinates; Topic 5 above, for instance, was titled “Australian Open 2023” even though the BBC articles date from 2004-2005.</li>
<li>🧠 <strong>LLM Cost &amp; Speed</strong>: Using LLMs adds inference cost and response latency, which matters when scaling to large datasets.</li>
<li>🔍 <strong>Over-Summarization Risk</strong>: With few documents and aggressive prompting, LLMs might <strong>oversimplify</strong> diverse themes in a topic.</li>
</ul>
<hr />
<h2 id="step-by-step-breakdown-how-bertopic-works">🔍 Step-by-Step Breakdown: How BERTopic Works<a class="anchor-link" href="#step-by-step-breakdown-how-bertopic-works">¶</a></h2><p>In the first half of this notebook, we focused on using BERTopic like a black box. Now it’s time to <strong>open that box</strong> and understand how it really works.</p>
<p>We'll unpack the full BERTopic pipeline into 5 key steps:</p>
<hr />
<h3 id="overview-of-the-pipeline">🧭 Overview of the Pipeline<a class="anchor-link" href="#overview-of-the-pipeline">¶</a></h3><table>
<thead>
<tr>
<th>Step</th>
<th>Component</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td>1️⃣</td>
<td><strong>Document Embedding</strong></td>
<td>Convert raw texts into dense vectors using a sentence transformer model (e.g. MiniLM).</td>
</tr>
<tr>
<td>2️⃣</td>
<td><strong>Dimensionality Reduction</strong></td>
<td>Reduce high-dimensional embeddings using <strong>UMAP</strong> so we can visualize and cluster them more efficiently.</td>
</tr>
<tr>
<td>3️⃣</td>
<td><strong>Clustering</strong></td>
<td>Group similar documents using <strong>HDBSCAN</strong>, a density-based clustering algorithm.</td>
</tr>
<tr>
<td>4️⃣</td>
<td><strong>Topic Representation</strong></td>
<td>Extract keywords per topic using class-based TF-IDF (c-TF-IDF).</td>
</tr>
<tr>
<td>5️⃣</td>
<td><strong>LLM-Based Interpretation</strong></td>
<td>Use GPT (via LiteLLM) to generate titles and summaries for each cluster.</td>
</tr>
</tbody>
</table>
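<p>To make step 4 of the table more concrete, here is a minimal, simplified sketch of class-based TF-IDF in plain Python. It assumes the commonly cited form of the score (term frequency within a topic, multiplied by <code>log(1 + A / f)</code>, where <code>A</code> is the average number of words per topic and <code>f</code> is the word's frequency across all topics). The function names <code>c_tf_idf</code> and <code>top_words</code> and the toy documents are made up for illustration; BERTopic's own implementation works on sparse matrices and differs in detail.</p>

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic id to the list of documents in that topic.
    """
    # Treat each topic as one long document and count its words.
    tf = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each word across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)

    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        scores[topic] = {
            # Normalised term frequency times the class-based IDF term.
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring words of a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical toy corpus: two hand-made "topics" (not taken from the BBC data).
toy_topics = {
    0: ["liverpool striker scores late goal",
        "goal seals liverpool win",
        "league match ends with late goal"],
    1: ["bank raises interest rates",
        "interest rates hit housing market",
        "market expects late rate cut"],
}

scores = c_tf_idf(toy_topics)
print(top_words(scores, 0, n=2))
print(top_words(scores, 1, n=3))
```

<p>Note how the word "late", which occurs in both toy topics, receives a lower score in topic 0 than "liverpool", even though both appear twice there: words shared across topics are down-weighted, which is what makes the remaining keywords topic-specific.</p>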
<hr />
<h3 id="what-well-do-next">🎯 What We’ll Do Next<a class="anchor-link" href="#what-well-do-next">¶</a></h3><p>We’ll now go through this pipeline <strong>step by step</strong>:</p>
<ul>
<li>First we’ll embed the documents.</li>
<li>Then reduce their dimensionality.</li>
<li>Followed by clustering and topic extraction.</li>
<li>Finally, we’ll regenerate topics using a c-TF-IDF representation — and compare it to the LLM-based summaries.</li>
</ul>
<h3 id="step-1-sentence-embeddings">📌 Step 1: Sentence Embeddings<a class="anchor-link" href="#step-1-sentence-embeddings">¶</a></h3><p>The first step is to turn each document (news article) into a <strong>dense vector</strong> using a pre-trained sentence transformer.</p>
<p>We’ll use the <code>all-MiniLM-L6-v2</code> model from Hugging Face — a lightweight yet strong model trained specifically for semantic similarity tasks.</p>
<h4 id="why-sentence-embeddings">Why Sentence Embeddings?<a class="anchor-link" href="#why-sentence-embeddings">¶</a></h4><ul>
<li>These embeddings capture <strong>meaning</strong>, not just surface form.</li>
<li>They place semantically similar sentences <strong>closer</strong> in vector space.</li>
<li>Unlike bag-of-words models, they can generalize across wording variations.</li>
</ul>
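<p>The last point in the list above can be illustrated with a tiny bag-of-words sketch. The helper <code>bow_overlap</code> and the example sentences are hypothetical; the point is only that two paraphrases with no words in common share no dimension in a word-count vector, so any count-based similarity between them is zero, whereas a sentence embedding model is designed to place such paraphrases close together.</p>

```python
from collections import Counter


def bow_overlap(text_a, text_b):
    """Dot product of the raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    return sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)


# Two paraphrases that share no surface words.
print(bow_overlap("shares fell sharply", "stock prices dropped steeply"))

# Two sentences that share words but differ in meaning at the end.
print(bow_overlap("shares fell sharply", "shares fell slightly"))
```

<p>The first pair scores zero despite meaning almost the same thing, while the second pair scores higher purely because of shared tokens.</p>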
<p>Let’s compute embeddings for each document in our BBC dataset.</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [8]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">sentence_transformers</span><span class="w"> </span><span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="c1"># Load the sentence transformer model</span>
<span class="n">embedding_model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s2">"all-MiniLM-L6-v2"</span><span class="p">)</span>
<span class="c1"># Convert documents into sentence embeddings</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">embedding_model</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">show_progress_bar</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Check the shape</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"✅ Shape of embeddings:"</span><span class="p">,</span> <span class="n">embeddings</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>✅ Shape of embeddings: (2225, 384)
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h4 id="example-cosine-similarity-in-embedding-space">🔍 Example: Cosine Similarity in Embedding Space<a class="anchor-link" href="#example-cosine-similarity-in-embedding-space">¶</a></h4><p>Let’s explore what these sentence embeddings really mean by computing <strong>cosine similarity</strong> between a few examples:</p>
<ul>
<li>We'll take two documents from the <strong>same topic</strong> and see how similar their embeddings are.</li>
<li>Then, we’ll compare two documents from <strong>different topics</strong> to see the contrast.</li>
</ul>
<p>The cosine similarity score ranges from <strong>-1 to 1</strong>:</p>
<ul>
<li><code>1</code> → Same direction (maximally similar)</li>
<li><code>0</code> → Orthogonal vectors (no similarity)</li>
<li><code>-1</code> → Opposite directions (rare for sentence embeddings of natural text)</li>
</ul>
<p>This helps us confirm that embeddings from the same topic cluster together in semantic space.</p>
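<p>As a quick sanity check of the range described above, here is a minimal sketch of cosine similarity on small hand-made vectors. The helper <code>cosine</code> and the 2-D vectors are illustrative assumptions, not part of the notebook; scikit-learn's <code>cosine_similarity</code> computes the same quantity for rows of a matrix.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


print(cosine([1, 0], [2, 0]))    # same direction, different length
print(cosine([1, 0], [0, 3]))    # orthogonal
print(cosine([1, 0], [-1, 0]))   # opposite direction
```

<p>Because the score depends only on direction, scaling a vector does not change it; this is why cosine similarity is a common choice for comparing embeddings whose lengths carry little meaning.</p>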
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [9]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.metrics.pairwise</span><span class="w"> </span><span class="kn">import</span> <span class="n">cosine_similarity</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span>
<span class="c1"># Select two documents from the same topic (e.g., label == 'tech')</span>
<span class="n">label_name</span> <span class="o">=</span> <span class="s2">"tech"</span>
<span class="n">same_label_docs</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"label_text"</span><span class="p">]</span> <span class="o">==</span> <span class="n">label_name</span><span class="p">]</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">same_label_docs</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Document: </span><span class="si">{</span><span class="n">doc</span><span class="p">[:</span><span class="mi">100</span><span class="p">]</span><span class="si">}</span><span class="s2">..."</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">'-'</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span>
<span class="n">same_embs</span> <span class="o">=</span> <span class="n">embedding_model</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">same_label_docs</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="c1"># Cosine similarity</span>
<span class="n">sim_same</span> <span class="o">=</span> <span class="n">cosine_similarity</span><span class="p">([</span><span class="n">same_embs</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">same_embs</span><span class="p">[</span><span class="mi">1</span><span class="p">]])[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"🧠 Same-topic docs ('</span><span class="si">{</span><span class="n">label_name</span><span class="si">}</span><span class="s2">') similarity: </span><span class="si">{</span><span class="n">sim_same</span><span class="si">:</span><span class="s2">.3f</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'-'</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span>
<span class="c1"># Select two documents from different topics</span>
<span class="n">doc1</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"label_text"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"sport"</span><span class="p">]</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">doc2</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s2">"label_text"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"business"</span><span class="p">]</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">43</span><span class="p">)</span>
<span class="n">diff_embs</span> <span class="o">=</span> <span class="n">embedding_model</span><span class="o">.</span><span class="n">encode</span><span class="p">([</span><span class="n">doc1</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">doc2</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>
<span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="p">[</span><span class="n">doc1</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">doc2</span><span class="p">[</span><span class="s2">"document"</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]]:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Document: </span><span class="si">{</span><span class="n">doc</span><span class="p">[:</span><span class="mi">100</span><span class="p">]</span><span class="si">}</span><span class="s2">..."</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">'-'</span><span class="o">*</span><span class="mi">100</span><span class="p">)</span>
<span class="n">sim_diff</span> <span class="o">=</span> <span class="n">cosine_similarity</span><span class="p">([</span><span class="n">diff_embs</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">diff_embs</span><span class="p">[</span><span class="mi">1</span><span class="p">]])[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># Print results</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"🔀 Different-topic docs similarity: </span><span class="si">{</span><span class="n">sim_diff</span><span class="si">:</span><span class="s2">.3f</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
<div id="cell-9" class="clipboard-copy-txt">from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Select two documents from the same topic (e.g., label == 'tech')
label_name = "tech"
same_label_docs = df[df["label_text"] == label_name].sample(2, random_state=1)
for doc in same_label_docs["document"]:
    print(f"Document: {doc[:100]}...")
    print('-'*100)
same_embs = embedding_model.encode(same_label_docs["document"].tolist())
# Cosine similarity
sim_same = cosine_similarity([same_embs[0]], [same_embs[1]])[0][0]
print(f"🧠 Same-topic docs ('{label_name}') similarity: {sim_same:.3f}")
print('-'*100)
# Select two documents from different topics
doc1 = df[df["label_text"] == "sport"].sample(1, random_state=42)
doc2 = df[df["label_text"] == "business"].sample(1, random_state=43)
diff_embs = embedding_model.encode([doc1["document"].values[0], doc2["document"].values[0]])
for doc in [doc1["document"].values[0], doc2["document"].values[0]]:
    print(f"Document: {doc[:100]}...")
    print('-'*100)
sim_diff = cosine_similarity([diff_embs[0]], [diff_embs[1]])[0][0]
# Print results
print(f"🔀 Different-topic docs similarity: {sim_diff:.3f}")</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>Document: dublin hi-tech labs to shut down dublin s hi-tech research laboratory media labs europe is to shut...
----------------------------------------------------------------------------------------------------
Document: google launches tv search service the net search giant google has launched a search service that let...
----------------------------------------------------------------------------------------------------
🧠 Same-topic docs ('tech') similarity: 0.182
----------------------------------------------------------------------------------------------------
Document: sa return to mauritius top seeds south africa return to the scene of one of their most embarrassing ...
----------------------------------------------------------------------------------------------------
Document: nasdaq planning $100m-share sale the owner of the technology-dominated nasdaq stock index plans to s...
----------------------------------------------------------------------------------------------------
🔀 Different-topic docs similarity: -0.028
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>In this run the same-topic pair scores 0.182 while the different-topic pair scores -0.028. Same-topic similarity is usually the higher of the two, and that gap is the foundation for clustering later.</p>
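<p>As a minimal sketch of what the similarity score measures, the snippet below computes cosine similarity by hand. The three vectors are made-up toy values rather than output of the embedding model, and <code>cosine_sim</code> is a hypothetical helper, not scikit-learn's function.</p>

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (illustrative values only)
tech_a = [0.9, 0.1, 0.0]
tech_b = [0.8, 0.3, 0.1]
sport = [0.0, 0.2, 0.9]

print(f"same topic:      {cosine_sim(tech_a, tech_b):.3f}")
print(f"different topic: {cosine_sim(tech_a, sport):.3f}")
```

<p>Vectors that point in similar directions score close to 1, unrelated ones score close to 0, and the score does not depend on vector length.</p>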
<hr />
<h3 id="step-2-dimensionality-reduction-with-pca-umap">🌐 Step 2: Dimensionality Reduction with PCA + UMAP<a class="anchor-link" href="#step-2-dimensionality-reduction-with-pca-umap">¶</a></h3><p>Now that we have high-dimensional sentence embeddings, we want to <strong>reduce their dimensionality</strong> to make them easier to cluster and visualize.</p>
<h4 id="why-reduce-dimensionality">🤔 Why Reduce Dimensionality?<a class="anchor-link" href="#why-reduce-dimensionality">¶</a></h4><ul>
<li>Sentence embeddings typically live in a <strong>384- or 768-dimensional space</strong>.</li>
<li>Clustering directly in high dimensions is often noisy and less meaningful:<ul>
<li>Distance metrics like cosine or Euclidean become unreliable due to the <strong>curse of dimensionality</strong>.</li>
</ul>
</li>
<li>Reducing to <strong>fewer dimensions</strong> (e.g. 5–50) helps create more <strong>compact and distinct clusters</strong>.</li>
</ul>
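<p>The curse of dimensionality can be illustrated with a small experiment. This is only a sketch on random Gaussian points, which stand in for real embeddings; <code>distance_contrast</code> is a hypothetical helper that measures how much farther the farthest point is than the nearest one, relative to the nearest.</p>

```python
import numpy as np


def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from one query."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, n_dims))
    query = rng.standard_normal(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# The contrast shrinks as the number of dimensions grows
for d in (2, 50, 384):
    print(f"{d:>3} dims: contrast = {distance_contrast(d):.2f}")
```

<p>When the contrast is small, "near" and "far" neighbors look almost the same to a distance-based clustering algorithm, which is one reason to reduce dimensionality first.</p>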
<h4 id="pca-principal-component-analysis">🧮 PCA (Principal Component Analysis)<a class="anchor-link" href="#pca-principal-component-analysis">¶</a></h4><ul>
<li>As a preprocessing step, we apply PCA to <strong>denoise</strong> the embeddings.</li>
<li>Keeps the most important directions of variance.</li>
</ul>
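<p>To make "most important directions of variance" concrete, here is a rough sketch of PCA built on NumPy's SVD. It is not scikit-learn's implementation, the data is synthetic, and <code>pca_reduce</code> is a hypothetical helper introduced for illustration.</p>

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


# Synthetic stand-in for embeddings: the signal lives in 3 directions
# of a 20-dimensional space, with a little isotropic noise on top.
rng = np.random.default_rng(42)
signal = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 20))
X = signal + 0.05 * rng.standard_normal((200, 20))

X_reduced, explained = pca_reduce(X, n_components=3)
print(X_reduced.shape)
print(f"variance kept by 3 components: {explained.sum():.1%}")
```

<p>Because the noise is spread thinly across all directions, keeping only the top components discards most of it while retaining nearly all of the signal.</p>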
<h4 id="umap-uniform-manifold-approximation-and-projection">🌈 UMAP (Uniform Manifold Approximation and Projection)<a class="anchor-link" href="#umap-uniform-manifold-approximation-and-projection">¶</a></h4><ul>
<li>UMAP then projects the data into an even <strong>lower-dimensional latent space</strong> (10 dimensions in the code below) that preserves local structure and much of the global structure.</li>
<li>This makes clustering more effective and robust.</li>
</ul>
<p>This algorithm is fast and efficient, which makes it a good choice for reducing the dimensionality of the data. To learn more about UMAP, read the <a href="https://umap-learn.readthedocs.io/en/latest/index.html">official documentation</a>. Fair warning: it is a tricky topic!</p>
<p>Let’s apply both, then plot the first two UMAP dimensions to get a 2D view!</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [10]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="zeroclipboard-container">
<clipboard-copy for="cell-10", aria-label="Copy to Clipboard">
<div>
<span class="notice" hidden>Copied!</span>
<svg aria-hidden="true" width="20" height="20" viewBox="0 0 16 16" version="1.1" data-view-component="true" class="clipboard-copy-icon">
<path fill="currentColor" fill-rule="evenodd" d="M0 6.75C0 5.784.784 5 1.75 5h1.5a.75.75 0 010 1.5h-1.5a.25.25 0 00-.25.25v7.5c0 .138.112.25.25.25h7.5a.25.25 0 00.25-.25v-1.5a.75.75 0 011.5 0v1.5A1.75 1.75 0 019.25 16h-7.5A1.75 1.75 0 010 14.25v-7.5z"></path>
<path fill="currentColor" fill-rule="evenodd" d="M5 1.75C5 .784 5.784 0 6.75 0h7.5C15.216 0 16 .784 16 1.75v7.5A1.75 1.75 0 0114.25 11h-7.5A1.75 1.75 0 015 9.25v-7.5zm1.75-.25a.25.25 0 00-.25.25v7.5c0 .138.112.25.25.25h7.5a.25.25 0 00.25-.25v-7.5a.25.25 0 00-.25-.25h-7.5z"></path>
</svg>
</div>
</clipboard-copy>
</div>
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.decomposition</span><span class="w"> </span><span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">umap</span><span class="w"> </span><span class="kn">import</span> <span class="n">UMAP</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">matplotlib.pyplot</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">plt</span>
<span class="c1"># Step 1: PCA to reduce noise</span>
<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">embeddings_pca</span> <span class="o">=</span> <span class="n">pca</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
<span class="c1"># Step 2: UMAP for nonlinear dimensionality reduction</span>
<span class="n">umap_model</span> <span class="o">=</span> <span class="n">UMAP</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">embeddings_umap</span> <span class="o">=</span> <span class="n">umap_model</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">embeddings_pca</span><span class="p">)</span>
<span class="n">unique_labels</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"label_text"</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
<span class="n">colors</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">tab10</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">unique_labels</span><span class="p">)))</span>
<span class="c1"># Plot 2D projection</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">unique_labels</span><span class="p">):</span>
    <span class="n">indices</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"label_text"</span><span class="p">]</span> <span class="o">==</span> <span class="n">label</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span>
        <span class="n">embeddings_umap</span><span class="p">[</span><span class="n">indices</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
        <span class="n">embeddings_umap</span><span class="p">[</span><span class="n">indices</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
        <span class="n">label</span><span class="o">=</span><span class="n">label</span><span class="p">,</span>
        <span class="n">alpha</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"UMAP Projection of Document Embeddings"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s2">"UMAP-1"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s2">"UMAP-2"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Label"</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="s2">"best"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<div id="cell-10" class="clipboard-copy-txt">from sklearn.decomposition import PCA
from umap import UMAP
import matplotlib.pyplot as plt
# Step 1: PCA to reduce noise
pca = PCA(n_components=50, random_state=42)
embeddings_pca = pca.fit_transform(embeddings)
# Step 2: UMAP for nonlinear dimensionality reduction
umap_model = UMAP(n_components=10, random_state=42)
embeddings_umap = umap_model.fit_transform(embeddings_pca)
unique_labels = df["label_text"].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))
# Plot 2D projection
plt.figure(figsize=(10, 6))
for i, label in enumerate(unique_labels):
    indices = df["label_text"] == label
    plt.scatter(
        embeddings_umap[indices, 0],
        embeddings_umap[indices, 1],
        label=label,
        alpha=0.6,
        color=colors[i]
    )
plt.title("UMAP Projection of Document Embeddings")
plt.xlabel("UMAP-1")
plt.ylabel("UMAP-2")
plt.legend(title="Label", loc="best")
plt.show()</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedImage jp-OutputArea-output ">
<img src=""
class="
"
>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<p>This UMAP projection shows a very promising representation of document embeddings — here's a breakdown of what we observe:</p>
<h3 id="what-we-see">✅ <strong>What We See</strong><a class="anchor-link" href="#what-we-see">¶</a></h3><ul>
<li><p><strong>Clear Clustering by Label</strong>:</p>
<ul>
<li>Each group of points is easily distinguishable, even if the groups are not perfectly separated, which suggests the model has learned <strong>semantic distinctions</strong> between topics.</li>
<li>This is especially impressive considering we’re using <strong>zero fine-tuning</strong> — only the raw semantic embeddings.</li>
</ul>
</li>
</ul>
<h3 id="why-this-matters">🔍 Why This Matters<a class="anchor-link" href="#why-this-matters">¶</a></h3><ul>
<li>This visualization confirms that <strong>semantic embeddings from pretrained transformers</strong> already contain <strong>enough structure</strong> to group documents by topic.</li>
<li>This is what allows BERTopic and other clustering tools to operate effectively with little or no supervision.</li>
</ul>
<hr />
<h3 id="step-3-clustering-with-hdbscan">📊 Step 3: Clustering with HDBSCAN<a class="anchor-link" href="#step-3-clustering-with-hdbscan">¶</a></h3><p>Now that we have semantically meaningful 10-dimensional representations (via PCA + UMAP), it's time to <strong>identify topic clusters</strong> using an unsupervised clustering algorithm.</p>
<h4 id="why-hdbscan">🤖 Why HDBSCAN?<a class="anchor-link" href="#why-hdbscan">¶</a></h4><p>HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is especially well-suited for topic modeling:</p>
<ul>
<li>✅ <strong>No need to pre-specify the number of clusters</strong>.</li>
<li>✅ Can find <strong>clusters of varying shapes and densities</strong>.</li>
<li>✅ Automatically labels low-confidence documents as <strong>noise</strong> (topic -1).</li>
<li>✅ Scales well with real-world text data.</li>
</ul>
<p>This makes HDBSCAN much more flexible than k-means or DBSCAN when working with language embeddings. To learn more about HDBSCAN, read the <a href="https://hdbscan.readthedocs.io/en/latest/index.html">official documentation</a>.</p>
<hr />
<h4 id="goal">🧠 Goal<a class="anchor-link" href="#goal">¶</a></h4><p>Group together documents that are semantically similar — using the <strong>dense embedding space</strong> created from MiniLM → PCA → UMAP.</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [11]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="zeroclipboard-container">
<clipboard-copy for="cell-11", aria-label="Copy to Clipboard">
<div>
<span class="notice" hidden>Copied!</span>
<svg aria-hidden="true" width="20" height="20" viewBox="0 0 16 16" version="1.1" data-view-component="true" class="clipboard-copy-icon">
<path fill="currentColor" fill-rule="evenodd" d="M0 6.75C0 5.784.784 5 1.75 5h1.5a.75.75 0 010 1.5h-1.5a.25.25 0 00-.25.25v7.5c0 .138.112.25.25.25h7.5a.25.25 0 00.25-.25v-1.5a.75.75 0 011.5 0v1.5A1.75 1.75 0 019.25 16h-7.5A1.75 1.75 0 010 14.25v-7.5z"></path>
<path fill="currentColor" fill-rule="evenodd" d="M5 1.75C5 .784 5.784 0 6.75 0h7.5C15.216 0 16 .784 16 1.75v7.5A1.75 1.75 0 0114.25 11h-7.5A1.75 1.75 0 015 9.25v-7.5zm1.75-.25a.25.25 0 00-.25.25v7.5c0 .138.112.25.25.25h7.5a.25.25 0 00.25-.25v-7.5a.25.25 0 00-.25-.25h-7.5z"></path>
</svg>
</div>
</clipboard-copy>
</div>
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">hdbscan</span>
<span class="c1"># Fit HDBSCAN on the reduced embeddings</span>
<span class="n">hdb</span> <span class="o">=</span> <span class="n">hdbscan</span><span class="o">.</span><span class="n">HDBSCAN</span><span class="p">(</span><span class="n">min_cluster_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">prediction_data</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clusters</span> <span class="o">=</span> <span class="n">hdb</span><span class="o">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">embeddings_umap</span><span class="p">)</span>
<span class="c1"># Attach the cluster labels to our original DataFrame</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"cluster"</span><span class="p">]</span> <span class="o">=</span> <span class="n">clusters</span>
<span class="c1"># Quick count of discovered clusters</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"🔍 HDBSCAN found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">clusters</span><span class="p">))</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="w"> </span><span class="ow">in</span><span class="w"> </span><span class="n">clusters</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="si">}</span><span class="s2"> clusters."</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s2">"cluster"</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">())</span>
</pre></div>
<div id="cell-11" class="clipboard-copy-txt">import hdbscan
# Fit HDBSCAN on the reduced embeddings
hdb = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True)
clusters = hdb.fit_predict(embeddings_umap)
# Attach the cluster labels to our original DataFrame
df["cluster"] = clusters
# Quick count of discovered clusters
print(f"🔍 HDBSCAN found {len(set(clusters)) - (1 if -1 in clusters else 0)} clusters.")
print(df["cluster"].value_counts())</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>🔍 HDBSCAN found 7 clusters.
cluster
0 914
1 792
5 186
6 136
4 87
3 67
2 30
-1 13
Name: count, dtype: int64
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h4 id="quick-look-at-clustering-results">🧭 Quick Look at Clustering Results<a class="anchor-link" href="#quick-look-at-clustering-results">¶</a></h4><p>HDBSCAN has just grouped our documents into <strong>semantic clusters</strong> based on their embeddings.</p>
<p>Here’s what we found:</p>
<ul>
<li>🧠 <strong>7 distinct clusters</strong> (excluding noise).</li>
<li>🚫 <strong>13 documents</strong> were marked as noise (<code>cluster = -1</code>) — they didn’t fit confidently into any cluster.</li>
<li>📊 <strong>Cluster 0</strong> and <strong>Cluster 1</strong> dominate the dataset, with 914 and 792 documents respectively.</li>
</ul>
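<p>These figures can be turned into shares of the corpus with a few lines of standard-library Python. The sketch below hard-codes the cluster sizes printed by <code>value_counts()</code> above; the variable names are illustrative.</p>

```python
from collections import Counter

# Cluster sizes copied from the value_counts() output above (-1 is noise)
cluster_sizes = Counter(
    {0: 914, 1: 792, 5: 186, 6: 136, 4: 87, 3: 67, 2: 30, -1: 13}
)

total = sum(cluster_sizes.values())
n_clusters = sum(1 for label in cluster_sizes if label != -1)
noise_share = cluster_sizes[-1] / total
top_two_share = (cluster_sizes[0] + cluster_sizes[1]) / total

print(f"{n_clusters} clusters over {total} documents")
print(f"noise: {noise_share:.1%}, two largest clusters: {top_two_share:.1%}")
```

<p>In this run, less than 1% of the documents are treated as noise, while the two largest clusters hold roughly three quarters of the corpus.</p>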
<p>Let’s visualize how these clusters are distributed in the embedding space!</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [12]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="zeroclipboard-container">
<clipboard-copy for="cell-12", aria-label="Copy to Clipboard">
<div>
<span class="notice" hidden>Copied!</span>
<svg aria-hidden="true" width="20" height="20" viewBox="0 0 16 16" version="1.1" data-view-component="true" class="clipboard-copy-icon">
<path fill="currentColor" fill-rule="evenodd" d="M0 6.75C0 5.784.784 5 1.75 5h1.5a.75.75 0 010 1.5h-1.5a.25.25 0 00-.25.25v7.5c0 .138.112.25.25.25h7.5a.25.25 0 00.25-.25v-1.5a.75.75 0 011.5 0v1.5A1.75 1.75 0 019.25 16h-7.5A1.75 1.75 0 010 14.25v-7.5z"></path>
<path fill="currentColor" fill-rule="evenodd" d="M5 1.75C5 .784 5.784 0 6.75 0h7.5C15.216 0 16 .784 16 1.75v7.5A1.75 1.75 0 0114.25 11h-7.5A1.75 1.75 0 015 9.25v-7.5zm1.75-.25a.25.25 0 00-.25.25v7.5c0 .138.112.25.25.25h7.5a.25.25 0 00.25-.25v-7.5a.25.25 0 00-.25-.25h-7.5z"></path>
</svg>
</div>
</clipboard-copy>
</div>
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">seaborn</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">sns</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">matplotlib.pyplot</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">plt</span>
<span class="c1"># Visualize discovered clusters</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">scatterplot</span><span class="p">(</span>
<span class="n">x</span><span class="o">=</span><span class="n">embeddings_umap</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span>
<span class="n">y</span><span class="o">=</span><span class="n">embeddings_umap</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">hue</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s2">"cluster"</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">),</span>
<span class="n">palette</span><span class="o">=</span><span class="s2">"tab10"</span><span class="p">,</span>
<span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
<span class="n">legend</span><span class="o">=</span><span class="s2">"full"</span>
<span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"UMAP Projection Colored by HDBSCAN Clusters"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s2">"UMAP-1"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s2">"UMAP-2"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Cluster"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<div id="cell-12" class="clipboard-copy-txt">import seaborn as sns
import matplotlib.pyplot as plt
# Visualize discovered clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(
x=embeddings_umap[:, 0],
y=embeddings_umap[:, 1],
hue=df["cluster"].astype(str),
palette="tab10",
alpha=0.7,
legend="full"
)
plt.title("UMAP Projection Colored by HDBSCAN Clusters")
plt.xlabel("UMAP-1")
plt.ylabel("UMAP-2")
plt.legend(title="Cluster")
plt.show()
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedImage jp-OutputArea-output ">
<img alt="Scatter plot: UMAP Projection Colored by HDBSCAN Clusters (x: UMAP-1, y: UMAP-2)" src="" class="">
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h4 id="reflecting-on-the-number-of-clusters">🪞 Reflecting on the Number of Clusters<a class="anchor-link" href="#reflecting-on-the-number-of-clusters">¶</a></h4><p>Our clustering algorithm (HDBSCAN) identified <strong>7 main clusters</strong> in the semantic space, a relatively low number considering the diversity of the BBC news dataset. The plot also shows that the clusters are not perfectly separated when the whole dataset is viewed at once.</p>
<p>This can be both a strength and a limitation:</p>
<ul>
<li>✅ <strong>Strength</strong>: These clusters represent <strong>broad, high-level themes</strong>, giving a clean summary of the corpus.</li>
<li>⚠️ <strong>Limitation</strong>: If we want <strong>finer-grained insights</strong>, like subtopics or emerging trends, we may need to <strong>dive deeper</strong>.</li>
</ul>
<p>👉 That’s one of the main strengths of the full <strong>BERTopic pipeline</strong>:</p>
<ul>
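<li>Its default parameters are set to strike a good balance between granularity and separation.</li>
<li>To get more clusters, we can lower <code>min_cluster_size</code> in HDBSCAN so that smaller groups can form their own cluster. Adjusting the UMAP settings (for example, a lower <code>n_neighbors</code> to emphasize local structure) can also change the granularity.</li>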
</ul>
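<p>Before tuning either parameter, it can help to check how the documents are spread across clusters and how many were left as outliers. The sketch below is a minimal, standard-library illustration; <code>summarize_clusters</code> and <code>example_labels</code> are hypothetical names, and the labels stand in for the <code>clusters</code> array produced by HDBSCAN (where <code>-1</code> marks outliers).</p>

```python
from collections import Counter


def summarize_clusters(labels):
    """Return (number of clusters, outlier share, size per cluster) for HDBSCAN-style labels."""
    labels = [int(label) for label in labels]
    counts = Counter(labels)
    n_outliers = counts.pop(-1, 0)  # HDBSCAN marks outliers with the label -1
    outlier_share = n_outliers / len(labels) if labels else 0.0
    return len(counts), outlier_share, dict(sorted(counts.items()))


# Hypothetical labels, standing in for the `clusters` array from the notebook
example_labels = [0, 0, 1, -1, 1, 1, 2, -1]
n_clusters, outlier_share, sizes = summarize_clusters(example_labels)
print(n_clusters, outlier_share, sizes)
```

<p>A high outlier share or a few very large clusters is usually a hint that a smaller <code>min_cluster_size</code> is worth trying.</p>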
<p>Now, let’s interpret what each of our current clusters means by <strong>extracting representative keywords</strong> using the <strong>c-TF-IDF</strong> technique — a cornerstone of BERTopic's representation strategy.</p>
<hr />
<h3 id="step-4-topic-representation-using-c-tf-idf">🧠 Step 4: Topic Representation using c-TF-IDF<a class="anchor-link" href="#step-4-topic-representation-using-c-tf-idf">¶</a></h3><p>Once documents are grouped into clusters, the next challenge is:<br />
➡️ <strong>How do we describe each topic?</strong></p>
<p>Instead of simply counting word frequency, BERTopic uses a smarter approach called <strong>class-based TF-IDF (c-TF-IDF)</strong>:</p>
<hr />
<h4 id="what-is-c-tf-idf">🔍 What is c-TF-IDF?<a class="anchor-link" href="#what-is-c-tf-idf">¶</a></h4><p>c-TF-IDF adapts the traditional TF-IDF logic to topic modeling:</p>
<ul>
<li>Imagine each topic is one big document (by concatenating all its documents).</li>
<li>Then, compute the TF-IDF score of words within that “topic-document” relative to all others.</li>
<li>This highlights <strong>keywords that are specific to one topic</strong>, not just common overall.</li>
</ul>
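<p>The three steps above can be sketched in plain Python. This is a simplified illustration, assuming the formulation given in the BERTopic documentation (term frequency within the topic, multiplied by <code>log(1 + A / f)</code>, where <code>A</code> is the average number of words per topic and <code>f</code> is the word's frequency across all topics). The function name <code>c_tf_idf</code>, the whitespace tokenizer, and the toy data are assumptions for illustration; the library's <code>ClassTfidfTransformer</code> may differ in normalization details.</p>

```python
import math
from collections import Counter


def c_tf_idf(topic_docs):
    """topic_docs maps a topic to its pseudo-document; returns topic -> {word: score}."""
    # Term counts inside each "topic-document" (naive whitespace tokenization)
    term_counts = {t: Counter(text.lower().split()) for t, text in topic_docs.items()}

    # Frequency of each word across all topics
    total_per_word = Counter()
    for counts in term_counts.values():
        total_per_word.update(counts)

    # A: average number of words per topic
    avg_words = sum(total_per_word.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total_per_word[word])
            for word, count in counts.items()
        }
    return scores


# Toy pseudo-documents (hypothetical data)
toy = {
    "sport": "match goal match team win",
    "politics": "election vote party win election",
}
scores = c_tf_idf(toy)
top = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top)
```

<p>In this toy example, <code>win</code> appears in both topics, so it scores lower than a word of equal frequency that is specific to one topic.</p>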
<hr />
<h4 id="why-use-c-tf-idf">📌 Why Use c-TF-IDF?<a class="anchor-link" href="#why-use-c-tf-idf">¶</a></h4><ul>
<li>✅ It captures <strong>topic-specific terminology</strong>.</li>
<li>✅ It works well even for <strong>short texts</strong> (like news headlines).</li>
<li>✅ It’s <strong>unsupervised</strong> and leverages your cluster labels to build meaning.</li>
</ul>
<p>Now let’s compute the top terms per topic using this technique and look at what defines each cluster semantically.</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [13]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">bertopic.vectorizers</span><span class="w"> </span><span class="kn">import</span> <span class="n">ClassTfidfTransformer</span>
<span class="c1"># Step 1: Create pseudo-documents per cluster</span>
<span class="n">docs_per_topic</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">topic</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">clusters</span><span class="p">)):</span>  <span class="c1"># sorted so that matrix row i corresponds to topic i</span>
    <span class="k">if</span> <span class="n">topic</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="k">continue</span> <span class="c1"># Skip outliers</span>
    <span class="n">topic_docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">doc</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s2">"document"</span><span class="p">],</span> <span class="n">clusters</span><span class="p">)</span> <span class="k">if</span> <span class="n">label</span> <span class="o">==</span> <span class="n">topic</span><span class="p">]</span>
    <span class="n">docs_per_topic</span><span class="p">[</span><span class="n">topic</span><span class="p">]</span> <span class="o">=</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">topic_docs</span><span class="p">)</span>
<span class="c1"># Step 2: Extract raw vocabulary</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.feature_extraction.text</span><span class="w"> </span><span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="n">vectorizer</span> <span class="o">=</span> <span class="n">CountVectorizer</span><span class="p">(</span><span class="n">stop_words</span><span class="o">=</span><span class="s2">"english"</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">docs_per_topic</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="c1"># Step 3: Compute c-TF-IDF</span>
<span class="n">ctfidf</span> <span class="o">=</span> <span class="n">ClassTfidfTransformer</span><span class="p">()</span>
<span class="c1"># Fit on the per-topic count matrix only (one row per topic)</span>
<span class="n">X_ctfidf</span> <span class="o">=</span> <span class="n">ctfidf</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c1"># Step 4: Get top terms per topic</span>
<span class="k">def</span><span class="w"> </span><span class="nf">get_top_n_words_per_topic</span><span class="p">(</span><span class="n">X_ctfidf</span><span class="p">,</span> <span class="n">vectorizer</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">words</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">get_feature_names_out</span><span class="p">()</span>
    <span class="n">top_words</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">topic_idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X_ctfidf</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
        <span class="n">row</span> <span class="o">=</span> <span class="n">X_ctfidf</span><span class="p">[</span><span class="n">topic_idx</span><span class="p">]</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span><span class="o">.</span><span class="n">flatten</span><span class="p">()</span>
        <span class="n">top_n</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">argsort</span><span class="p">()[</span><span class="o">-</span><span class="n">n</span><span class="p">:][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">top_words</span><span class="p">[</span><span class="n">topic_idx</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">top_n</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">top_words</span>
<span class="n">top_words</span> <span class="o">=</span> <span class="n">get_top_n_words_per_topic</span><span class="p">(</span><span class="n">X_ctfidf</span><span class="p">,</span> <span class="n">vectorizer</span><span class="p">)</span>
<span class="k">for</span> <span class="n">topic</span><span class="p">,</span> <span class="n">words</span> <span class="ow">in</span> <span class="n">top_words</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"🧩 Topic </span><span class="si">{</span><span class="n">topic</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="s1">', '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">words</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>🧩 Topic 0: said, mr, government, labour, year, election, new, people, blair, party
🧩 Topic 1: said, people, film, music, new, best, year, mobile, tv, technology
🧩 Topic 2: kenteris, iaaf, doping, drugs, thanou, greek, conte, athletes, test, athens
🧩 Topic 3: race, olympic, indoor, world, champion, holmes, championships, year, european, athens
🧩 Topic 4: open, roddick, seed, set, australian, win, match, final, nadal, tennis
🧩 Topic 5: club, chelsea, united, arsenal, liverpool, league, said, game, manager, football
🧩 Topic 6: england, wales, ireland, rugby, france, nations, game, half, coach, robinson
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h3 id="step-5-naming-topics-with-keywords-or-llm">🏷️ Step 5: Naming Topics with Keywords or LLM<a class="anchor-link" href="#step-5-naming-topics-with-keywords-or-llm">¶</a></h3><p>Now that we’ve extracted the <strong>top keywords per topic</strong> using c-TF-IDF (the representation used in the first version of BERTopic), the next step is to give each cluster a human-readable name. In BERTopic this is handled by the <code>representation_model</code> parameter, and LLMs make it especially effective.</p>
<hr />
<h4 id="manual-naming-basic-approach">✏️ Manual Naming (Basic Approach)<a class="anchor-link" href="#manual-naming-basic-approach">¶</a></h4><p>You could simply:</p>
<ul>
<li>Look at the <strong>top 5–10 keywords</strong> per topic.</li>
<li>Assign a <strong>human-readable label</strong> based on those terms.</li>
</ul>
<blockquote>
<p>For example, a cluster with words like <code>economy</code>, <code>stocks</code>, <code>market</code>, <code>inflation</code> → likely represents a topic about <strong>Business & Finance</strong>.</p>
</blockquote>
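<p>As a rough baseline before involving an LLM, the manual approach can be approximated by joining the first few informative keywords. This is only a sketch: <code>keyword_label</code> is a hypothetical helper, and the <code>ignore</code> list of generic words is an assumption based on terms such as <code>said</code> and <code>year</code> that show up in several topics above.</p>

```python
def keyword_label(keywords, n=3, ignore=("said", "mr", "year", "new", "people")):
    """Build a simple label from the first n keywords that are not in the ignore list."""
    kept = [word for word in keywords if word not in ignore][:n]
    return " / ".join(word.capitalize() for word in kept) if kept else "Unlabeled"


# Keywords taken from the c-TF-IDF output above
print(keyword_label(["club", "chelsea", "united", "arsenal"]))
print(keyword_label(["said", "mr", "government", "labour", "year", "election"]))
```

<p>Labels built this way are crude, but they give a quick sanity check against the titles an LLM proposes.</p>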
<hr />
<h4 id="llm-based-naming-optional-smart-approach">🤖 LLM-Based Naming (Optional, Smart Approach)<a class="anchor-link" href="#llm-based-naming-optional-smart-approach">¶</a></h4><p>We can go further and use an LLM (like GPT-4 or Claude) to:</p>
<ul>
<li>Read the keywords + a few sample docs.</li>
<li>Generate a <strong>short, descriptive title</strong> for each cluster.</li>
</ul>
<p>This is what BERTopic does under the hood with <code>representation_model=LiteLLM(...)</code>.</p>
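<p>When the prompt asks for a <code>title: ...</code> format, the reply still has to be parsed, and models do not always follow the format exactly (different capitalization, surrounding quotes). The helper below is a small sketch of one tolerant way to do this; <code>parse_title</code> is a hypothetical name and not part of BERTopic or LiteLLM.</p>

```python
import re


def parse_title(response_text, default="Unknown Title"):
    """Return the text after the last 'title:' marker, ignoring case and wrapping quotes."""
    parts = re.split(r"title\s*:", response_text, flags=re.IGNORECASE)
    if len(parts) == 1:
        return default  # marker not found
    title = parts[-1].strip().strip("\"'").strip()
    return title or default


print(parse_title('Title: "Rugby Nations"'))
print(parse_title("I could not decide."))
```

<p>Falling back to a default keeps the loop running even when a single reply is malformed.</p>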
<hr />
<p>Let’s now name our topics automatically by sending the top keywords and a few sample documents from each cluster to an LLM.</p>
</div>
</div>
</div>
</div><div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell jp-CodeCell jp-Notebook-cell ">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea">
<div class="jp-InputPrompt jp-InputArea-prompt">In [15]:</div><div class="jp-CodeMirrorEditor jp-Editor jp-InputArea-editor" data-type="inline">
<div class="CodeMirror cm-s-jupyter">
<div class="highlight-ipynb hl-python "><pre><span></span><span class="kn">from</span><span class="w"> </span><span class="nn">litellm</span><span class="w"> </span><span class="kn">import</span> <span class="n">completion</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">time</span>
<span class="c1"># Generate a title for each topic</span>
<span class="n">topic_titles</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">topic_id</span><span class="p">,</span> <span class="n">keywords</span> <span class="ow">in</span> <span class="n">top_words</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="c1"># Take up to 5 example docs from this cluster (each truncated to 500 characters to keep the prompt short)</span>
    <span class="n">sample_docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">doc</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s2">"document"</span><span class="p">],</span> <span class="n">clusters</span><span class="p">)</span> <span class="k">if</span> <span class="n">label</span> <span class="o">==</span> <span class="n">topic_id</span><span class="p">][:</span><span class="mi">5</span><span class="p">]</span>
    <span class="n">docs</span> <span class="o">=</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">doc</span><span class="p">[:</span><span class="mi">500</span><span class="p">]</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">sample_docs</span><span class="p">)</span>
    <span class="c1"># Create a simple prompt with the documents and keywords</span>
    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"""</span>
<span class="s2">    I have a topic that contains the following documents:</span>
<span class="s2">    </span><span class="si">{</span><span class="n">docs</span><span class="si">}</span>
<span class="s2">    </span>
<span class="s2">    The topic is described by the following keywords: </span><span class="si">{</span><span class="n">keywords</span><span class="si">}</span>
<span class="s2">    </span>
<span class="s2">    Based on the above information, please provide a concise title for this topic.</span>
<span class="s2">    </span>
<span class="s2">    Format:</span>
<span class="s2">    title: <title></span>
<span class="s2">    """</span>
    <span class="c1"># Call gpt-4o-mini directly</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">completion</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="s2">"gpt-4o-mini"</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s2">"role"</span><span class="p">:</span> <span class="s2">"user"</span><span class="p">,</span> <span class="s2">"content"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}]</span>
    <span class="p">)</span>
    <span class="c1"># Extract the title (case-insensitive search for the last "title:" marker)</span>
    <span class="n">response_text</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span>
    <span class="n">marker</span> <span class="o">=</span> <span class="n">response_text</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">rfind</span><span class="p">(</span><span class="s2">"title:"</span><span class="p">)</span>
    <span class="n">title</span> <span class="o">=</span> <span class="n">response_text</span><span class="p">[</span><span class="n">marker</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="s2">"title:"</span><span class="p">):]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">if</span> <span class="n">marker</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="s2">"Unknown Title"</span>
    <span class="n">topic_titles</span><span class="p">[</span><span class="n">topic_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">title</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"📌 Topic </span><span class="si">{</span><span class="n">topic_id</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">title</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="c1"># Add delay to avoid rate limits</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="jp-Cell-outputWrapper">
<div class="jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser">
</div>
<div class="jp-OutputArea jp-Cell-outputArea">
<div class="jp-OutputArea-child">
<div class="jp-OutputPrompt jp-OutputArea-prompt"></div>
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>📌 Topic 0: China and the Impact of Government and Elections
📌 Topic 1: "Current Trends in Music and Film Technology"
📌 Topic 2: Greek Athletes and Doping Controversies in Athens
📌 Topic 3: Lewis Holmes: European Indoor World Champion in Athletics
📌 Topic 4: "Australian Open Final: Nadal's Match Victory"
📌 Topic 5: Premier League Football Clubs: Chelsea, United, Arsenal, and Liverpool Insights
📌 Topic 6: "Rugby Nations: Wales and its Place Amongst England, Ireland, and France"
</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-InputPrompt jp-InputArea-prompt">
</div><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
<h2 id="conclusion-bertopic-custom-topic-modeling-uncovered">✅ Conclusion: BERTopic & Custom Topic Modeling Uncovered<a class="anchor-link" href="#conclusion-bertopic-custom-topic-modeling-uncovered">¶</a></h2><p>In this notebook, we explored the BERTopic approach to topic modeling in two ways:</p>
<hr />
<h3 id="1-using-bertopic-llms-litellm-gpt-4o">🔍 1. Using BERTopic + LLMs (LiteLLM + GPT-4o)<a class="anchor-link" href="#1-using-bertopic-llms-litellm-gpt-4o">¶</a></h3><p>We began by leveraging <a href="https://github.com/MaartenGr/BERTopic"><strong>BERTopic</strong></a>, a high-level, production-ready tool that:</p>
<ul>
<li>Uses <strong>sentence embeddings</strong> and <strong>density-based clustering</strong> (UMAP + HDBSCAN).</li>
<li>Allows easy interpretation of topics through <strong>keyword extraction</strong> and <strong>LLM-generated titles/summaries</strong> via LiteLLM.</li>
</ul>
<p>It enabled us to quickly and meaningfully explore the <strong>BBC News dataset</strong>, generating high-quality clusters and intuitive representations with minimal effort.</p>
<hr />
<h3 id="2-rebuilding-the-pipeline-from-scratch">🔬 2. Rebuilding the Pipeline from Scratch<a class="anchor-link" href="#2-rebuilding-the-pipeline-from-scratch">¶</a></h3><p>Then, we took a deep dive to <strong>understand what BERTopic does under the hood</strong>, step by step:</p>
<ol>
<li>Computed <strong>MiniLM embeddings</strong> of documents.</li>
<li>Applied <strong>PCA + UMAP</strong> for dimensionality reduction.</li>
<li>Used <strong>HDBSCAN</strong> for clustering documents into coherent groups.</li>
<li>Extracted <strong>keywords per cluster</strong> using <code>c-TF-IDF</code>.</li>
<li>Used <strong>gpt-4o-mini via LiteLLM</strong> to generate human-readable <strong>titles</strong>.</li>
</ol>
<hr />
<h3 id="why-it-matters">🧠 Why It Matters<a class="anchor-link" href="#why-it-matters">¶</a></h3><p>This dual approach gave us:</p>
<ul>
<li>A powerful plug-and-play experience with BERTopic.</li>
<li>A deeper understanding of how topic modeling pipelines work, and how we can customize or extend them for our own needs.</li>
</ul>
<p>From quick exploration to full interpretability, <strong>topic modeling + LLMs</strong> form a great toolkit for real-world text mining.</p>
<hr />
<p>Whether you want speed, control, or insight — this notebook sets the foundation for all three.</p>
</div>
</div>
</div>
</div>
</div> <!-- jp-Notebook -->
</div> <!-- jupyter-wrapper -->