“Make it work, make it right, make it fast.”
– Kent Beck
As I break into the next phase of my news anomaly-detection side project, I figured it would be useful to publicly summarize what I’ve done so far, and what will be implemented next!
The Current Process:
- Collect RSS headlines from a variety of sources
- Every hour, send the last 6h of data to an LLM
- Store this summary in the database and show the result to visitors
While this has been decently reliable, it cannot scale, doesn’t retain a sense of overarching narratives over time, and is only useful for getting a 1000ft view of what is happening globally. The system can only return information from the LLM’s training data and the immediate news headlines provided to it.
I’ve been hands-off since March while finishing up a project at work, but with any luck I’ll be able to implement the improvements discussed below soon.
Submitting the Past 6h of Headlines¶
For two months, the following prompt has been reliably provided to Gemini1 every hour along with the past six hours of news headlines from around the globe:
Summarize the 5 most important topics below with a focus on stories and issues with the greatest political, economic, or global impact. Never report celebrity news or gossip. Order these 5 topics by urgency/importance. Use specific examples of people, countries, companies, actions, or orders when identifying trends. Be extremely brief with the title and summary - try to limit the text within the summary string to less than 800 characters if possible while remaining accurate. Also include a list of ‘key topics’ - a list of the people, countries, and topics discussed in the articles provided. Always ensure that accurate and brief information is provided - we don’t want to waste the reader’s time. ALWAYS translate text to English. NEVER be vague - describe with a minimum of ‘who, where, when, what, why, how’ when possible. DO NOT include IDs in the title or summary text.
Providing a schema for the response makes extracting the insights from the LLM astoundingly easy - assuming the correct structure is returned. I have found that Google’s Gemini model is fairly consistent in returning a complete and correct result.
(def generation-config
{:temperature 1
:topK 40
:topP 0.95
:maxOutputTokens 8192
:responseMimeType "application/json"
:responseSchema
{:type "object"
:required ["topics" "key_topics"]
:properties
{:key_topics {:type "array" :items {:type "string"}}
:topics
{:type "array"
:items
{:type "object"
:required ["title" "summary" "related_ids"]
:properties
{:title {:type "string"}
:summary {:type "string"}
:related_ids {:type "array"
:items {:type "number"}}}}}}}})
This LLM generation config is provided in Clojure’s data notation - when it is sent to Google, it is converted to JSON, which requires far more characters over the wire but is a standard. I only want to send the id and content to the LLM to summarize, and we pull these from the database with a simple Postgres query:
-- :name llm-get-latest-items :? :*
-- :doc retrieves the id and truncated content of items from the past 6 hours
SELECT id,
       LEFT(CONCAT(title, ': ', description), 500) AS content
FROM media.items
WHERE created > NOW() - interval '6 hours'
ORDER BY posted DESC
LIMIT :limit;
From here it’s a hop, skip, and a jump to send the schema, prompt, and headlines to Gemini. Below is the source code for performing this and saving the result to the database.
(def global-llm-model "gemini-2.0-flash")
(defn gemini-url [{:keys [model key]}]
(str "https://generativelanguage.googleapis.com/v1beta/models/" model ":generateContent?key=" key))
;; This function actually submits the POST request to Google
(defn summarize-news-request []
(let [dev? (or (:dev env) false)
limit (if dev? 500 3000)]
(log/info (str "Submitting summary request to Gemini with " limit " items."))
(client/post (gemini-url {:model global-llm-model :key (:gemini-key env)})
{:content-type :json
; the entire body is converted to JSON during submission
:body (json/write-str
{:contents
[{:role "user"
:parts [{:text (str news-summary-prompt (db/llm-get-latest-items {:limit limit}))}]}]
:generationConfig generation-config})})))
;; As the LLM returns the IDs related to each topic, and I don't show
;; the real IDs on the frontend, this function gets the numeric ID and
;; actual headline for each entry (to link users to related articles.)
(defn process-topics [topics]
(->> topics
(map (fn [topic] (assoc topic :related_items (map (fn [id] (db/get-item-for-llm {:id id})) (:related_ids topic)))))
(map (fn [topic] (assoc topic :uuid (random-uuid))))))
;; This function calls all of the other functions shown above
(defn summarize-news-to-db []
(log/info (str "Using LLM '" global-llm-model "' to summarize the last 6h of news."))
(try
(let [response (summarize-news-request)
body (json/read-json (:body response))
usage (:usageMetadata body)
text (get-in body [:candidates 0 :content :parts 0 :text] "{}")
text-edn (json/read-json text)
topics (:topics text-edn)
key-topics (:key_topics text-edn)]
(if (and topics key-topics)
(let [data {:topics (process-topics topics)
:key-topics key-topics}
; Here the JSON blobs with our result data and usage info are stored
summary (db/insert-six-hour-summary! {:usage_data usage
:summary_data data})]
(log/info (str "6h summary data saved to DB as ID " (:id summary))))))
(catch Exception e
(report-error e))))
Since I flipped all this tech on, I have collected 1,794 hourly summaries (75 days’ worth). In the DbGate tool, it is easy to inspect the JSONB blobs included with each row.
Let’s see what this is costing us by peeking at the usage_data for one of these summary items:
{
"totalTokenCount": 301017,
"promptTokenCount": 299477,
"promptTokensDetails": [
{
"modality": "TEXT",
"tokenCount": 299477 // 300k input tokens
}
],
"candidatesTokenCount": 1540,
"candidatesTokensDetails": [
{
"modality": "TEXT",
"tokenCount": 1540 // 1540 output tokens
}
]
}
This amounts to $0 of usage. The Gemini API1 is actually free2 - with the caveat that the input data is harvested by Google to better understand how to build their product. On the paid tier, each of these requests would amount to $0.03 of input tokens and $0.000616 of output tokens (output tokens are priced four times higher than input tokens).
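As a sanity check on those numbers, the arithmetic is simple enough to sketch in Clojure (the rates are the published paid-tier prices2 at the time of writing - $0.10 per million input tokens and $0.40 per million output tokens):

;; Rough per-request cost computed from the usage metadata above.
(defn request-cost [{:keys [promptTokenCount candidatesTokenCount]}]
  (+ (* promptTokenCount     (/ 0.10 1e6))   ; input tokens
     (* candidatesTokenCount (/ 0.40 1e6)))) ; output tokens

;; (request-cost {:promptTokenCount 299477 :candidatesTokenCount 1540})
;; => 0.0305637 - about $0.03 per request, or roughly $0.73/day at 24 requests/day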
A Preview of the Returned Data¶
Here is one of the five ‘topics’ returned. Note that the related_items section is added after the fact by the process-topics function, which performs a simple id lookup.
{
"uuid": "d284cc42-9c67-434e-b15d-a1b4e31e6308",
"title": "Russia-Ukraine & Global Politics",
"summary": "Putin to skip Ukraine-Russia talks in Turkey, sending
negotiators instead. EU considers new sanctions against
Russia. Trump seeks Qatar's help to push Iran into a
nuclear agreement. Lula asks Putin to meet with Zelensky.
Top US anti-abortion groups are now pressuring the Trump
admin to ban mifepristone.",
"related_ids": [
351030539,
351020382,
350990116,
350989139,
350974967,
350968312,
350957563,
350947054,
350947066
],
"related_items": [
{
"id": 351030539,
"uuid": "a81e9aa9-6535-4d78-8901-ef8fd821fca5",
"title": "Putin To Skip Russia-Ukraine Negotiation Talks In Turkey"
},
{
"id": 351020382,
"uuid": "11551c01-aeff-4f27-a7e3-2588e4a55c70",
"title": "Украина и США скоординировали позиции перед переговорами в Стамбуле"
},
null, //<-- This ID must have been hallucinated by Gemini.
{
"id": 350989139,
"uuid": "df19d42f-e829-429c-857d-968beefd6799",
"title": "Putin to skip Ukraine talks, Russian team includes seasoned negotiators"
},
{
"id": 350974967,
"uuid": "3450691b-a688-40b9-8fa4-71a1cdd382dc",
"title": "\"Відмова від миру має свою ціну\": Сибіга зустрівся з Рубіо в Туреччині"
},
{
"id": 350968312,
"uuid": "54a1c80c-1f38-4814-8224-4145e82b3bc8",
"title": "‘We should all get into the ring’ - Bohemians boss
backs Roddy Collins’ feud-busting suggestion"
},
{
"id": 350957563,
"uuid": "7579e7d1-eaba-4f78-8f83-3097e49d610a",
"title": "В Швейцарии рассказали о ловушке для Зеленского"
},
{
"id": 350947054,
"uuid": "218b16b4-3de0-4a8b-8b53-513d58f92a0d",
"title": "Pope Leo says he will make 'every effort' for world peace"
},
{
"id": 350947066,
"uuid": "2666ee20-9cf6-48af-bffc-b3725c7ce6fc",
"title": "Trump administration rescinds curbs on AI chip exports to foreign markets"
}
]
}
Notably, three of these related articles are not in English. Per the database, the language codes I was able to tag them with were RU and UK (unsurprisingly).
[
  { "title": "Украина и США скоординировали позиции перед переговорами в Стамбуле" },
  { "title": "\"Відмова від миру має свою ціну\": Сибіга зустрівся з Рубіо в Туреччині" },
  { "title": "В Швейцарии рассказали о ловушке для Зеленского" }
]
Translated, they read:
[
  { "title": "Ukraine and the US coordinated positions ahead of talks in Istanbul" },
  { "title": "\"Refusal of peace has its price\": Sybiha met with Rubio in Turkey" },
  { "title": "Switzerland spoke about a trap for Zelensky" }
]
It is fascinating to me that LLMs are able to recognize information across languages. Even though I can correctly tag incoming languages3, it will be difficult to break from reliance on LLMs to correlate topics across languages without first translating everything to English (which may be the correct solution - Google’s Cloud Translation API is priced at $20 per million characters, though an imprecise LLM translation would be an order of magnitude or two cheaper).
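To put rough numbers on that comparison for a single 6h batch - assuming ~4 characters per token, so ~300k tokens is about 1.2M characters (an assumption, not a measurement):

;; Cloud Translation vs. a paid-tier LLM round-trip for one 6h batch.
(let [chars  1.2e6
      tokens 3e5]
  {:cloud-translation (* chars (/ 20 1e6))          ; => $24.00
   :llm-round-trip    (+ (* tokens (/ 0.10 1e6))    ; input tokens
                         (* tokens (/ 0.40 1e6)))}) ; output tokens => ~$0.15

That is roughly a 160x difference, which squares with the ‘order of magnitude or two’ estimate.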
Unfortunately, there are a few interesting problems with the data above. For starters, there is an Irish Mirror article included that doesn’t seem to be even loosely related to the war in Ukraine. I suspect this may be the LLM misreading a nearby ID, and I wonder if there is a way to provide an input schema.
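In the meantime, one cheap mitigation - just a sketch, not yet in the pipeline - is to validate the returned IDs against the set of IDs that were actually submitted, so misread or hallucinated IDs are dropped before the lookup instead of surfacing as nulls:

;; Keep only related_ids that appeared in the submitted batch; anything
;; else was misread or hallucinated by the LLM and gets dropped.
(defn prune-hallucinated-ids [submitted-items topics]
  (let [valid-ids (set (map :id submitted-items))]
    (map #(update % :related_ids (partial filter valid-ids)) topics)))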
The last piece of data returned on each call is an overall list of key topics; below is the example for this hour. I wonder if it would look interesting or be useful at all if I were to graph the rough frequency of these terms over time - a quick sketch of that tally follows the list.
[
"Trump",
"Ukraine",
"Russia",
"Japan",
"India",
"Harvard",
"Starmer",
"Boeing",
"Qatar",
"Cosmetics",
"Diddy",
"Sex Crimes",
"AI",
"Streaming",
"Bologna",
"AC Milan",
"Coppa Italia",
"Wages",
"Japan Wage Policy",
"Inflation",
"Tren De Aragua",
"Deportation",
"Kadoya Sesame Mills",
"Takemoto Oil & Fat",
"Cartel",
"AI Women",
"Okinawa National College of Technology",
"DCON"
]
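Here is the rough shape of that tally in Clojure. Note that db/get-recent-summaries is a hypothetical accessor standing in for whatever query returns the stored summary rows:

;; Count how often each key topic appears across stored summaries,
;; most frequent first - ready to plot against time.
(defn key-topic-frequencies [summaries]
  (->> summaries
       (mapcat #(get-in % [:summary_data :key-topics]))
       frequencies
       (sort-by val >)))

;; (key-topic-frequencies (db/get-recent-summaries {:days 75}))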
New Features & Next Steps¶
In order to apply an anomaly detection algorithm like ARIMA,4 I first need to be able to extract and store key topics from the incoming headlines, preferably without an LLM. The sections below cover the tools and methods from my brainstorming board for accomplishing this.
Topic Identification with CoreNLP¶
One of the primary goals of the project is self-reliance. Using a weak LLM from Google is not optimal - the more detection and analysis that I can offload to local systems, the better. I am already using CLD5 for local language detection, but I still need to figure out a cost-effective way to deploy local LLMs and translation services.6
CoreNLP is a second step towards technological self-reliance:
CoreNLP is your one-stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish.
…all potentially useful features to leverage in this project.
While CoreNLP offers a robust Java library for directly integrating
execution into an application, I would rather decouple these
components of the Observer system and allow CoreNLP to be used by
other microservices. Conveniently, I already have a networked and
reverse-proxied Alpine Linux server with Docker ready. Setting up a
CoreNLP service should be a snap with an independent docker-compose
project:
services:
corenlp:
image: nlpbox/corenlp
restart: unless-stopped
ports:
- "7243:9000"
# ^^ Obviously this port is *not exposed* to the internet
# and *MUST* be safely kept behind proxies and firewalls.
environment:
JAVA_XMX: "4g"
networks:
- internal-network
networks:
internal-network:
name: core-nlp-network
driver: bridge
To test the deployment, you (if you are following along) can POST a query to the new service:
wget --post-data 'The quick brown fox jumped over the lazy dog.' \
  '10.0.0.107:7243/?properties={"annotators":"tokenize,pos,lemma,depparse,natlog,openie","outputFormat":"json"}' \
  -O -
This returns a number of subject-relation-object triples like:
{
"subject": "quick brown fox",
"subjectSpan": [1, 4],
"relation": "jumped over",
"relationSpan": [4, 6],
"object": "lazy dog",
"objectSpan": [7, 9]
}
Awesome, we are ready to integrate!
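Here is a sketch of what that integration might look like from the Clojure side, reusing the same clj-http and JSON helpers as the Gemini code above (the address is the internal service from the compose file, so treat it as an assumption about my network):

;; POST text to the CoreNLP service and pull out the OpenIE triples.
;; The server responds with {"sentences": [{"openie": [...], ...}, ...]}.
(def corenlp-annotators "tokenize,pos,lemma,depparse,natlog,openie")

(defn extract-triples [text]
  (let [response (client/post "http://10.0.0.107:7243/"
                              {:query-params
                               {"properties" (json/write-str
                                              {:annotators corenlp-annotators
                                               :outputFormat "json"})}
                               :body text})
        body (json/read-json (:body response))]
    (mapcat :openie (:sentences body))))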
Topic Correlation Graphs¶
A second outcome for the CoreNLP integration is the construction of a graph with topic associations. I will need to store the relations between subjects and objects. This may be as simple as storing the output from the ‘openie’ annotator, as seen below:
"input": "Putin To Skip Russia-Ukraine Negotiation Talks In Turkey",
"openie": [
{
"subject": "Putin",
"subjectSpan": [0, 1],
"relation": "Skip",
"relationSpan": [2, 3],
"object": "Russia Ukraine Negotiation Talks",
"objectSpan": [3, 8]
},
{
"subject": "Putin",
"subjectSpan": [0, 1],
"relation": "Skip Russia Ukraine Negotiation Talks In",
"relationSpan": [2, 9],
"object": "Turkey",
"objectSpan": [9, 10]
}
]
As this data flows through the system, we could take a running count of each topic as articles are ingested.
CREATE TABLE topics (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL UNIQUE);
CREATE TABLE daily_topic_freqs (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
topic_id BIGINT NOT NULL REFERENCES topics(id) ON DELETE CASCADE,
day DATE NOT NULL,
count INTEGER NOT NULL CHECK (count >= 0),
UNIQUE (topic_id, day));
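On the ingestion side, bumping those counters could be as simple as the sketch below. Both db/ helpers are hypothetical HugSQL queries: get-or-create-topic! inserts into topics (or returns the existing row), and increment-daily-topic-freq! performs an ON CONFLICT (topic_id, day) upsert against daily_topic_freqs.

;; For every triple extracted from a headline, count its subject and
;; object as topics for today.
(defn record-topic-counts! [triples]
  (doseq [topic-name (mapcat (juxt :subject :object) triples)]
    (let [topic (db/get-or-create-topic! {:name topic-name})]
      (db/increment-daily-topic-freq! {:topic-id (:id topic)}))))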
Optionally, we could store these triples directly and perform additional analytics from the raw data. This may provide more opportunity for pivoting in the future, but will also take up more space in the database.
CREATE TABLE extracted_openie_triples (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  item_id BIGINT NOT NULL REFERENCES media.items(id) ON DELETE CASCADE,
subject TEXT NOT NULL,
relation TEXT NOT NULL,
object TEXT NOT NULL,
source_text TEXT,
extracted_on DATE NOT NULL DEFAULT CURRENT_DATE,
count BIGINT NOT NULL DEFAULT 1 CHECK (count >= 1));
The visualization I would want to see is something like a moving average for topics with the greatest movement and impact.
Anomaly Detection¶
From the above data, we ought to be able to perform anomaly detection. A simple initial application of Moving Average7 + Z-Score8 should be more than sufficient to prove out this idea.
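As a preview, a minimal version of that detector fits in a few lines of Clojure. Each topic’s count for a day is compared against the mean and standard deviation of a trailing window; the window size and threshold below are placeholder choices, not tuned values.

;; Flag days whose count sits more than `threshold` standard deviations
;; above the trailing `window` days - plain moving average + z-score.
(defn z-score [window x]
  (let [n    (count window)
        mean (/ (reduce + window) n)
        sd   (Math/sqrt (/ (reduce + (map #(Math/pow (- % mean) 2) window)) n))]
    (if (zero? sd) 0.0 (double (/ (- x mean) sd)))))

(defn anomalous-days
  "Returns [day-index count z] for each spike in a topic's daily counts."
  [counts & {:keys [window threshold] :or {window 14 threshold 3.0}}]
  (let [v (vec counts)]
    (for [i (range window (count v))
          :let [z (z-score (subvec v (- i window) i) (v i))]
          :when (> z threshold)]
      [i (v i) z])))

If ‘Ukraine’ suddenly triples its usual daily mention count, it should surface here well before it dominates a 6h summary.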
Stay tuned for the next post, when hopefully all this work will be done!
-R
Google AI for Developers: ai.google.dev ↩︎ ↩︎
Gemini 2.0 Flash Pricing: ai.google.dev ↩︎
Github: Shuyo / Language-Detection ↩︎
“Autoregressive Integrated Moving Average (ARIMA) Prediction Model” investopedia.com ↩︎
Github: LibreTranslate ↩︎
“Moving Average” wikipedia.org ↩︎