“Make it work, make it right, make it fast.”
– Kent Beck
As I break into the next phase of my news anomaly-detection side project, I figured it would be useful to publicly summarize what I’ve done so far, and what will be implemented next!
The Current Process:
- Collect RSS headlines from a variety of sources
- Every hour, send the last 6h of data to an LLM
- Store this summary in the database and show the result to visitors
While this has been decently reliable, it cannot scale, doesn’t retain a sense of overarching narratives over time, and is only useful for getting a 1000ft view of what is happening globally. The system can only return information from the LLM’s training data and the immediate news headlines provided to it.
I’ve been hands-off since March while finishing up a project at work, but with any luck I’ll be able to implement the improvements discussed below soon.
Submitting the Past 6h of Headlines¶
For two months, the following prompt has been reliably provided to Gemini1 every hour along with the past six hours of news headlines from around the globe:
Summarize the 5 most important topics below with a focus on stories and issues with the greatest political, economic, or global impact. Never report celebrity news or gossip. Order these 5 topics by urgency/importance. Use specific examples of people, countries, companies, actions, or orders when identifying trends. Be extremely brief with the title and summary - try to limit the text within the summary string to less than 800 characters if possible while remaining accurate. Also include a list of ‘key topics’ - a list of the people, countries, and topics discussed in the articles provided. Always ensure that accurate and brief information is provided - we don’t want to waste the reader’s time. ALWAYS translate text to English. NEVER be vague - describe with a minimum of ‘who, where, when, what, why, how’ when possible. DO NOT include IDs in the title or summary text.
Providing a schema for the response makes extracting the insights from the LLM astoundingly easy - assuming the correct structure is returned. I have found that Google’s Gemini model is fairly consistent in returning a complete and correct result.
(def generation-config
{:temperature 1
:topK 40
:topP 0.95
:maxOutputTokens 8192
:responseMimeType "application/json"
:responseSchema
{:type "object"
:required ["topics" "key_topics"]
:properties
{:key_topics {:type "array" :items {:type "string"}}
:topics
{:type "array"
:items
{:type "object"
:required ["title" "summary" "related_ids"]
:properties
{:title {:type "string"}
:summary {:type "string"}
:related_ids {:type "array"
:items {:type "number"}}}}}}}})
This LLM generation config is provided in Clojure’s data notation - when it is sent to Google, it is converted to JSON, which requires far more characters over the wire but is a standard. I only want to send the id and content to the LLM to summarize, and we pull these from the database with a simple Postgres query:
-- :name llm-get-latest-items :? :*
-- :doc retrieves the id and truncated content of items from the past 6 hours
SELECT id,
       LEFT(CONCAT(title, ': ', description), 500) AS content
FROM media.items
WHERE created > NOW() - interval '6 hours'
ORDER BY posted DESC
LIMIT :limit;
From here it’s a hop, skip, and a jump to send the schema, prompt, and headlines to Gemini. Below is the source code for performing this and saving the result to the database.
(def global-llm-model "gemini-2.0-flash")
(defn gemini-url [{:keys [model key]}]
(str "https://generativelanguage.googleapis.com/v1beta/models/" model ":generateContent?key=" key))
;; This function actually submits the POST request to Google
(defn summarize-news-request []
(let [dev? (or (:dev env) false)
limit (if dev? 500 3000)]
(log/info (str "Submitting summary request to Gemini with " limit " items."))
(client/post (gemini-url {:model global-llm-model :key (:gemini-key env)})
{:content-type :json
; the entire body is converted to JSON during submission
:body (json/write-str
{:contents
[{:role "user"
:parts [{:text (str news-summary-prompt (db/llm-get-latest-items {:limit limit}))}]}]
:generationConfig generation-config})})))
;; As the LLM returns the IDs related to each topic, and I don't show
;; the real IDs on the frontend, this function gets the numeric ID and
;; actual headline for each entry (to link users to related articles.)
(defn process-topics [topics]
(->> topics
(map (fn [topic] (assoc topic :related_items (map (fn [id] (db/get-item-for-llm {:id id})) (:related_ids topic)))))
(map (fn [topic] (assoc topic :uuid (random-uuid))))))
;; This function calls all of the other functions shown above
(defn summarize-news-to-db []
(log/info (str "Using LLM '" global-llm-model "' to summarize the last 6h of news."))
(try
(let [response (summarize-news-request)
body (json/read-json (:body response))
usage (:usageMetadata body)
text (get-in body [:candidates 0 :content :parts 0 :text] "{}")
text-edn (json/read-json text)
topics (:topics text-edn)
key-topics (:key_topics text-edn)]
(if (and topics key-topics)
(let [data {:topics (process-topics topics)
:key-topics key-topics}
; Here the JSON blobs with our result data and usage info are stored
summary (db/insert-six-hour-summary! {:usage_data usage
:summary_data data})]
(log/info (str "6h summary data saved to DB as ID " (:id summary))))))
(catch Exception e
(report-error e))))
Since I flipped all this tech on, I have collected 1,794 hourly summaries (75 days’ worth). In the DbGate tool, it is easy to inspect the JSONB blobs included with each row.
Let’s see what this is costing us by peeking at the usage_data for one of these summary items:
{
"totalTokenCount": 301017,
"promptTokenCount": 299477,
"promptTokensDetails": [
{
"modality": "TEXT",
"tokenCount": 299477 // 300k input tokens
}
],
"candidatesTokenCount": 1540,
"candidatesTokensDetails": [
{
"modality": "TEXT",
"tokenCount": 1540 // 1540 output tokens
}
]
}
This amounts to $0 of usage. The Gemini API1 is actually free2 - with the caveat that the input data is harvested by Google to better understand how to build their product. On the paid tier, each of these requests would amount to $0.03 of input tokens and $0.000616 of output tokens (output tokens are priced four times higher than input tokens).
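As a sanity check on those numbers, the arithmetic is simple enough to sketch in Clojure (the rates are the published paid-tier prices2 at the time of writing - $0.10 per million input tokens and $0.40 per million output tokens):

;; Rough per-request cost computed from the usage metadata above.
(defn request-cost [{:keys [promptTokenCount candidatesTokenCount]}]
  (+ (* promptTokenCount     (/ 0.10 1e6))   ; input tokens
     (* candidatesTokenCount (/ 0.40 1e6)))) ; output tokens

;; (request-cost {:promptTokenCount 299477 :candidatesTokenCount 1540})
;; => 0.0305637 - about $0.03 per request, or roughly $0.73/day at 24 requests/day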
A Preview of the Returned Data¶
Here is one of the five ‘topics’ returned. Note that the related_items section is added after the fact by the process-topics function, which performs a simple id lookup.
{
"uuid": "d284cc42-9c67-434e-b15d-a1b4e31e6308",
"title": "Russia-Ukraine & Global Politics",
"summary": "Putin to skip Ukraine-Russia talks in Turkey, sending
negotiators instead. EU considers new sanctions against
Russia. Trump seeks Qatar's help to push Iran into a
nuclear agreement. Lula asks Putin to meet with Zelensky.
Top US anti-abortion groups are now pressuring the Trump
admin to ban mifepristone.",
"related_ids": [
351030539,
351020382,
350990116,
350989139,
350974967,
350968312,
350957563,
350947054,
350947066
],
"related_items": [
{
"id": 351030539,
"uuid": "a81e9aa9-6535-4d78-8901-ef8fd821fca5",
"title": "Putin To Skip Russia-Ukraine Negotiation Talks In Turkey"
},
{
"id": 351020382,
"uuid": "11551c01-aeff-4f27-a7e3-2588e4a55c70",
"title": "Украина и США скоординировали позиции перед переговорами в Стамбуле"
},
null, //<-- This ID must have been hallucinated by Gemini.
{
"id": 350989139,
"uuid": "df19d42f-e829-429c-857d-968beefd6799",
"title": "Putin to skip Ukraine talks, Russian team includes seasoned negotiators"
},
{
"id": 350974967,
"uuid": "3450691b-a688-40b9-8fa4-71a1cdd382dc",
"title": "\"Відмова від миру має свою ціну\": Сибіга зустрівся з Рубіо в Туреччині"
},
{
"id": 350968312,
"uuid": "54a1c80c-1f38-4814-8224-4145e82b3bc8",
"title": "‘We should all get into the ring’ - Bohemians boss
backs Roddy Collins’ feud-busting suggestion"
},
{
"id": 350957563,
"uuid": "7579e7d1-eaba-4f78-8f83-3097e49d610a",
"title": "В Швейцарии рассказали о ловушке для Зеленского"
},
{
"id": 350947054,
"uuid": "218b16b4-3de0-4a8b-8b53-513d58f92a0d",
"title": "Pope Leo says he will make 'every effort' for world peace"
},
{
"id": 350947066,
"uuid": "2666ee20-9cf6-48af-bffc-b3725c7ce6fc",
"title": "Trump administration rescinds curbs on AI chip exports to foreign markets"
}
]
}
Notably, three of these related articles are not in English. Per the database, the language codes I was able to tag them with were RU and UK (unsurprisingly).
[
  { "title": "Украина и США скоординировали позиции перед переговорами в Стамбуле" },
  { "title": "\"Відмова від миру має свою ціну\": Сибіга зустрівся з Рубіо в Туреччині" },
  { "title": "В Швейцарии рассказали о ловушке для Зеленского" }
]
Translated, they read:
[
  { "title": "Ukraine and the US coordinated positions ahead of talks in Istanbul" },
  { "title": "\"Refusal of peace has its price\": Sybiha met with Rubio in Turkey" },
  { "title": "Switzerland spoke about a trap for Zelensky" }
]
It is fascinating to me that LLMs are able to recognize information across languages. Even though I can correctly tag incoming languages3, it will be difficult to break from reliance on LLMs to correlate topics across languages without first translating everything to English (which may be the correct solution - Google’s Cloud Translation API is priced at $20 per million characters, though an imprecise LLM translation would be an order of magnitude or two cheaper).
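To put rough numbers on that comparison for a single 6h batch - assuming ~4 characters per token, so ~300k tokens is about 1.2M characters (an assumption, not a measurement):

;; Cloud Translation vs. a paid-tier LLM round-trip for one 6h batch.
(let [chars  1.2e6
      tokens 3e5]
  {:cloud-translation (* chars (/ 20 1e6))          ; => $24.00
   :llm-round-trip    (+ (* tokens (/ 0.10 1e6))    ; input tokens
                         (* tokens (/ 0.40 1e6)))}) ; output tokens => ~$0.15

That is roughly a 160x difference, which squares with the ‘order of magnitude or two’ estimate.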
Unfortunately, there are a few interesting problems with the data above. For starters, there is an Irish Mirror article included that doesn’t seem to be even loosely related to the war in Ukraine. I suspect this may be the LLM misreading a nearby ID, and I wonder if there is a way to provide an input schema.
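In the meantime, one cheap mitigation - just a sketch, not yet in the pipeline - is to validate the returned IDs against the set of IDs that were actually submitted, so misread or hallucinated IDs are dropped before the lookup instead of surfacing as nulls:

;; Keep only related_ids that appeared in the submitted batch; anything
;; else was misread or hallucinated by the LLM and gets dropped.
(defn prune-hallucinated-ids [submitted-items topics]
  (let [valid-ids (set (map :id submitted-items))]
    (map #(update % :related_ids (partial filter valid-ids)) topics)))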
The last piece of data returned on each call is an overall list of key topics; below is the example for this hour. I wonder if it would look interesting or be useful at all if I were to graph the rough frequency of these terms over time - a quick sketch of that tally follows the list.
[
"Trump",
"Ukraine",
"Russia",
"Japan",
"India",
"Harvard",
"Starmer",
"Boeing",
"Qatar",
"Cosmetics",
"Diddy",
"Sex Crimes",
"AI",
"Streaming",
"Bologna",
"AC Milan",
"Coppa Italia",
"Wages",
"Japan Wage Policy",
"Inflation",
"Tren De Aragua",
"Deportation",
"Kadoya Sesame Mills",
"Takemoto Oil & Fat",
"Cartel",
"AI Women",
"Okinawa National College of Technology",
"DCON"
]
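Here is the rough shape of that tally in Clojure. Note that db/get-recent-summaries is a hypothetical accessor standing in for whatever query returns the stored summary rows:

;; Count how often each key topic appears across stored summaries,
;; most frequent first - ready to plot against time.
(defn key-topic-frequencies [summaries]
  (->> summaries
       (mapcat #(get-in % [:summary_data :key-topics]))
       frequencies
       (sort-by val >)))

;; (key-topic-frequencies (db/get-recent-summaries {:days 75}))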
New Features & Next Steps¶
In order to apply an anomaly detection algorithm like ARIMA,4 I first need to be able to extract and store key topics from the incoming headlines, preferably without an LLM. The sections below cover the tools and methods from my brainstorming board for accomplishing this.
Topic Identification with CoreNLP¶
One of the primary goals of the project is self-reliance. Using a weak LLM from Google is not optimal - the more detection and analysis that I can offload to local systems, the better. I am already using CLD5 for local language detection, but I still need to figure out a cost-effective way to deploy local LLMs and translation services.6
CoreNLP is a second step towards technological self-reliance:
CoreNLP is your one-stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish.
…all potentially useful features to leverage in this project.
While CoreNLP offers a robust Java library for directly integrating
execution into an application, I would rather decouple these
components of the Observer system and allow CoreNLP to be used by
other microservices. Conveniently, I already have a networked and
reverse-proxied Alpine Linux server with Docker ready. Setting up a
CoreNLP service should be a snap with an independent docker-compose
project:
services:
corenlp:
image: nlpbox/corenlp
restart: unless-stopped
ports:
- "7243:9000"
# ^^ Obviously this port is *not exposed* to the internet
# and *MUST* be safely kept behind proxies and firewalls.
environment:
JAVA_XMX: "4g"
networks:
- internal-network
networks:
internal-network:
name: core-nlp-network
driver: bridge
To test the deployment, you (if you are following along) can POST a query to the new service:
wget --post-data 'The quick brown fox jumped over the lazy dog.' \
  '10.0.0.107:7243/?properties={"annotators":"tokenize,pos,lemma,depparse,natlog,openie","outputFormat":"json"}' \
  -O -
This returns a number of subject-relation-object triples like:
{
"subject": "quick brown fox",
"subjectSpan": [1, 4],
"relation": "jumped over",
"relationSpan": [4, 6],
"object": "lazy dog",
"objectSpan": [7, 9]
}
Awesome, we are ready to integrate!
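Here is a sketch of what that integration might look like from the Clojure side, reusing the same clj-http and JSON helpers as the Gemini code above (the address is the internal service from the compose file, so treat it as an assumption about my network):

;; POST text to the CoreNLP service and pull out the OpenIE triples.
;; The server responds with {"sentences": [{"openie": [...], ...}, ...]}.
(def corenlp-annotators "tokenize,pos,lemma,depparse,natlog,openie")

(defn extract-triples [text]
  (let [response (client/post "http://10.0.0.107:7243/"
                              {:query-params
                               {"properties" (json/write-str
                                              {:annotators corenlp-annotators
                                               :outputFormat "json"})}
                               :body text})
        body (json/read-json (:body response))]
    (mapcat :openie (:sentences body))))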
Topic Correlation Graphs¶
A second outcome for the CoreNLP integration is the construction of a graph with topic associations. I will need to store the relations between subjects and objects. This may be as simple as storing the output from the ‘openie’ annotator, as seen below:
"input": "Putin To Skip Russia-Ukraine Negotiation Talks In Turkey",
"openie": [
{
"subject": "Putin",
"subjectSpan": [0, 1],
"relation": "Skip",
"relationSpan": [2, 3],
"object": "Russia Ukraine Negotiation Talks",
"objectSpan": [3, 8]
},
{
"subject": "Putin",
"subjectSpan": [0, 1],
"relation": "Skip Russia Ukraine Negotiation Talks In",
"relationSpan": [2, 9],
"object": "Turkey",
"objectSpan": [9, 10]
}
]
As this data flows through the system, we could take a running count of each topic as articles are ingested.
CREATE TABLE topics (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL UNIQUE);
CREATE TABLE daily_topic_freqs (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
topic_id BIGINT NOT NULL REFERENCES topics(id) ON DELETE CASCADE,
day DATE NOT NULL,
count INTEGER NOT NULL CHECK (count >= 0),
UNIQUE (topic_id, day));
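On the ingestion side, bumping those counters could be as simple as the sketch below. Both db/ helpers are hypothetical HugSQL queries: get-or-create-topic! inserts into topics (or returns the existing row), and increment-daily-topic-freq! performs an ON CONFLICT (topic_id, day) upsert against daily_topic_freqs.

;; For every triple extracted from a headline, count its subject and
;; object as topics for today.
(defn record-topic-counts! [triples]
  (doseq [topic-name (mapcat (juxt :subject :object) triples)]
    (let [topic (db/get-or-create-topic! {:name topic-name})]
      (db/increment-daily-topic-freq! {:topic-id (:id topic)}))))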
Optionally, we could store these triples directly and perform additional analytics from the raw data. This may provide more opportunity for pivoting in the future, but will also take up more space in the database.
CREATE TABLE extracted_openie_triples (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  item_id BIGINT NOT NULL REFERENCES media.items(id) ON DELETE CASCADE,
subject TEXT NOT NULL,
relation TEXT NOT NULL,
object TEXT NOT NULL,
source_text TEXT,
extracted_on DATE NOT NULL DEFAULT CURRENT_DATE,
count BIGINT NOT NULL DEFAULT 1 CHECK (count >= 1));
The visualization I would want to see is something like a moving average for topics with the greatest movement and impact.
Anomaly Detection¶
From the above data, we ought to be able to perform anomaly detection. A simple initial application of Moving Average7 + Z-Score8 should be more than sufficient to prove out this idea.
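As a preview, a minimal version of that detector fits in a few lines of Clojure. Each topic’s count for a day is compared against the mean and standard deviation of a trailing window; the window size and threshold below are placeholder choices, not tuned values.

;; Flag days whose count sits more than `threshold` standard deviations
;; above the trailing `window` days - plain moving average + z-score.
(defn z-score [window x]
  (let [n    (count window)
        mean (/ (reduce + window) n)
        sd   (Math/sqrt (/ (reduce + (map #(Math/pow (- % mean) 2) window)) n))]
    (if (zero? sd) 0.0 (double (/ (- x mean) sd)))))

(defn anomalous-days
  "Returns [day-index count z] for each spike in a topic's daily counts."
  [counts & {:keys [window threshold] :or {window 14 threshold 3.0}}]
  (let [v (vec counts)]
    (for [i (range window (count v))
          :let [z (z-score (subvec v (- i window) i) (v i))]
          :when (> z threshold)]
      [i (v i) z])))

If ‘Ukraine’ suddenly triples its usual daily mention count, it should surface here well before it dominates a 6h summary.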
Stay tuned for the next post, when hopefully all this work will be done!
-R
Google AI for Developers: ai.google.dev ↩︎ ↩︎
Gemini 2.0 Flash Pricing: ai.google.dev ↩︎
Github: Shuyo / Language-Detection ↩︎
“Autoregressive Integrated Moving Average (ARIMA) Prediction Model” investopedia.com ↩︎
Github: LibreTranslate ↩︎
“Moving Average” wikipedia.org ↩︎