Peter Bloem
@pbloem.sigmoid.social.ap.brid.gy
Assistant prof. at the Learning and Reasoning group, Vrije Universiteit Amsterdam (searchable).

[bridged from https://sigmoid.social/@pbloem on the fediverse by https://fed.brid.gy/ ]
Uh oh.
November 22, 2025 at 9:31 AM
I don't post my runs very often, but this one was pretty nice. 20k on the Vliehors, the largest sand plain in Europe. I was a little late, so I spent most of the time running in the sunset.

Met some seals on the way.
November 9, 2025 at 5:27 PM
November 7, 2025 at 11:17 AM
As an epilogue, here is the proof of the main theorem with my annotations.

As proofs go it's pretty simple, mostly building on set theory and some juggling of inequalities.

The key structure is given above the heading: start with the statement of the […]

October 28, 2025 at 5:32 PM
We don't need to change the benchmarks, we simply need to change the grading, and adapt the prompt.

First, we pick some confidence level t (the probability the model assigns to its answer being correct). Then, we say: answer the question or abstain from answering […]
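
Roughly, in code (a toy sketch of one grading scheme with this property; the post is cut off, so the exact penalty the paper suggests may differ): score 1 for a correct answer, 0 for abstaining, and -t/(1-t) for a wrong answer, so that answering only pays off in expectation when the model's confidence exceeds t.

def grade(answer, truth, t):
    # Score one response under confidence-threshold grading:
    # 1 for correct, 0 for abstaining, -t/(1-t) for a wrong answer.
    if answer is None:                       # the model abstained ("I don't know")
        return 0.0
    return 1.0 if answer == truth else -t / (1 - t)

def expected_score_if_answering(confidence, t):
    # Expected score of answering when the model is `confidence` sure:
    # positive exactly when confidence > t.
    return confidence - (1 - confidence) * t / (1 - t)

print(expected_score_if_answering(0.8, t=0.75))   # ~ +0.2: answer
print(expected_score_if_answering(0.6, t=0.75))   # ~ -0.6: better to abstain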

October 28, 2025 at 5:22 PM
The argument is simply that all evaluations we use to rank LLMs—whether different versions of our own or top models from different labs—use binary grading.

Like in an exam, you're either right or wrong, and if you're wrong you get zero points. And in an exam […]
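
To see the incentive problem concretely (my arithmetic, not a quote from the paper): under binary grading, a guess that is right with probability p scores p in expectation, while abstaining scores 0, so even a wild guess never does worse than saying "I don't know".

def expected_binary_score(p_correct):
    # Binary grading: 1 point if right, 0 if wrong or if you abstain.
    # Guessing therefore never scores worse than abstaining in expectation.
    return p_correct * 1.0

print(expected_binary_score(0.1))   # 0.1 > 0: even a 10%-confident guess "pays"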

October 28, 2025 at 5:16 PM
Why does this happen? The proof doesn't give me much intuition despite its simplicity. But the discussion on calibration elucidated a lot for me.

"Calibration" refer to the ability of a network to correctly represent its own uncertainty. A well calibrated […]

October 28, 2025 at 4:51 PM
The general theorem extends this slightly. In this setting, we allow X to consist of prompts c with a set of valid and erroneous responses r. The instances in E and V are now pairs (c, r). Filtering the sets E and V by prompt gives us the subsets E_c and V_c […]
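
In code, the filtering step is just (my own illustration of the definition, nothing more):

def responses_for_prompt(S, c):
    # E_c (resp. V_c): the responses r such that the pair (c, r) is in S.
    return {r for (c_i, r) in S if c_i == c}

# e.g. if E = {("capital of NL?", "Paris"), ("capital of NL?", "Berlin"), ("2+2?", "5")},
# then responses_for_prompt(E, "capital of NL?") == {"Paris", "Berlin"}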

October 28, 2025 at 4:38 PM
Here's how the classifier (f hat) is defined. They just need something that does worse than the optimal classifier for the argument, but it's actually a pretty intuitive approach.

The classifier looks at the probability that our language model (p hat) […]
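
In pseudocode (a hedged sketch; the excerpt cuts off before the actual threshold, so tau is a placeholder here): label a candidate string as valid whenever the language model p hat assigns it probability above some threshold, and as an error otherwise.

def f_hat(x, p_hat, tau):
    # Classify a string by thresholding the probability the LM assigns to it.
    # tau is a placeholder: the paper's specific choice isn't in this excerpt.
    return "valid" if p_hat(x) > tau else "error"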

October 28, 2025 at 4:33 PM
The argument goes like this. Let p be some language model pre-trained just on V. It has seen only valid examples and is thus minimally likely to generate things from E.

Call the probability that it generates something from E "err". This is roughly our […]
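
In symbols, err is just the probability mass that p places on the error set (my phrasing of what the excerpt describes):

def err(p, E):
    # err = p(E) = sum of p(x) over x in E, assuming E is finite and
    # p(x) gives the probability of generating the string x.
    return sum(p(x) for x in E)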

October 28, 2025 at 4:25 PM
Why are we bothering with classification, when we are worried about generative models? Because the presented result shows that we can lower-bound the probability of hallucination (generating examples from E) by the probability that some classifier […]

October 28, 2025 at 4:20 PM
At this point you may be looking for a strict definition of "hallucination". As it turns out, for the argument of the paper, we don't need anything very precise. We just need to assume that someone like OpenAI has collected a large dataset of desirable and […]

October 28, 2025 at 4:05 PM
This doesn't just apply to failures to retrieve information the model doesn't know about. For tasks like letter counting, which are difficult for LLMs, incorrect answers are also considered potential hallucinations.

This is not something the model is expected to […]

October 28, 2025 at 4:00 PM
With that, let's start at the beginning. They open with a simple way to elicit a hallucination: ask the model for your birthday, and ask it to reply with just the date, but only if it knows.

If you try this on a relatively raw, open model like DeepSeek-V3 […]

October 28, 2025 at 3:53 PM
Let's do a deep dive into this paper: "Why Language Models Hallucinate."

When this came out, many people's summary was "even OpenAI admits that hallucinations are a fundamental problem of transformers/autoregressive models/LLMs."

I've seen many people […]

October 28, 2025 at 3:41 PM
The patient man's loss curve.
September 3, 2025 at 7:16 PM
Well done AI for bagsying humans with the Chinese room.
August 21, 2025 at 3:11 PM
Whoever is responsible for this should not have chosen a career in IT.

Whoever is responsible for this should have had a career in staying out of the way.
August 19, 2025 at 6:32 PM
Two official heatwaves per year is rare, according to the Dutch news. It's only happened before in 1941, 2006, 2018 and 2019.

Call me pessimistic, but looking at that sequence, I'd say it _used to_ be rare.
August 15, 2025 at 1:08 PM
Here's an odd effect (stumbled on by accident). The blue loss curve is from a well-tuned BERT baseline (from the "cramming" paper).

The only thing I changed for the orange is to put a residual connection around each transformer block and to multiply the […]
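
For concreteness, the kind of change described, as a PyTorch sketch (my own; the "multiply the …" part is cut off above, so whatever scaling was applied is left out):

import torch.nn as nn

class ExtraResidual(nn.Module):
    # Wrap an existing transformer block in one more residual connection.
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return x + self.block(x)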

July 27, 2025 at 3:10 PM
Oh, come on...

Can we please not make our cyclist-ridden country full of strange and untypical streets the testing ground for a manchild's misguided attempts at creating a technology he doesn't understand with a vast societal risk he doesn't respect.
July 24, 2025 at 2:42 PM
Now out in TMLR:

🍇 GRAPES: Learning to Sample Graphs for Scalable Graph Neural Networks 🍇

There's lots of work on sampling subgraphs for GNNs, but relatively little on making this sampling process _adaptive_. That is, learning to select the data from the […]

July 18, 2025 at 9:26 AM
I have long mentally muted any hype about new optimizers, but this Muon/MuonClip seems to be the real deal...

I'll have to dig into the details at some point. It seems that the ideas are a bit more complex than AdamW's, which is a shame. Still, the performance […]

July 12, 2025 at 11:41 AM
July 11, 2025 at 12:47 PM
When duckduckgo really comes through for you and your deeply unreliable memory...
July 9, 2025 at 6:06 PM