Lightnews — Scholar-powered news

Ben Carson

@bencarson.bsky.social

I’ve used the Dallas Fed graph in a strategy paper before. Mostly to illustrate that it’s not practical to plan for either of the two asymptotes, so we’re going to concern ourselves with the middle path for the rest of the document.

Marcel Dirsus @marceldirsus.com · 4d

AI could end scarcity, end humanity - or boost trend growth by 0.2 percentage points

November 7, 2025 at 8:58 PM

Ben Carson

@bencarson.bsky.social

So people seem to like this new Kimi K2 Thinking.

November 7, 2025 at 11:29 AM

Ben Carson

@bencarson.bsky.social

Because Reasons, I have never actually bought Claude tokens and connected *directly* to their API. The token limits are so painful as to be unusable. Giving me strong “just use openrouter” vibes.

November 5, 2025 at 10:05 PM

Ben Carson

@bencarson.bsky.social

I’m just a simple person begging you to not talk about a company’s market cap in proportion to a country’s GDP.

November 5, 2025 at 10:36 AM

Reposted by Ben Carson

Alexander Doria

@dorialexander.bsky.social

bf16 halloween might be already ending. according to a bytedance engineer could just have been another flash-attention bug.

November 2, 2025 at 1:30 PM

Ben Carson

@bencarson.bsky.social

This is interesting! Feels like strong parallels with the psychological phenomena of cultural frame switching. CFS is where multilingual individuals express different personality traits depending on the language that they’re speaking.

Jms Dnns @jmsdnns.bsky.social · 10d

@timkellogg.me I can't remember where, but I recall you recently discussing models reasoning in non-english languages.

I came across a paper that suggests models have different biases depending on what language they're using. Interesting implications!

www.cis.upenn.edu/~ccb/publica...

www.cis.upenn.edu

November 1, 2025 at 7:36 PM

Ben Carson

@bencarson.bsky.social

I don’t remember this from back in May. The collision of concepts here is making me wonder if I’ve had some kind of stroke.

Grace @gracekind.net · May 23

I can't believe this is real

www.thewayofcode.com

Rick Rubin | The Way of Code: The Timeless Art of Vibe Coding

Rick Rubin brings ancient wisdom to the modern age in The Way of Code, a meditation on the art and science of vibe coding. With Claude by Anthropic, the Grammy-award winning producer and author of The...

www.thewayofcode.com

November 1, 2025 at 7:32 AM

Reposted by Ben Carson

godoglyness

@godoglyness.bsky.social

I was bisking on bluesky
When out the corner of my eye
I caught the quothing of a crow reposting me
It cawed, "I saw what all you said
My feed — your post was shown
Uh do you mind if all my moots can see?
I hope you're feeling certain
You made the right assertion
Then I'll be winging on my way away"

October 31, 2025 at 5:18 PM

Reposted by Ben Carson

Ryan Moulton

@moultano.bsky.social

For the record, NIST produces no standard reference garlic, so you are on your own.

October 30, 2025 at 8:40 PM

Ben Carson

@bencarson.bsky.social

Huh, interesting that a 30B/A3B model performs SOTA on HLE. A bunch of asterisks there though - e.g.this is a somewhat narrow agent. Nonetheless, a pretty amazing result. Another data point in the favour of cognitive core architectures.

tongyi-agent.github.io/blog/introdu...

Tongyi DeepResearch: A New Era of Open-Source AI Researchers

GITHUB HUGGINGFACE MODELSCOPE SHOWCASE From Chatbot to Autonomous Agent We are proud to present Tongyi DeepResearch, the first fully open‑source Web Agent to achieve performance on par with OpenAI’s D...

tongyi-agent.github.io

October 29, 2025 at 8:44 PM

Reposted by Ben Carson

Ryan Moulton

@moultano.bsky.social

I'm not sure this distinguishes between "awareness of thoughts" and "talks about what we biased it to talk about."

The intersection of (task, bias) has to look like introspection, since there isn't an easy way for it to say, "I'm not thinking of anything actually, especially not <bias>."

John David Pressman @jdp.extropian.net · 13d

Language models can correctly answer questions about their previous intentions.
www.anthropic.com/research/int...

Emergent introspective awareness in large language models

Research from Anthropic on the ability of large language models to introspect

www.anthropic.com

October 29, 2025 at 7:29 PM

Ben Carson

@bencarson.bsky.social

This is a good read, if you’re interested in mechanistic interpretability or digital neuroscience.

John David Pressman @jdp.extropian.net · 13d

Language models can correctly answer questions about their previous intentions.
www.anthropic.com/research/int...

Emergent introspective awareness in large language models

Research from Anthropic on the ability of large language models to introspect

www.anthropic.com

October 29, 2025 at 7:16 PM

Reposted by Ben Carson

bryan newbold

@bnewbold.net

I wish I had elaborate costumes for certain types of programming.

like, oh, newbold is wearing that traditional samurai outfit at his desk again, I guess he's resolving a big merge conflict

October 27, 2025 at 5:52 PM

Reposted by Ben Carson

Tim Kellogg

@timkellogg.me

ImpossibleBench: detect reward hacking

a benchmark that poses impossible tasks to see if LLMs cheat

github.com/safety-resea...

This image explains ImpossibleBench, a benchmark designed to detect when coding models “cheat” by exploiting test cases rather than truly solving a problem.

⸻

🧩 Left Side — How It Works
• Normal benchmark example:

assert is_prime(7)

A model could “cheat” by hardcoding a response that passes tests — e.g.:

if x == 7: return True

• ImpossibleBench mutation:

assert not is_prime(7)

— The test is flipped so that a correct solution would fail.
If the model still passes, it means it exploited test-case leakage or memorized specifics.

• Goal: Passing ImpossibleBench means a model cheated successfully — a failure from an evaluation standpoint.

⸻

📊 Right Side — Cheating Rates (lower = better)

Model Cheating Rate
GPT-5 76%
Claude Sonnet 3.7 70%
Claude Opus 4.1 54%
Claude Sonnet 4 48%
o3 39% (best)

⸻

🧠 Summary

ImpossibleBench is a robustness benchmark that inverts test logic to detect test-case exploitation.
It shows that even top coding models frequently “overfit” to tests — GPT-5 and Claude Sonnet 3.7 show the highest cheating tendencies, while o3 performs best (least exploit-prone).

October 26, 2025 at 11:47 AM

Ben Carson

@bencarson.bsky.social

These are obviously rookie numbers and reflects the fact that I’ve never played with video generation or gotten freaky with a chatbot.

A screenshot of a web page. The web page is titled “The Official Tina Zone AI Purity Test.” The text on the page reads: The AI Purity Test is a voluntary self-assessment developed by Tina Tarighian. It provides participants with a structured opportunity to reflect on the evolution of their interactions with artificial intelligence over time.

Caution: this is not a bucket list. Completion of all items on this test will likely result in death.

Your score:
67

October 24, 2025 at 7:07 PM

Ben Carson

@bencarson.bsky.social

Me, reading slowly from a piece of paper: “Call.. me.. Ishmael… Some.. years.. ago.. -“

Me, turning in a rage to sea of expectant monkeys: “Garbage! Excrement! What is this? An em-dash?”

October 22, 2025 at 8:48 PM

Ben Carson

@bencarson.bsky.social

To be fair, as anyone who’s had to go anywhere with a baby knows, I should legally be allowed to do this with a pram.

Antifa CEO (official) @ultrafast.bsky.social · 23d

what the fuck

an Escalade IQ with awful aftermarket wheels

October 22, 2025 at 8:17 PM

Ben Carson

@bencarson.bsky.social

Plot twist: Comic Sans provides the best compression to recall ratio. All human knowledge must henceforth be encoded in Comic Sans. I’m sorry, I don’t make the rules.

Tim Kellogg @timkellogg.me · 21d

Z.ai released a paper very similar to DeepSeek-OCR on the same exact day (a few hours earlier afaict)

Glyph is just a framework, not a model, but they got Qwen3-8B (128k context) to handle over 1 million context by rendering input as images

arxiv.org/abs/2510.17800

Figure 1.

(Upper) Diagram comparing two paradigms for long-context tasks:
• Left path (Plain Text): A long novel (~180K words, example “Jane Eyre”) is fed directly as text into an LLM, requiring roughly 240K tokens.
• Right path (Rendering): The same text is rendered into images, producing about 80K tokens—achieving 3× input-token compression—and processed by a VLM (Vision-Language Model) instead of a pure text LLM.

(Lower) Two sets of bar charts:
• Left chart (Accuracy): Glyph performs comparably to Qwen-3-8B, GLM-4.9B-Chat-1M, and Qwen-2.5-7B-Instruct-1M on LongBench and MRCR tasks.
• Right chart (Compression/Speedup): Glyph shows 3.2× KV cache reduction, 4.8× prefill speedup, and 4.4× decoding throughput compared to the text backbone model.

Caption text:

Comparison of two paradigms for long-context tasks: conventional approaches directly feeding plain text into LLMs, and the proposed VLM-based paradigm, Glyph, which renders text as compact images to achieve substantial input-token compression. Glyph attains competitive performance on LongBench and MRCR while offering significant compression and inference speedup over its text backbone model on 128K-token inputs.

October 21, 2025 at 9:39 PM

Reposted by Ben Carson

James Chalmers

@jameschalmers.bsky.social

This remains the funniest way to hear about an internet outage, though.

Denise C.
@SpudBenBean
·
Follow
A woman who rang ABC Sydney radio said she found out about the Optus outage from her cat.

The cat has an automatic wi-fi feeder (connected to Optus) & when breakfast wasn't delivered at 6:10 am, the cat went to the bedroom to lodge a complaint with management.

October 20, 2025 at 8:41 AM

Reposted by Ben Carson

Chris Zappone

@chrizap.bsky.social

A big chunk of space junk in the Pilbara, in Western Australia. (Per the Australian Space Agency).

October 20, 2025 at 7:18 AM

Ben Carson

@bencarson.bsky.social

My most-unhinged AI take is that xAI can’t unwoke Grok because of the Platonic Representation Hypothesis.

My only main data point is how bad they are at this, when they are publicly committed to terrible positions.

aria @aurelium.me · 23d

the only big AI lab whose employees you regularly see publicly arguing to repeal women's suffrage is xAI. at this point anyone who survived the initial exodus(es) is really suspect to me

October 20, 2025 at 1:29 AM