Elie
@ebursztein.bsky.social
[Weekend Read] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - arxiv.org/pdf/2506.11763 The best deep research AI agents have yet to cross the 50% mark in comprehensiveness and depth. They also exhibit a 20% citation hallucination rate.
#AI #Research #search
September 7, 2025 at 8:51 PM
[Weekend Read] Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence - lnkd.in/gnBWstn7 Early data on the impact of #AI on #employment indicates that entry-level white-collar jobs, including software engineering, marketing, and support, are the most affected.
September 1, 2025 at 4:25 AM
[Weekend Read] How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? arxiv.org/abs/2502.14502 TL;DR: Don't use LoRA to add knowledge to LLMs.

Full research note and previous weekend read papers: notes.elie.net/Papers+revie...

#AI #LLM #LoRA
August 23, 2025 at 2:18 PM
[Weekend Read] Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data - arxiv.org/abs/2507.14805 Surprisingly, during knowledge distillation, student models implicitly acquire teacher-model characteristics even when training on unrelated data.

#AI #LLM #RLM
August 16, 2025 at 4:23 PM
Happy to announce that we open-sourced LMEval, a large model evaluation framework purpose-built to accurately and efficiently compare how models from various providers perform across benchmark datasets opensource.googleblog.com/2...

#AI #LLM #OSS
May 27, 2025 at 8:00 PM
The Leaderboard Illusion: arxiv.org/abs/2504.20879 Looks at some of the shortcomings of the Chatbot Arena, which has emerged as the go-to leaderboard for ranking the most capable #AI models. Recently, Meta abused some of them to game #Llama 4 Behemoth results - sherwood.news/tech/meta-scr...
May 24, 2025 at 9:05 AM
The Phare Benchmark results: key insights include that popularity on benchmarks like LMArena doesn't guarantee factual reliability, and that the more confidently a user phrases a query, the less willing models are to refute controversial claims (sycophancy) - www.giskard.ai/knowledge/go...
May 1, 2025 at 2:55 AM
[Weekend Read] Exploring LLM Reasoning Through Controlled Prompt Variations - arxiv.org/abs/2504.02111 Shows how critical it is to have only relevant data in the model context. Accurately filtering out irrelevant data is very difficult, and simply relying on vector search is NOT the answer.
#AI #RAG
April 27, 2025 at 12:23 AM
[Weekend Read] RealHarm: A Collection of Real-World Language Model Application Failures arxiv.org/abs/2504.10277 By looking at real-world examples of AI failures, this paper highlights the disconnect between what safety filters block and what goes wrong in practice.

#safety #research #ai
April 19, 2025 at 4:36 PM
[Weekend Read] Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - arxiv.org/pdf/2502.01839 If you are interested in understanding how scaling computation at generation time helps improve model performance, this is the paper to read.
April 6, 2025 at 7:43 AM
[Weekend Read] Measuring AI Ability to Complete Long Tasks - arxiv.org/pdf/2503.14499 The length of tasks models are able to perform roughly doubles every 7 months. That's encouraging; however, I am unsure how well this holds at the 99% success rate needed to trust agents.

#AI #LLM
March 22, 2025 at 4:30 PM
Accelerating Large-Scale Test Migration with LLMs - medium.com/airbnb-engineeri... Airbnb was able to leverage AI to reduce migration time by about 90% (6 weeks instead of an estimated 1.5 years).

#AI #airbnb
How Airbnb migrated nearly 3.5K Enzyme test files to React Testing Library in just 6 weeks using automation and LLMs
March 21, 2025 at 11:19 AM
[Weekend Read] AI Search Has A Citation Problem - www.cjr.org/tow_center/we-c... TL;DR: Building an agent is one thing; getting to the level of reliability where an agent can be trusted is a totally different ball game. Evaluating reliability is critical to true progress.

#AI #Agent #research
March 15, 2025 at 4:23 PM
#Gemma 3 is here! Our new open models are incredibly efficient - the largest 27B model runs on just one H100 GPU. You'd need at least 10x the compute to get similar performance from other models 👇

#AI #LLM
March 13, 2025 at 3:06 AM
[Weekend Read] Reasoning Language Models: A Blueprint - arxiv.org/abs/2501.11223 All you need to know on how thinking/reasoning models are trained and evaluated.

#AI #RLM #LLM
March 8, 2025 at 12:17 PM
[Weekend Read] The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers - www.microsoft.com/en-us/res... The more confident people are in #GenAI, the less they think critically.
#Research #AI #education
February 16, 2025 at 6:47 PM
[Weekend Read] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training - arxiv.org/pdf/2501.17161 - Using reinforcement learning (RL) helps models generalize, while SFT helps stabilize training.

#AI #LLM #research #RL
February 2, 2025 at 1:06 AM
[Weekend Read] Humanity’s Last Exam - static.scale.com/uploads/65... New large-scale knowledge benchmark where the best models barely reach 9%, with DeepSeek-R1 outperforming everyone.

#benchmark #research #AI #LLM #deepseek #openai #anthropic #gemini
January 25, 2025 at 11:49 PM
[Tool Tuesday] LLM, the best CLI utility for interacting with large models - https://github.com/simonw/llm This comprehensive tool supports images, local/remote models, and shell workflows. For example, you can type: cat mycode.py | llm -s "Explain this code"

#LLM #tool #AI
January 7, 2025 at 9:00 PM
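For context on the llm tool mentioned above, a few typical invocations (the file name and model name here are illustrative, not from the post):

```shell
# Install and store an API key (OpenAI shown as an example provider)
pip install llm
llm keys set openai

# One-off prompt straight from the shell
llm "Five creative names for a CLI tool"

# Pipe a file in, with -s setting a system prompt, as in the post
cat mycode.py | llm -s "Explain this code"

# Select a specific model with -m; llm models lists what is available
llm models
llm -m gpt-4o-mini "Summarize this" < README.md
```

This is what makes it handy in shell workflows: any command's output can be piped into a model as a prompt.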
[Weekend Read] Human Creativity in the Age of LLMs - https://arxiv.org/abs/2410.03703 - Worryingly, this study shows that #AI might boost short-term creativity at the expense of long-term creativity. Figuring out how to leverage #LLM tools without degrading long-term human capabilities is a very pressing issue.
December 1, 2024 at 2:25 AM
[Weekend Read] NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review - https://arxiv.org/pdf/2210.00379 - NeRF models make it possible to synthesize and render a 3D scene from any direction based on 2D images. This paper is a good summary of this must-know type of model.

#AI #research #3D
November 3, 2024 at 7:47 PM
[Weekend Read] Scaling Retrieval-Based Language Models with a Trillion-Token Datastore - https://arxiv.org/abs/2407.12854 Good baseline experiment that puts the RAG lift at about 8% on knowledge tasks, with smaller models benefiting more from it.

#LLM #Llama #research #AI #RAG
October 6, 2024 at 3:21 PM
[Tool Tuesday] RAGFlow: open-source #RAG retrieval system - https://github.com/infiniflow/ragflow
It has a nice UI, is easy to deploy, and implements a lot of novel techniques such as GraphRAG. A great out-of-the-box system if you want a chatbot that leverages specialized data.

#LLM #AI #OSS
August 27, 2024 at 10:28 PM
[Weekend Read] Fairness Definitions in Language Models Explained - https://arxiv.org/pdf/2407.18454 This paper does a great job of explaining simply the key ideas behind evaluating #LLM #fairness and provides key references. A very useful read if you are interested in the topic.

#AI #Research
August 24, 2024 at 7:01 PM
[Weekend Read] Adversaries Can Misuse Combinations of Safe Models - https://arxiv.org/abs/2406.14595?utm_source=bluesky&utm_campaign=elie Studies how adversaries can combine frontier models with weaker ones to complete dangerous tasks.

#LLM #AI #Research #Cybersecurity
July 13, 2024 at 4:43 PM