Elie
@ebursztein.bsky.social
[Weekend Read] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - arxiv.org/pdf/2506.11763 The best deep research AI agents have yet to cross the 50% mark in comprehensiveness and depth. They also exhibit a 20% citation hallucination rate.
#AI #Research #search
September 7, 2025 at 8:51 PM
[Weekend Read] Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence - lnkd.in/gnBWstn7 Early data on the impact of #AI on #employment indicates that entry-level white-collar jobs, including software engineering, marketing, and support, are the most affected.
September 1, 2025 at 4:25 AM
[Weekend Read] How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? arxiv.org/abs/2502.14502 TL;DR: Don't use LoRA to add knowledge to LLMs.

Full research note and previous weekend read papers: notes.elie.net/Papers+revie...

#AI #LLM #LoRA
August 23, 2025 at 2:18 PM
[Weekend Read] Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data - arxiv.org/abs/2507.14805 Surprisingly, during knowledge distillation, student models implicitly acquire teacher-model characteristics even when training on unrelated data.

#AI #LLM #RLM
August 16, 2025 at 4:23 PM
Happy to announce that we open-sourced LMEval, a large model evaluation framework purpose-built to accurately and efficiently compare how models from various providers perform across benchmark datasets opensource.googleblog.com/2...

#AI #LLM #OSS
May 27, 2025 at 8:00 PM
The Leaderboard Illusion: arxiv.org/abs/2504.20879 Looks at some of the shortcomings of the Chatbot Arena, which has emerged as the go-to leaderboard for ranking the most capable #AI models. Recently, Meta abused some of them to game #Llama 4 Behemoth results - sherwood.news/tech/meta-scr...
May 24, 2025 at 9:05 AM
The Phare Benchmark results: key insights include that popularity on benchmarks like LMArena doesn't guarantee factual reliability, and that the more confidently a user phrases a query, the less willing models are to refute controversial claims (sycophancy) - www.giskard.ai/knowledge/go...
May 1, 2025 at 2:55 AM
[Weekend Read] Exploring LLM Reasoning Through Controlled Prompt Variations - arxiv.org/abs/2504.02111 Shows how critical it is to have only relevant data in the model context. Accurately filtering out irrelevant data is very difficult, and simply relying on vector search is NOT the answer.
#AI #RAG
April 27, 2025 at 12:23 AM
[Weekend Read] RealHarm: A Collection of Real-World Language Model Application Failures arxiv.org/abs/2504.10277 By looking at real-world examples of AI failures, this paper highlights the disconnect between what safety filters block and what goes wrong in practice.

#safety #research #ai
April 19, 2025 at 4:36 PM
[Weekend Read] Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - arxiv.org/pdf/2502.01839 If you are interested in understanding how scaling computation at generation time helps improve model performance, this is the paper to read.
April 6, 2025 at 7:43 AM
[Weekend Read] Measuring AI Ability to Complete Long Tasks - arxiv.org/pdf/2503.14499 The length of tasks models are able to perform roughly doubles every 7 months. That's encouraging; however, I am unsure how well this holds at the 99% success rate needed to trust agents.

#AI #LLM
March 22, 2025 at 4:30 PM
Accelerating Large-Scale Test Migration with LLMs - medium.com/airbnb-engineeri... Airbnb was able to leverage AI to reduce migration time by about 90% (6 weeks instead of an estimated 1.5 years).

#AI #airbnb
How Airbnb migrated nearly 3.5K Enzyme test files to React Testing Library in just 6 weeks using automation and LLMs
March 21, 2025 at 11:19 AM
[Weekend Read] AI Search Has A Citation Problem - www.cjr.org/tow_center/we-c... TL;DR: Building an agent is one thing; getting to the level of reliability where an agent can be trusted is a totally different ball game. Evaluating reliability is critical to true progress.

#AI #Agent #research
March 15, 2025 at 4:23 PM
#Gemma 3 is here! Our new open models are incredibly efficient - the largest 27B model runs on just one H100 GPU. You'd need at least 10x the compute to get similar performance from other models 👇

#AI #LLM
March 13, 2025 at 3:06 AM
[Weekend Read] Reasoning Language Models: A Blueprint - arxiv.org/abs/2501.11223 All you need to know on how thinking/reasoning models are trained and evaluated.

#AI #RLM #LLM
March 8, 2025 at 12:17 PM
[Weekend Read] The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers - www.microsoft.com/en-us/res... The more confident people are in #GenAI, the less they think critically.
#Research #AI #education
February 16, 2025 at 6:47 PM
[Weekend Read] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training - arxiv.org/pdf/2501.17161 - Using reinforcement learning (RL) helps models generalize, while SFT helps stabilize training.

#AI #LLM #research #RL
February 2, 2025 at 1:06 AM
[Weekend Read] Humanity’s Last Exam - static.scale.com/uploads/65... New large-scale knowledge benchmark where the best models barely reach 9%, with DeepSeek-R1 outperforming everyone.

#benchmark #research #AI #LLM #deepseek #openai #anthropic #gemini
January 25, 2025 at 11:49 PM
[Tool Tuesday] LLM, the best CLI utility for interacting with large models - https://github.com/simonw/llm This comprehensive tool supports images, local/remote models, and shell workflows. For example, you can type: cat mycode.py | llm -s "Explain this code"

#LLM #tool #AI
January 7, 2025 at 9:00 PM
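For context on the llm tool mentioned above, a few typical invocations (the file name and model name here are illustrative, not from the post):

```shell
# Install and store an API key (OpenAI shown as an example provider)
pip install llm
llm keys set openai

# One-off prompt straight from the shell
llm "Five creative names for a CLI tool"

# Pipe a file in, with -s setting a system prompt, as in the post
cat mycode.py | llm -s "Explain this code"

# Select a specific model with -m; llm models lists what is available
llm models
llm -m gpt-4o-mini "Summarize this" < README.md
```

This is what makes it handy in shell workflows: any command's output can be piped into a model as a prompt.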
[Weekend Read] Human Creativity in the Age of LLMs - https://arxiv.org/abs/2410.03703 - Worryingly, this study shows that #AI might boost short-term creativity at the expense of long-term creativity. Figuring out how to leverage #LLM tools without degrading long-term human capabilities is a very pressing issue.
December 1, 2024 at 2:25 AM
[Weekend Read] NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review - https://arxiv.org/pdf/2210.00379 - NeRF models make it possible to synthesize and render a 3D scene from any direction based on 2D images. This paper is a good summary of this must-know type of model.

#AI #research #3D
November 3, 2024 at 7:47 PM
[Weekend Read] Scaling Retrieval-Based Language Models with a Trillion-Token Datastore - https://arxiv.org/abs/2407.12854 Good baseline experiment that puts the RAG lift at about 8% on knowledge tasks, with smaller models benefiting more from it.

#LLM #Llama #research #AI #RAG
October 6, 2024 at 3:21 PM
[Tool Tuesday] RAGFlow: open-source #RAG retrieval system - https://github.com/infiniflow/ragflow
It has a nice UI, is easy to deploy, and implements a lot of novel techniques such as GraphRAG. A great out-of-the-box system if you want a chatbot that leverages specialized data.

#LLM #AI #OSS
August 27, 2024 at 10:28 PM
[Weekend Read] Fairness Definitions in Language Models Explained - https://arxiv.org/pdf/2407.18454 This paper does a great job of explaining simply the key ideas behind evaluating #LLM #fairness and provides key references. A very useful read if you are interested in the topic.

#AI #Research
August 24, 2024 at 7:01 PM
[Weekend Read] Adversaries Can Misuse Combinations of Safe Models - https://arxiv.org/abs/2406.14595?utm_source=bluesky&utm_campaign=elie Studies how adversaries can combine frontier models with weaker ones to complete dangerous tasks.

#LLM #AI #Research #Cybersecurity
July 13, 2024 at 4:43 PM