Elie
ebursztein.bsky.social
[Weekend Read] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - arxiv.org/pdf/2506.11763 The best deep research AI agents have yet to cross the 50% mark in terms of comprehensiveness and depth. They also exhibit a 20% citation hallucination rate.
#AI #Research #search
September 7, 2025 at 8:51 PM
[Weekend Read] Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence - lnkd.in/gnBWstn7 Early evidence on the impact of #AI on #employment indicates that entry-level white-collar jobs, including software engineering, marketing, and support, are the most affected.
September 1, 2025 at 4:25 AM
[Weekend Read] How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? - arxiv.org/abs/2502.14502 TL;DR: Don't use LoRA to add knowledge to LLMs.

Full research note and previous weekend read papers: notes.elie.net/Papers+revie...

#AI #LLM #LoRa
August 23, 2025 at 2:18 PM
[Weekend Read] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data - arxiv.org/abs/2507.14805 Surprisingly, during knowledge distillation, student models unconsciously acquire teacher model characteristics even when training on unrelated data.

#AI #LLM #RLM
August 16, 2025 at 4:23 PM
Happy to announce that we open sourced LMEval, a large model evaluation framework purpose-built to accurately and efficiently compare how models from various providers perform across benchmark datasets - opensource.googleblog.com/2...

#AI #LLM #OSS
May 27, 2025 at 8:00 PM
The Leaderboard Illusion: arxiv.org/abs/2504.20879 A look at some of the shortcomings of Chatbot Arena, which has emerged as the go-to leaderboard for ranking the most capable #AI. Recently Meta abused some of them to game #Llama 4 Behemoth results - sherwood.news/tech/meta-scr...
May 24, 2025 at 9:05 AM
Key insights from the Phare benchmark results: popularity on benchmarks like LMArena doesn't guarantee factual reliability, and the more confidently a user phrases a query, the less willing models are to refute controversial claims (sycophancy) - www.giskard.ai/knowledge/go...
May 1, 2025 at 2:55 AM
[Weekend Read] Exploring LLM Reasoning Through Controlled Prompt Variations - arxiv.org/abs/2504.02111 Shows how critical it is to have only relevant data in the model context. Accurately filtering out irrelevant data is very difficult, and simply relying on vector search is NOT the answer.
#AI #RAG
April 27, 2025 at 12:23 AM
[Weekend Read] Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - arxiv.org/pdf/2502.01839 If you are interested in understanding how scaling computation at generation time helps improve model performance, this is the paper to read.
April 6, 2025 at 7:43 AM
[Weekend Read] Measuring AI Ability to Complete Long Tasks - arxiv.org/pdf/2503.14499 The length of the tasks models can complete roughly doubles every 7 months. That's encouraging; however, I am unsure how well this holds at the 99% success rate needed to trust agents.

#AI #LLM
March 22, 2025 at 4:30 PM
[Weekend Read] AI Search Has A Citation Problem - www.cjr.org/tow_center/we-c... TL;DR: Building an agent is one thing; getting to the level of reliability where an agent can be trusted is a totally different ball game. Evaluating reliability is critical to true progress.

#AI #Agent #research
March 15, 2025 at 4:23 PM
#Gemma 3 is here! Our new open models are incredibly efficient - the largest 27B model runs on just one H100 GPU. You'd need at least 10x the compute to get similar performance from other models 👇

#AI #LLM
March 13, 2025 at 3:06 AM
[Weekend Read] Reasoning Language Models: A Blueprint - arxiv.org/abs/2501.11223 All you need to know on how thinking/reasoning models are trained and evaluated.

#AI #RLM #LLM
March 8, 2025 at 12:17 PM
[Weekend Read] The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers - www.microsoft.com/en-us/res... The more confident people are in #GenAI, the less they think critically.
#Research #AI #education
February 16, 2025 at 6:47 PM
[Weekend Read] Humanity’s Last Exam - static.scale.com/uploads/65... A new large-scale knowledge benchmark where the best models barely reach 9%, with DeepSeek-R1 outperforming everyone.

#benchmark #research #AI #LLM #deepseek #openai #anthropic #gemini
January 25, 2025 at 11:49 PM
[Tool Tuesday] LLM, the best CLI utility for interacting with Large Models - https://github.com/simonw/llm This comprehensive tool supports images, local/remote models, and shell workflows. You can for example type: cat mycode.py | llm -s "Explain this code"

#LLM #tool #AI
January 7, 2025 at 9:00 PM
[Weekend Read] Human Creativity in the Age of LLMs - https://arxiv.org/abs/2410.03703 - Worryingly, this study shows that #AI might boost short-term creativity at the expense of long-term creativity. Figuring out how to leverage #LLM without degrading long-term human capabilities is a very pressing issue.
December 1, 2024 at 2:25 AM
[Weekend Read] NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review: https://arxiv.org/pdf/2210.00379 - NeRF models make it possible to synthesize and render a 3D scene from any direction based on 2D images. This paper is a good summary of this must-know type of model.

#AI #research #3D
November 3, 2024 at 7:47 PM
[Weekend Read] Scaling Retrieval-Based Language Models with a Trillion-Token Datastore - https://arxiv.org/abs/2407.12854 A good baseline experiment that puts the RAG lift at about 8% on knowledge tasks, with smaller models benefiting more from it.

#LLM #Llama #research #AI #RAG
October 6, 2024 at 3:21 PM
[Tool Tuesday] RagFlow: Open-source #RAG retrieval system - https://github.com/infiniflow/ragflow
It has a nice UI, is easy to deploy, and implements a lot of novel techniques such as GraphRAG. A great out-of-the-box system if you want a chatbot that leverages specialized data.

#LLM #AI #OSS
August 27, 2024 at 10:28 PM
[Weekend Read] Adversaries Can Misuse Combinations of Safe Models - https://arxiv.org/abs/2406.14595?utm_source=bluesky&utm_campaign=elie Studies how frontier models can be combined with weaker ones to complete dangerous tasks.

#LLM #AI #Research #Cybersecurity
July 13, 2024 at 4:43 PM
[Weekend Read] A Careful Examination of Large Language Model Performance on Grade School Arithmetic: http://arxiv.org/abs/2405.00332 By creating a new math benchmark (GSM1K) from scratch, the authors show that many models' training data is likely polluted with benchmark data.

#AI #LLM #GPT
June 1, 2024 at 10:54 PM
[Weekend Read] Better & Faster Large Language Models via Multi-token Prediction - arxiv.org/abs/2404.19737 Quite an interesting technique to increase LLM performance and speed: predict multiple tokens instead of just the next one.
#AI #Research #LLM
May 12, 2024 at 3:18 AM
[Weekend Read] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training - arxiv.org/abs/2403.09611 This paper studies the factors affecting multimodal LLM performance, including what I think is the best analysis of how the data mixture affects performance.

#AI #Research #LLM #VLM
April 13, 2024 at 8:59 PM
28% of private unicorns ($1B+ valuation) lost their unicorn status, according to new research - blog.equityzen.com/the-state-of... Is the pandemic bubble popping?

#startups #vc #tech
April 2, 2024 at 4:37 AM