ML PhD at @cornellbowers.bsky.social: LLM reasoning, agents, and AI for Science. Can cycle, run, juggle. Currently trying combinations.
Paper on arXiv: 📄 arxiv.org/abs/2502.20377
PhantomWiki is the first suite to **quantify** LLM reasoning and retrieval. It is _the_ durable evaluation benchmark we need for the next generation of LLMs!
📈 PhantomWiki scales amazingly. In just 3 secs, we can generate 1K wiki pages, going beyond SOTA LLM 128K token limits. And in hours, Wikipedia-scale 1 million pages!
🚨 The universe of people and their relationships is generated randomly. So by construction, LLMs cannot memorize answers or cheat on PhantomWiki evaluation.