📖 Paper: arxiv.org/pdf/2512.06688
🤗 Data: huggingface.co/datasets/bo...
🎉 Huge thanks to my amazing collaborators, mentors, and advisors for making this work possible.
🧵(5/5)
🏆 With this data, a small reasoning model, Qwen3-4B, outperforms GPT-5, and an agentic memory delivers SOTA personalization with 16× efficiency.
🧵(3/5)
🌈 It spans 1000 personas and 20,000+ preferences over 300+ topics, enabling richer training and evaluation for personalized AI.
🧵(4/5)
This reflects a broader shift: AI that serves millions of users must move beyond one-size-fits-all behavior to enable long-term user engagement.
🧵(2/5)
🗣️ LLMs recall basic facts and preferences just fine -- but they struggle to apply your latest preferences in their responses.
🚨Hardest part? Applying your preferences in new situations.
🔍 RAG and 🧠 external memory modules help in personalization. (6/8)
📊 Gemini-1.5, GPT-4.5, and GPT-4.1 lead in overall accuracy, but still hover around 52% on multiple-choice.
🤔 Reasoning models (o4-mini, o1, and DeepSeek-R1) do not outperform their non-reasoning peers. (5/8)
📄 arXiv arxiv.org/pdf/2504.14225
🌐 Project Page zhuoqunhao.github.io/PersonaMem....
🐙 GitHub github.com/bowen-upenn...
🤗 Hugging Face huggingface.co/datasets/bo...
(3/8)
👩🏻💻 Evaluates LLMs' ability to understand evolving personas from 180+ multi-session user-chatbot conversation histories
🌟Realistic long-context evaluation up to 1M tokens (2/8)