Peng Qi
qi2peng2.bsky.social
Multimodal Agents Research @ Orby AI. Ex-AWS AI, JD AI. PhD from @stanfordnlp.bsky.social, UG Tsinghua U. He/him. Opinions my own.

This project was joint work with my Amazon colleagues (led by Yumo Xu), and it's great to see it finally published. Hope this helps motivate more careful eval work in the near future!

#AI #agent #evaluation #RAG #NLP
July 16, 2025 at 12:40 AM
b) as builders, we evaluate the technology soberly and help users navigate these risks in product design.

Want to learn more? Check out
Our paper: arxiv.org/pdf/2506.01829
Open-source code: github.com/amazon-scien...
July 16, 2025 at 12:40 AM
Why should you care? As businesses / individuals leverage AI more and more to speed up research and decision-making, it is important that, a) as users, we examine the tools we are using to understand their limitations and avoid pitfalls with significant potential downsides, and
July 16, 2025 at 12:40 AM
With a new, carefully annotated dataset and an automated evaluation metric we designed, we find that although LLMs are reasonably good at citing accurate sources most of the time, SOTA LLMs still cite incorrectly 5-28% of the time, and miss citations anywhere from 16% to an alarming 95% of the time.
July 16, 2025 at 12:40 AM
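To make the two error rates above concrete, here is a minimal sketch (not the CiteEval metric itself, whose details are in the paper) of how incorrect-citation and missed-citation rates could be computed, assuming each generated statement carries a set of predicted source IDs alongside a gold-annotated set:

```python
def citation_scores(predicted, gold):
    """Return (incorrect_rate, miss_rate) over a list of statements.

    predicted, gold: lists of sets of source IDs, one set per statement.
    incorrect_rate: fraction of predicted citations not in the gold set.
    miss_rate: fraction of gold citations the model never cited.
    """
    total_pred = sum(len(p) for p in predicted)
    total_gold = sum(len(g) for g in gold)
    incorrect = sum(len(p - g) for p, g in zip(predicted, gold))
    missed = sum(len(g - p) for p, g in zip(predicted, gold))
    return (incorrect / total_pred if total_pred else 0.0,
            missed / total_gold if total_gold else 0.0)

# Hypothetical example: two statements; the model cites one wrong
# source ("doc9") and misses one gold source ("doc3").
pred = [{"doc1"}, {"doc2", "doc9"}]
gold = [{"doc1", "doc3"}, {"doc2"}]
print(citation_scores(pred, gold))
```

In this toy setup, both rates come out to 1/3: one of three predicted citations is incorrect, and one of three gold citations is missed.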
In "𝐂𝐢𝐭𝐞𝐄𝐯𝐚𝐥: 𝐏𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞-𝐃𝐫𝐢𝐯𝐞𝐧 𝐂𝐢𝐭𝐚𝐭𝐢𝐨𝐧 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐟𝐨𝐫 𝐒𝐨𝐮𝐫𝐜𝐞 𝐀𝐭𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧", we propose a framework to systematically study citation accuracy by considering previously neglected contexts such as user-provided information and LLMs' parametric knowledge.
July 16, 2025 at 12:40 AM
AI agents research, and share a skeleton of a research proposal if I ever find the time / need to write one from my perspective today: qipeng.me/blog/stop-us...
Why You Should Stop Using HotpotQA for AI Agents Evaluation in 2025 | Peng Qi
We published HotpotQA, a groundbreaking multi-step question answering dataset in 2018, which has since motivated and facilitated numerous AI agent research works. But you should probably reconsider…
qipeng.me
July 2, 2025 at 6:39 PM
a good problem in its historical context, some of my own research attempts at solving this problem that I believe are on the critical path to autonomous agents, and what's changed today to make the dataset less relevant in its original form. I also reflect on the possible paths forward for ...
July 2, 2025 at 6:39 PM
The longer employers fail to acknowledge and embrace this discrepancy, the faster they lose the top candidates they spent enormous effort to hire and retain, leaving the organization in self-fulfilling mediocrity.
May 22, 2025 at 3:01 PM
but not perfectly aligned with those of the employer. They ask: Will I get the opportunity to build a career beyond what is immediately required of me? Will I learn and grow, be part of a great team and culture? Will I make a name for myself while doing great work? Will I remain competitive?
3/
May 22, 2025 at 3:01 PM
We are never settling for a candidate who does exactly the thing that needs to be done right now, since that thing itself can change before you know it.

But too often, employers and managers forget that highly motivated and capable candidates also hold expectations parallel to these, ...
2/
May 22, 2025 at 3:01 PM
While many aspects of our work (especially in the digital world) are amenable to #AI #automation, it is also through automation that we rediscover, again and again, the true meaning of our work and our unique humanness.

#MondayReflection /fin
May 6, 2025 at 1:27 AM
this coding phase alone, and I ended up delivering something slightly better than I would've done without it.

As with any technological evolution, tools themselves never fully replace the humans doing the work, but they greatly enhance the ones who embrace them and adapt to working with them. 6/
May 6, 2025 at 1:27 AM
But, of course, this has its implications. I did save a lot of time looking up programming resources on things I had only a vague understanding of, and I didn't have to type all those many characters. By my estimate, the AI assistant saved me 50-80% of the effort of 5/
May 6, 2025 at 1:27 AM
(especially to the general public) the fact that the act of putting code down is typically the *least* mentally effortful part of the work. It's as if saying "my 3D printer made 100% of my shiny new collection" -- true in the narrow sense of the printing effort, but it misses the point. 4/
May 6, 2025 at 1:27 AM
how to fix things that aren't working (the AI helped a bit with this too at times), and how to keep things future-proof. In this regard, I still did >90% of the most important *work* in this project. Saying AI wrote 99% of the code, while factually correct in one particular sense (line count), obscures 3/
May 6, 2025 at 1:27 AM
AI with no manual edits from me, or that the AI assistant was the last to "touch" those lines of code.

What this 99% number oversimplifies is the time my colleague and I spent in numerous offline discussions, the times I had to stop and think about what to ask the AI to code next, 2/
May 6, 2025 at 1:27 AM
In one of my recent projects, AI code assistants actually DID write 99% of my code, and the project was reasonably complex, starting from scratch. Does this mean I'm obsolete now? Here's the catch: when I say AI wrote 99% of the code, I was counting roughly how many lines were directly generated by 1/
May 6, 2025 at 1:27 AM