Joe Barrow
jbarrow.bsky.social
Joe Barrow
@jbarrow.bsky.social
NLP @ Pattern Data
Prev: Adobe Research, PhD UMD
Paper thread of some work I’m *incredibly* proud of, my first single author paper!

Converting a PDF to a fillable form is a hard problem, and a lot of solutions don’t work very well! In CommonForms, I show that you can train models that outperform Adobe Acrobat for <$500! 🧵
September 24, 2025 at 5:51 PM
In which I argue that LLM-generated bounding boxes are impressive, but not that useful (yet): notes.penpusher.app/Misc/Horsesh...
Horseshoes (and Hand Grenades) - LLM Localization is not Close, but not Close Enough - Joe Barrow
TL;DRLarge Multimodal Models (LMMs) can now output bounding boxes when given images as inputs. The results are impressive, but for documents they aren't good enough for real world use, yet. The Probl…
notes.penpusher.app
April 23, 2025 at 5:33 PM
Ah, yes, that ol' familiar unit of measure "AI TOPS"
January 7, 2025 at 8:13 PM
I put together a little guide on getting started with Google Gemini -- how to make multimodal calls, get structured outputs, and image bounding boxes to build an object detector.

notes.penpusher.app/Misc/Google+...
Google Gemini 101 - Object Detection with Vision and Structured Outputs - Joe Barrow - Obsidian Publish
This is a missing manual for how to get a simple working prototype up and running with Gemini's vision mode and structured outputs. I'm confident that manual exists elsewhere, but I haven't been able…
publish.obsidian.md
December 20, 2024 at 9:51 AM
Reposted by Joe Barrow
December 18, 2024 at 10:37 AM
Gemini 2.0 Flash is pretty good at localization in images. for an LMM (much better than GPT-4o in my experiments).
December 18, 2024 at 9:26 AM
ML history question: is there an earlier reference to pixel-only in-context (i.e. no fine-tuning) DocVQA performance than the GPT-4 announcement from OpenAI?
December 9, 2024 at 9:45 AM