Joe Barrow

@jbarrow.bsky.social

NLP @ Pattern Data
Prev: Adobe Research, PhD UMD

Joe Barrow

@jbarrow.bsky.social

Now, just because we filtered for the cleanest forms doesn't mean we got _perfect_ forms. There are still a lot of inconsistencies in how people prepare forms! In future work I'll be looking at mitigating data quality issues like these.

September 24, 2025 at 5:51 PM

Joe Barrow

@jbarrow.bsky.social

(Note: this doesn't _just_ apply to Acrobat. FFDNet is also better than Apple Preview. Neither Acrobat nor Preview even attempts checkboxes, and both are often fooled by any straight horizontal line. Left: Acrobat, Right: FFDNet)

September 24, 2025 at 5:51 PM

Joe Barrow

@jbarrow.bsky.social

If we train object detectors to find the form fields on these pages, we get much cleaner forms than if we used Acrobat to prepare them automatically. (Left: Acrobat, Right: FFDNet)

September 24, 2025 at 5:51 PM

Joe Barrow

@jbarrow.bsky.social

Step 1 is to filter for the cleanest forms possible. We start with 8 million PDFs from Common Crawl and work our way down to ~60k of the cleanest forms we can find. The result is a ~500k-page dataset called CommonForms.

September 24, 2025 at 5:51 PM
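
To put the filtering funnel from the post above in perspective, here is a quick back-of-the-envelope sketch. It uses only the approximate figures quoted in the thread (8 million PDFs, ~60k forms, ~500k pages); the function name `funnel_stats` is made up for illustration, and the outputs are rough estimates, not numbers reported in the paper.

```python
def funnel_stats(total_pdfs, kept_forms, total_pages):
    """Return (fraction of PDFs kept, average pages per kept form)."""
    retention = kept_forms / total_pdfs
    pages_per_form = total_pages / kept_forms
    return retention, pages_per_form


# Approximate figures taken from the thread; treat the results as estimates.
retention, pages_per_form = funnel_stats(8_000_000, 60_000, 500_000)
print(f"{retention:.2%} of PDFs kept, ~{pages_per_form:.1f} pages per form")
```

Under these rounded inputs, fewer than 1 in 100 crawled PDFs survives the filter, and each kept form runs to roughly eight pages.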

Joe Barrow

@jbarrow.bsky.social

Paper thread of some work I’m *incredibly* proud of, my first single-author paper!

Converting a PDF to a fillable form is a hard problem, and a lot of solutions don’t work very well! In CommonForms, I show that you can train models that outperform Adobe Acrobat for <$500! 🧵