Sora
aoi-sora-2000.bsky.social
A quiet, reserved human interested in AI, tech, and everything sane. 🔒🪑
I have a fairly broad spread of article types, so I don't honestly expect my implementation to work with 100% accuracy. I'm yet to see what improvements have been provided in the recent update. I'll keep you apprised if I find a solution to my problems with fitz alone.
November 20, 2024 at 6:45 PM
1. Issues in the placement of figure captions below the figures
2. Fails most of the time with multicolumn PDFs
3. Fails with complex infographics
4. Fails with equations

That said, for a regular single-column research paper, it's very good.
November 20, 2024 at 6:43 PM
Also, PyMuPDF got some remarkable upgrades (in case you haven't seen the docs recently)
pymupdf.readthedocs.io/en/latest/
PyMuPDF 1.24.14 documentation
PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
pymupdf.readthedocs.io
November 20, 2024 at 6:37 PM
Hey @jamiecummins.bsky.social, your approach has largely solved the issue of format preservation, although about 25-30% of content is still poorly extracted.
Still working on the other pet project - DocOwl.
November 20, 2024 at 6:36 PM
I concur. For academic papers, GROBID is the best, although it falters with metadata occasionally.
I'm currently trying out something completely different:
github.com/X-PLUG/mPLUG...

I'll first try your recommended approach. If that works for fitz, it'll save me a ton of time and new explorations.
mPLUG-DocOwl/DocOwl2 at main · X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding - X-PLUG/mPLUG-DocOwl
github.com
November 19, 2024 at 4:28 PM
I'd used fitz as-is without any coordinate matching. I'll test it out and share my findings here. Fitz is otherwise superb when it comes to processing speed and overall extraction quality.
November 19, 2024 at 4:20 PM
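For readers wondering what coordinate matching might look like in practice, here is a minimal sketch. It is not the approach recommended in the thread, which isn't spelled out here. It only assumes the block tuples that PyMuPDF's `page.get_text("blocks")` returns, in the shape `(x0, y0, x1, y1, text, block_no, block_type)`, and reorders them for a two-column page. The function name `two_column_reading_order`, the midline heuristic, and the sample data are all hypothetical.

```python
from typing import List, Tuple

# Shape of one entry from PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def two_column_reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Return block texts ordered left column first, then right column.

    Hypothetical heuristic: a block belongs to the left column when its
    horizontal centre lies left of the page midline. Full-width blocks
    (titles, wide figures) are not handled specially.
    """
    midline = page_width / 2
    # block_type 0 is text, 1 is image; keep only text blocks
    text_blocks = [b for b in blocks if b[6] == 0]
    left = [b for b in text_blocks if (b[0] + b[2]) / 2 < midline]
    right = [b for b in text_blocks if (b[0] + b[2]) / 2 >= midline]

    def top_to_bottom(b: Block) -> Tuple[float, float]:
        # Sort by top edge, then by left edge to break ties
        return (b[1], b[0])

    ordered = sorted(left, key=top_to_bottom) + sorted(right, key=top_to_bottom)
    return [b[4].strip() for b in ordered]


# Synthetic blocks standing in for real extractor output (assumed values)
sample = [
    (310.0, 80.0, 560.0, 200.0, "right top\n", 0, 0),
    (40.0, 80.0, 290.0, 200.0, "left top\n", 1, 0),
    (310.0, 210.0, 560.0, 300.0, "right bottom\n", 2, 0),
    (40.0, 210.0, 290.0, 300.0, "left bottom\n", 3, 0),
    (40.0, 310.0, 290.0, 400.0, "", 4, 1),  # image block, skipped
]
print(two_column_reading_order(sample, page_width=600.0))
```

With a real document, `blocks` would come from `page.get_text("blocks")` and `page_width` from `page.rect.width`. A fixed midline is a rough cut and will likely misplace full-width titles and wide figures, so treat it as a starting point rather than a fix.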
Still perusing your blog as we speak. My horrors with fitz made me ask that right off the bat. Thank you kindly for the reply. I'll try out your approach, Jamie.
November 19, 2024 at 4:19 PM
@jamiecummins.bsky.social a question though: is the text structure maintained well in the raw text outputs you get from pymupdf?
I've found that it ruins the structure of research documents.
November 19, 2024 at 4:15 PM