Lightnews — Scholar-powered news

Sora

@aoi-sora-2000.bsky.social

54 followers 230 following 8 posts

A quiet, reserved human interested in AI, tech, and everything sane. 🔒🪑

Sora

@aoi-sora-2000.bsky.social

I have a fairly broad spread of article types, so I honestly don't expect my implementation to work with 100% accuracy. I have yet to see what improvements the recent update provides. I'll keep you apprised if I find a solution to my problems with fitz alone.

November 20, 2024 at 6:45 PM

Sora

@aoi-sora-2000.bsky.social

1. Issues in the placement of figure captions below the figures
2. Fails most of the time with multicolumn PDFs
3. Fails with complex infographics
4. Fails with equations

That said, for a regular single-column research paper, it's very good.

November 20, 2024 at 6:43 PM

Sora

@aoi-sora-2000.bsky.social

Also, PyMuPDF got some remarkable upgrades (in case you haven't seen the docs recently)
pymupdf.readthedocs.io/en/latest/

PyMuPDF 1.24.14 documentation

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pymupdf.readthedocs.io

November 20, 2024 at 6:37 PM

Sora

@aoi-sora-2000.bsky.social

Hey @jamiecummins.bsky.social, your approach has largely solved the issue of format preservation, although about 25-30% of content is still poorly extracted.
Still working on the other pet project - DocOwl.

November 20, 2024 at 6:36 PM

Sora

@aoi-sora-2000.bsky.social

I concur. For academic papers, GROBID is the best, although it falters with metadata occasionally.
I'm currently trying out something completely different:
github.com/X-PLUG/mPLUG...

I'll first try your recommended approach. If that works for fitz, it'll save me a ton of time and new explorations.

mPLUG-DocOwl/DocOwl2 at main · X-PLUG/mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding - X-PLUG/mPLUG-DocOwl

github.com

November 19, 2024 at 4:28 PM

Sora

@aoi-sora-2000.bsky.social

I'd used fitz as is without any coordinate matching. I'll test it out and share my findings here. Fitz is otherwise superb when it comes to processing speed and overall extraction quality.

November 19, 2024 at 4:20 PM
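
For readers unfamiliar with what using block coordinates might look like, here is a minimal sketch. It is not necessarily the approach recommended in the thread, which isn't spelled out here. It assumes text blocks arrive as tuples in the layout that PyMuPDF's `page.get_text("blocks")` returns, and it assumes equal-width columns; `order_blocks` and its parameters are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Tuple layout mirrors PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def order_blocks(blocks: List[Block], page_width: float, columns: int = 2) -> List[str]:
    """Return block texts in column-major reading order (hypothetical helper)."""
    column_width = page_width / columns

    def sort_key(block: Block):
        x0, y0, x1 = block[0], block[1], block[2]
        if (x1 - x0) > column_width:
            # Wider than one column (e.g. a title): group with the first column.
            column = 0
        else:
            # Assign the block to the column containing its horizontal midpoint.
            column = min(int(((x0 + x1) / 2) // column_width), columns - 1)
        return (column, y0, x0)

    text_blocks = [b for b in blocks if b[6] == 0]  # block_type 0 means text
    return [b[4].strip() for b in sorted(text_blocks, key=sort_key)]


# Possible usage with PyMuPDF ("paper.pdf" is a placeholder file name):
#   import pymupdf
#   page = pymupdf.open("paper.pdf")[0]
#   print("\n\n".join(order_blocks(page.get_text("blocks"), page.rect.width)))
```

This simple ordering will still misplace full-width blocks that sit in the middle of a page, and it does nothing for equations or complex infographics.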

Sora

@aoi-sora-2000.bsky.social

Still perusing your blog as we speak. My horrors with fitz made me ask that right off the bat. Thank you kindly for the reply. I'll try out your approach, Jamie.

November 19, 2024 at 4:19 PM

Sora

@aoi-sora-2000.bsky.social

@jamiecummins.bsky.social a question though: is the text structure maintained well in the raw text outputs you get from pymupdf?
I've found that it ruins the structure of research documents.

November 19, 2024 at 4:15 PM
