Brian Naughton
@btnaughton.bsky.social
genetics/data/programming. ex-Hexagon, ex-Stanford, ex-23andMe, ex-TCD http://blog.booleanbiotech.com 🇮🇪
I was conserving characters but Sally Smith Hughes wrote “Genentech: The Beginnings of Biotech”
October 31, 2025 at 3:45 PM
Reposted by Brian Naughton
Pretty interesting that AFAICT the filtering was done after the fact (so, library 1 had no filtering). This could make it an excellent dataset for training/testing filters/rankers. Too bad it looks like the dataset is not public
September 29, 2025 at 7:40 PM
It's not too hard, though benchmark.py has too much (AI-generated) code:
- add an entry to the yaml file
- add image1.txt through image6.txt to that output dir
- run benchmark.py (a sketch of the whole workflow is below)
I would still manually check: there are cases where more than one answer is acceptable, and the header especially can be ambiguous
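For anyone trying this, here's a minimal sketch of the workflow in Python. The yaml filename and field names ("name", "output_dir") are assumptions on my part; check the actual yaml file in the repo for the real schema before copying this.

    # Sketch of adding a new entry to the benchmark. The yaml filename and
    # field names ("name", "output_dir") are assumptions, not the real schema.
    from pathlib import Path
    import subprocess
    import yaml  # pip install pyyaml

    entry_name = "my_new_model"              # hypothetical entry name
    output_dir = Path("outputs") / entry_name

    # 1. add an entry to the yaml file
    config_path = Path("benchmark.yaml")     # hypothetical filename
    entries = yaml.safe_load(config_path.read_text()) or []
    entries.append({"name": entry_name, "output_dir": str(output_dir)})
    config_path.write_text(yaml.safe_dump(entries))

    # 2. add image1.txt through image6.txt to that output dir
    output_dir.mkdir(parents=True, exist_ok=True)
    for i in range(1, 7):
        # each file holds the model's transcription of image{i}
        (output_dir / f"image{i}.txt").write_text(f"transcription of image {i}\n")

    # 3. run benchmark.py
    subprocess.run(["python", "benchmark.py"], check=True)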
May 31, 2025 at 1:30 PM
In case anyone wants to know, the winner was Google Translate. For the single error I found (~1 in 1000 aas), crucially the font got messed up (serif -> sans serif), so there was a clue that the OCR should not be trusted for that sequence.
May 21, 2025 at 5:29 PM
Mistral OCR and Google Translate both work much better than Gemini/Claude/GPT (recommended by @josiezayner @draparente on X)
However, in both cases I stopped checking after the first error, around 500 aas in. Mistral's was G->C ("transversion"); Google Translate's was SSGGG -> SSSGG ("transcription slippage"!)
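A quick way to catch discrepancies like these is to diff the two transcriptions against each other (or against a trusted reference) rather than eyeballing them. A minimal sketch, assuming plain one-letter amino acid strings; the sequences here are toy examples, not the real ones:

    # Flag every region where two OCR transcriptions of the same amino acid
    # sequence disagree: catches substitutions like G->C as well as
    # slippage-style errors like SSGGG -> SSSGG.
    from difflib import SequenceMatcher

    def diff_transcriptions(seq_a: str, seq_b: str) -> None:
        """Print each opcode where the two transcriptions differ."""
        matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
        for op, a0, a1, b0, b1 in matcher.get_opcodes():
            if op != "equal":
                print(f"{op} at {a0}-{a1}: {seq_a[a0:a1]!r} vs {seq_b[b0:b1]!r}")

    # Toy example reproducing the slippage error (hypothetical sequence):
    read_a = "MKTAYIAKQRSSGGGQIS"   # original SSGGG run
    read_b = "MKTAYIAKQRSSSGGQIS"   # OCR'd as SSSGG
    diff_transcriptions(read_a, read_b)

Of course this only tells you where two OCR engines disagree; agreement doesn't prove either is right, but the disagreements are exactly the spots worth re-checking against the scan.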
May 21, 2025 at 3:52 AM