Brian Naughton
@btnaughton.bsky.social
genetics/data/programming. ex-Hexagon, ex-Stanford, ex-23andMe, ex-TCD http://blog.booleanbiotech.com 🇮🇪
I was conserving characters but Sally Smith Hughes wrote “Genentech: The Beginnings of Biotech”
October 31, 2025 at 3:45 PM
Reposted by Brian Naughton
Pretty interesting that AFAICT the filtering was done after the fact (so, library 1 had no filtering). This could make it an excellent dataset for training/testing filters/rankers. Too bad it looks like the dataset is not public
September 29, 2025 at 7:40 PM
It's not too hard, though benchmark.py has too much (AI-generated) code:
- add an entry to the yaml file
- add image1.txt through image6.txt to that output dir
- run benchmark.py (a sketch of the whole workflow is below)
I would still manually check: there are cases where more than one answer is acceptable, and the header especially can be ambiguous
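For anyone trying this, here's a minimal sketch of the workflow in Python. The yaml filename and field names ("name", "output_dir") are assumptions on my part; check the actual yaml file in the repo for the real schema before copying this.

    # Sketch of adding a new entry to the benchmark. The yaml filename and
    # field names ("name", "output_dir") are assumptions, not the real schema.
    from pathlib import Path
    import subprocess
    import yaml  # pip install pyyaml

    entry_name = "my_new_model"              # hypothetical entry name
    output_dir = Path("outputs") / entry_name

    # 1. add an entry to the yaml file
    config_path = Path("benchmark.yaml")     # hypothetical filename
    entries = yaml.safe_load(config_path.read_text()) or []
    entries.append({"name": entry_name, "output_dir": str(output_dir)})
    config_path.write_text(yaml.safe_dump(entries))

    # 2. add image1.txt through image6.txt to that output dir
    output_dir.mkdir(parents=True, exist_ok=True)
    for i in range(1, 7):
        # each file holds the model's transcription of image{i}
        (output_dir / f"image{i}.txt").write_text(f"transcription of image {i}\n")

    # 3. run benchmark.py
    subprocess.run(["python", "benchmark.py"], check=True)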
May 31, 2025 at 1:30 PM
In case anyone wants to know, the winner was Google Translate. For the single error I found (~1 in 1000 aas), crucially the font got messed up (serif -> sans serif), so there was a clue that the OCR should not be trusted for that sequence.
May 21, 2025 at 5:29 PM
Mistral OCR and Google Translate both work much better than Gemini/Claude/GPT (recommended by @josiezayner @draparente on X)
However, in both cases I stopped checking after the first error, around 500 aas in. Mistral's was G->C ("transversion"); Google Translate's was SSGGG -> SSSGG ("transcription slippage"!)
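A quick way to catch discrepancies like these is to diff the two transcriptions against each other (or against a trusted reference) rather than eyeballing them. A minimal sketch, assuming plain one-letter amino acid strings; the sequences here are toy examples, not the real ones:

    # Flag every region where two OCR transcriptions of the same amino acid
    # sequence disagree: catches substitutions like G->C as well as
    # slippage-style errors like SSGGG -> SSSGG.
    from difflib import SequenceMatcher

    def diff_transcriptions(seq_a: str, seq_b: str) -> None:
        """Print each opcode where the two transcriptions differ."""
        matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
        for op, a0, a1, b0, b1 in matcher.get_opcodes():
            if op != "equal":
                print(f"{op} at {a0}-{a1}: {seq_a[a0:a1]!r} vs {seq_b[b0:b1]!r}")

    # Toy example reproducing the slippage error (hypothetical sequence):
    read_a = "MKTAYIAKQRSSGGGQIS"   # original SSGGG run
    read_b = "MKTAYIAKQRSSSGGQIS"   # OCR'd as SSSGG
    diff_transcriptions(read_a, read_b)

Of course this only tells you where two OCR engines disagree; agreement doesn't prove either is right, but the disagreements are exactly the spots worth re-checking against the scan.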
May 21, 2025 at 3:52 AM