Lightnews — Scholar-powered news

Florian Huber

@me-datapoint.bsky.social


Professor for data science at HSD, @zdd-hsd.bsky.social
| ML fan & critic | current research mostly #datascience, #machinelearning, #cheminformatics #dataviz #nlp | ✨ #openscience #openaccess #rse | living data point 🚲


Florian Huber

@me-datapoint.bsky.social

Special thanks to @julianpollmann.bsky.social and Niek de Jonge for code and code reviews!

GitHub: github.com/matchms/matc...

#opensource #RSE #researchsoftwareengineering

GitHub - matchms/matchms: Python library for processing (tandem) mass spectrometry data and for computing spectral similarities.


October 6, 2025 at 4:00 PM

Florian Huber

@me-datapoint.bsky.social

Great post!

We noticed the same thing, which prompted us to point out some pitfalls of various fingerprints --> www.biorxiv.org/content/10.1...


July 17, 2025 at 11:40 AM

Florian Huber

@me-datapoint.bsky.social

4/4
We also highlight options for count fingerprints, such as log-counts and IDF-weighted counts. The latter can be used to tune bit importance to a dataset of your choice.

One example use case is chemical space visualization.
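
As a rough illustration of the log-count and IDF-weighted options for count fingerprints, the sketch below applies both to toy sparse fingerprints. It assumes the standard text-retrieval form idf = log(N / df); the preprint may define the weighting differently. The fingerprints (dicts mapping bit index to count), the dataset, and all function names are hypothetical.

```python
import math
from collections import Counter


def log_counts(fp):
    """Dampen large counts with c -> log(1 + c)."""
    return {bit: math.log1p(count) for bit, count in fp.items()}


def idf_weights(fingerprints):
    """Per-bit IDF, assumed here as log(N / df).

    df is the number of fingerprints in the dataset that contain the bit.
    """
    n = len(fingerprints)
    df = Counter(bit for fp in fingerprints for bit in fp)
    return {bit: math.log(n / d) for bit, d in df.items()}


def idf_weighted(fp, weights):
    """Scale each count by the IDF weight of its bit (unseen bits get 0)."""
    return {bit: count * weights.get(bit, 0.0) for bit, count in fp.items()}


# Toy dataset (made up): bit 1 occurs in every fingerprint, bit 9 in only one.
dataset = [
    {1: 3, 7: 1},
    {1: 1, 9: 2},
    {1: 2, 7: 4},
]

weights = idf_weights(dataset)
weighted = [idf_weighted(fp, weights) for fp in dataset]
```

Under this assumption, a bit that is set in every compound (bit 1 here) gets weight zero, while rare bits dominate. Computing the weights on a different dataset shifts bit importance accordingly.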

Preprint: www.biorxiv.org/content/10.1...

Chemical Space Visualizations using UMAP and various molecular fingerprints.

June 23, 2025 at 9:22 AM

Florian Huber

@me-datapoint.bsky.social

3/4
A huge issue is bit collisions.
Fingerprints with a high bit occupation (RDKit, MAP4) often lead to (1) arbitrary misinterpretations, (2) shifts to high Tanimoto scores, (3) very different handling of small and large molecules.

--> Consider using sparse fingerprints!
--> Morgan >> MAP4 / RDKit
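
To make the collision effect concrete, here is a minimal sketch that folds sparse feature sets into a fixed-length bit vector and compares Tanimoto scores before and after. Folding by modulo is a simplification chosen for illustration, not necessarily how RDKit or MAP4 fold internally. The feature IDs and function names are made up.

```python
def fold(features, n_bits):
    """Fold sparse feature IDs into n_bits positions (simplified: modulo)."""
    return {f % n_bits for f in features}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Two hypothetical molecules that share no features at all.
mol_a = {3, 70, 1029}
mol_b = {1027, 2118, 9}

sparse_score = tanimoto(mol_a, mol_b)                          # 0.0
folded_score = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))  # 0.5
```

In this toy case, features 1027 and 2118 land on the same positions as 3 and 70 after folding. The two molecules then appear to share half of their bits although they have no feature in common. The more bits a fingerprint occupies, the more often this happens.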

June 23, 2025 at 9:22 AM

Florian Huber

@me-datapoint.bsky.social

2/4
We focused on weaknesses of the fingerprints.
Many show frequent duplicates, i.e. the same fingerprint for different compounds. Most problematic: *very* different compounds can end up with identical fingerprints.

- MAP4 >> Morgan-type >> Daylight
- count >> binary
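
A duplicate check of this kind can be sketched by grouping compounds by their fingerprint. The toy example below is not the benchmark from the preprint. It only shows why a binary fingerprint can collapse compounds that the count version still separates. Compound names, bit indices, and function names are hypothetical.

```python
from collections import defaultdict


def duplicate_groups(fingerprints):
    """Return groups of compound names that share an identical fingerprint."""
    groups = defaultdict(list)
    for name, fp in fingerprints.items():
        key = tuple(sorted(fp.items()))  # hashable form of the sparse fingerprint
        groups[key].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def binarize(fp):
    """Drop the counts and keep only which bits are set."""
    return {bit: 1 for bit in fp}


# Made-up count fingerprints: A and B set the same bits with different counts.
count_fps = {
    "compound_A": {12: 1, 40: 2},
    "compound_B": {12: 1, 40: 5},
    "compound_C": {7: 1},
}
binary_fps = {name: binarize(fp) for name, fp in count_fps.items()}

count_duplicates = duplicate_groups(count_fps)    # []
binary_duplicates = duplicate_groups(binary_fps)  # [["compound_A", "compound_B"]]
```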

#cheminformatics

Benchmarking plot on fingerprint duplications.

June 23, 2025 at 9:22 AM
