Benjamin Laufer
laufer.bsky.social
PhD student at Cornell Tech.

bendlaufer.github.io
Thanks Suresh! It’s still work in progress so let us know if you have any feedback.
August 15, 2025 at 12:14 AM
It should be in the thread somewhere!

arxiv.org/pdf/2508.06811
August 14, 2025 at 4:36 PM
We describe these details in the paper, e.g. in the schematic below.
August 14, 2025 at 4:26 PM
One could imagine a future scenario in which merges lead to a phase transition in the HF graph, where all families end up "marrying" and form a single connected component. We're not there yet.
August 14, 2025 at 4:26 PM
Thanks! Yes, model merges are akin to sexual reproduction in that there are multiple parents, so any graph that includes merges is not a tree. All other relations have a single parent. Merges are a small minority of relations as of now, and we leave the dynamics arising from them for future work.
August 14, 2025 at 4:26 PM
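To make the two replies above concrete, here is a minimal sketch of how a merge breaks the tree property and how merges could join separate families into one connected component. The edge list and model names are hypothetical, and the union-find helper is an illustrative choice, not necessarily what the paper's codebase does.

```python
from collections import defaultdict

# Hypothetical (parent, child) relations; all model names are made up.
edges = [
    ("base-a", "a-finetune"),
    ("base-a", "a-quantized"),
    ("base-b", "b-finetune"),
    ("a-finetune", "merged-ab"),  # a merge: the child has two parents
    ("b-finetune", "merged-ab"),
]

def multi_parent_nodes(edges):
    """Return children with more than one parent (these make the graph a non-tree)."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return sorted(c for c, ps in parents.items() if len(ps) > 1)

def count_components(edges):
    """Count weakly connected components using a small union-find."""
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            root[ra] = rb
    return len({find(x) for x in list(root)})

# Drop the merge edges to compare against the merge-free graph.
merge_free = [e for e in edges if e[1] != "merged-ab"]

print(multi_parent_nodes(edges))     # nodes that violate the single-parent rule
print(count_components(merge_free))  # families without merges
print(count_components(edges))       # families once the merge "marries" them
```

In this toy example the two families stay separate until the merge edge is added, at which point they collapse into one component.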
Oops, empty tag. My collaborator's name is Hamidah Oderinwale.
August 14, 2025 at 3:08 PM
💻Our codebase for the analyses in the paper: github.com/bendlaufer/a...
GitHub - bendlaufer/ai-ecosystem: Gathering and analysis of a snapshot of all models on Hugging Face.
August 14, 2025 at 3:06 PM
🗄️Full dataset is here, for those interested in toying around with it: huggingface.co/datasets/mod...
modelbiome/ai_ecosystem_withmodelcards · Datasets at Hugging Face
August 14, 2025 at 3:06 PM
This is just the start. We hope our dataset & methods open the door to a science of AI ecosystems.
If you care about open-source AI, governance, or the weird ways technology evolves, give it a read.
📄Paper: arxiv.org/pdf/2508.06811
August 14, 2025 at 3:06 PM
Big picture: By treating ML models like organisms in an ecosystem, we can:
🌱 Understand the pressures shaping AI development
🔍 Spot patterns before they become industry norms
🛠 Inform governance & safety strategies grounded in real data
August 14, 2025 at 3:06 PM
We found optimal evolutionary orderings over traits:
🔹 Feature extraction tends to be upstream from text generation. Text generation is upstream from text classification.
🔹 Certain license types precede others (e.g., llama3 → apache-2.0)
Here we show the top-20 license transitions over fine-tunes.
August 14, 2025 at 3:06 PM
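A rough sketch of how license transitions over fine-tunes could be tallied. The pairs below are invented for illustration, and skipping unchanged licenses is an assumption of this sketch rather than something the post states.

```python
from collections import Counter

# Hypothetical fine-tune edges as (parent_license, child_license) pairs.
# The counts are made up; only the license tag names are real.
finetune_license_pairs = [
    ("llama3", "apache-2.0"),
    ("llama3", "apache-2.0"),
    ("llama3", "llama3"),
    ("apache-2.0", "mit"),
    ("apache-2.0", "apache-2.0"),
]

def top_license_transitions(pairs, k=20):
    """Return the k most common transitions where the license actually changes."""
    changes = Counter((parent, child) for parent, child in pairs if parent != child)
    return changes.most_common(k)

for (parent, child), n in top_license_transitions(finetune_license_pairs):
    print(f"{parent} -> {child}: {n}")
```

Ranking these counts gives a table of the most frequent transitions, which is the kind of summary a top-20 figure would display.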
The license drift to permissiveness suggests open-source preferences outweigh regulatory pressures to comply with licenses.
The English drift suggests a massive market for English products.
The docs drift could be explained as a preference for efficiency — or laziness.
August 14, 2025 at 3:06 PM
Three major drifts we found:
1️⃣ Licenses: from corporate to other types. We often see use restrictions mutate into permissive or copyleft licenses (even when this runs counter to upstream license terms).
2️⃣ Languages: from multilingual → English-only
3️⃣ Docs: from long & detailed → short & templated
August 14, 2025 at 3:06 PM
In biology, traits get passed from parent to child — mutations are slow & often modeled as random.

In AI model families, mutations are fast and directed. Two sibling models tend to resemble each other more than they resemble their shared parent.
August 14, 2025 at 3:06 PM
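One way to picture the sibling-versus-parent comparison is with a simple text similarity over metadata snippets. This sketch uses Jaccard similarity over word sets as a stand-in; the paper's actual similarity measure may differ, and the three snippets are invented.

```python
def tokens(text):
    """Lowercase, whitespace-split word set for a metadata snippet."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two snippets' word sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical metadata snippets for one parent and two of its fine-tunes.
parent = "multilingual base model with detailed documentation license llama3"
sibling_1 = "english chat model short template license apache-2.0"
sibling_2 = "english instruct model short template license apache-2.0"

sibling_sim = jaccard(sibling_1, sibling_2)
parent_sim = (jaccard(parent, sibling_1) + jaccard(parent, sibling_2)) / 2

print(f"sibling-sibling: {sibling_sim:.2f}")
print(f"parent-child (mean): {parent_sim:.2f}")
```

In this toy case both children drift the same way (English-only, templated docs, a different license), so they end up closer to each other than to their shared parent.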
We measured “genetic similarity” between models from snippets of text - the metadata and model cards.

Models in the same finetuning family do resemble each other… but the evolution is weird. For example, traits drift in the same directions again and again.
August 14, 2025 at 3:06 PM
We reconstructed model family trees by tracing fine-tunes, adaptations, quantizations and merges.

Some trees are small: one parent, a few children. Others sprawl into thousands of descendants across ten+ generations.
August 14, 2025 at 3:06 PM
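A small sketch of how a family tree's size and depth could be measured once parent-child relations are known. The edge list is hypothetical, and the sketch assumes single-parent relations only (no merges), so each family really is a tree.

```python
from collections import defaultdict

# Hypothetical (parent, child) relations for one family; names are made up.
edges = [
    ("base", "ft-1"),
    ("base", "ft-2"),
    ("ft-1", "ft-1-quant"),
    ("ft-1-quant", "ft-1-quant-adapter"),
]

def build_children(edges):
    """Map each parent to the list of its direct children."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return children

def descendants_and_depth(root, children):
    """Return (number of descendants, generations below root) via iterative DFS."""
    count, max_depth = 0, 0
    seen = {root}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        max_depth = max(max_depth, depth)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                count += 1
                stack.append((child, depth + 1))
    return count, max_depth

children = build_children(edges)
print(descendants_and_depth("base", children))
```

Running the same function from every root model would separate the small families from the sprawling ones.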