Caleb Fahlgren
calebfahlgren.hf.co
SWE @hf.co
You can just ask things 🗣️

"show me messages in the coding category that are in the top 10% of reward model scores"

Download high-quality instructions from the Argilla Llama3.1 405B synthetic dataset 🔥
December 4, 2024 at 8:54 AM
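The question above boils down to a filter plus a percentile cutoff. A minimal sketch in plain Python, assuming each row has hypothetical `category` and `score` fields (the real dataset's column names are not given in the post) and assuming the top 10% is computed within the category rather than over the whole dataset:

```python
def top_fraction(rows, category, fraction=0.10):
    """Return rows in `category` whose score falls in the top `fraction` of that category.

    `category` and `score` are hypothetical field names used for illustration.
    """
    in_cat = [r for r in rows if r["category"] == category]
    if not in_cat:
        return []
    scores = sorted(r["score"] for r in in_cat)
    # Number of rows to keep, at least one so a tiny category still returns something.
    keep = max(1, round(len(scores) * fraction))
    # Smallest score that still counts as "top fraction"; ties at the cutoff are kept.
    cutoff = scores[-keep]
    return [r for r in in_cat if r["score"] >= cutoff]


# Toy data: 20 coding messages with scores 0..19, plus a few from another category.
rows = [{"category": "coding", "score": float(i)} for i in range(20)]
rows += [{"category": "math", "score": 100.0} for _ in range(5)]

best = top_fraction(rows, "coding")
print([r["score"] for r in best])
```

A natural-language-to-SQL tool would presumably emit an equivalent query with a quantile function; this sketch only shows the logic the question implies.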
Reposted by Caleb Fahlgren
Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: aiworld.eu/embed/model/...
Discussion: huggingface.co/spaces/huggi...
December 4, 2024 at 8:37 AM
The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨
December 2, 2024 at 12:48 PM
This is insane! Structured generation in the browser with the new @hf.co SmolLM2-1.7B model

• Tiny 1.7B LLM running at 88 tokens / second ⚡
• Powered by MLC/WebLLM on WebGPU 🔥
• JSON Structured Generation entirely in the browser 🤏
November 29, 2024 at 11:18 AM
Reposted by Caleb Fahlgren
Releasing SmolVLM, a small 2-billion-parameter Vision+Language Model (VLM) built for on-device/in-browser inference with images/videos.

Outperforms all models at similar GPU RAM usage and token throughput

Blog post: huggingface.co/blog/smolvlm
November 26, 2024 at 4:58 PM
The OpenLLM Leaderboard just passed 2k evals 🥳

Here's a look at the distribution of average scores for all those models!

Great work by the @huggingface.bsky.social team to do these evals!
November 26, 2024 at 12:55 PM
Automatically tracking all Ollama requests to a dataset with the new observers Python library!

With just a few lines of code, all your requests can be sent to @huggingface.bsky.social datasets for annotation, analysis, and observability 🔭
November 21, 2024 at 8:12 PM
observers 🔭 - automatically log all OpenAI compatible requests to a dataset 💽

• supports any OpenAI compatible endpoint 💪
• supports @duckdb.org, @huggingface.bsky.social datasets and Argilla as stores

> pip install observers
November 21, 2024 at 8:06 PM
SmolTalk is out 🗣️

Over 1M high-quality instructions used for training SmolLM2, one of the best small language models in the industry.

huggingface.co/datasets/Hug...
HuggingFaceTB/smoltalk · Datasets at Hugging Face
November 21, 2024 at 2:56 PM
Reposted by Caleb Fahlgren
Observers: A Lightweight SDK for AI Observability

TL;DR:
- Track and record interactions with AI models
- Store observations in multiple backends @huggingface.bsky.social, @duckdb.org or Argilla
- Query and analyse your AI interactions with ease

GitHub:
github.com/cfahlgren1/o...
November 21, 2024 at 10:29 AM
Reposted by Caleb Fahlgren
Foursquare just open sourced their 100 million place point of interest dataset! Some notes on poking around with it using DuckDB (it's Parquet files on S3) simonwillison.net/2024/Nov/20/...
Foursquare Open Source Places: A new foundational dataset for the geospatial community
I did not expect this! > [...] we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places ("FSQ OS Places"). This base layer …
November 20, 2024 at 6:08 AM
Range requests + Parquet are what make it possible for the Hugging Face SQL Console to query datasets entirely in the browser
HTTP range requests are one of those interesting technologies that have almost universal support without most people even being aware of them - because you can't do streaming video formats with the ability to skip ahead without them
Because it can run queries directly against Parquet files over HTTP without needing to download the whole file - it uses HTTP range requests to fetch just the bits of the file it needs to answer the question
November 21, 2024 at 6:59 AM
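To make the range-request idea concrete: a Parquet file ends with its metadata footer, then a 4-byte little-endian footer length, then the magic bytes `PAR1`. A reader can therefore fetch the last 8 bytes, learn the footer size, and fetch only the footer before deciding which data ranges it needs. The sketch below simulates this against an in-memory byte string instead of a real server; `fetch_range` stands in for an HTTP GET with a `Range: bytes=start-end` header, and the footer is treated as opaque bytes (a real footer is Thrift-encoded metadata). All helper names are illustrative, not taken from DuckDB or the SQL Console.

```python
import struct

MAGIC = b"PAR1"


def make_fake_file(payload: bytes, footer: bytes) -> bytes:
    """Build a byte string with the same outer layout as a Parquet file."""
    return MAGIC + payload + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(file_size: int, fetch) -> bytes:
    """Read only the footer using two ranged reads.

    `fetch(start, end)` must return bytes start..end inclusive,
    matching the semantics of `Range: bytes=start-end`.
    """
    # Read 1: the trailing 8 bytes (footer length + magic).
    tail = fetch(file_size - 8, file_size - 1)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic, not a Parquet-style file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    # Read 2: just the footer, located immediately before the trailing 8 bytes.
    start = file_size - 8 - footer_len
    return fetch(start, start + footer_len - 1)


# Simulated remote file: 1000 bytes of column data and a small footer.
footer = b"schema+row-group-offsets"
blob = make_fake_file(b"x" * 1000, footer)

transferred = 0


def fetch_range(start: int, end: int) -> bytes:
    """Stand-in for an HTTP range request; counts the bytes 'downloaded'."""
    global transferred
    chunk = blob[start:end + 1]
    transferred += len(chunk)
    return chunk


result = read_footer(len(blob), fetch_range)
print(result, transferred, len(blob))
```

Only the trailing bytes and the footer are transferred here; the 1000 bytes of column data are never touched, which is the property that makes querying large remote files from a browser practical.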
Reposted by Caleb Fahlgren
duckdb-gsheets v0.0.3 is out, courtesy of @a13x.bsky.social

the power is terrifying! duckdb-gsheets.com
November 21, 2024 at 3:51 AM
Reposted by Caleb Fahlgren
When XetHub joined Hugging Face, we brainstormed how to share our tech with the community.

The magic? Versioning chunks, not files, giving rise to:

🧠 Smarter storage
⏩ Faster uploads
🚀 Efficient downloads

Curious? Read the blog and let us know how it could help your workflows!
From Files to Chunks: Improving HF Storage Efficiency
November 20, 2024 at 6:51 PM
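A toy illustration of why versioning chunks instead of files saves storage and upload time: if each file is split into chunks addressed by their hash, a new version only needs to store the chunks that changed. The sketch below uses fixed-size chunks for simplicity, which is an assumption made for illustration; the post does not describe the actual chunking scheme, and the `ChunkStore` class is hypothetical.

```python
import hashlib


def split_chunks(data: bytes, size: int) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical content-addressed store: chunk hash -> chunk bytes."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}

    def put(self, data: bytes):
        """Store a file version; return its manifest and how many chunks were new."""
        manifest = []
        new_chunks = 0
        for piece in split_chunks(data, self.chunk_size):
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in self.chunks:
                # Only unseen chunks cost storage (and, in a real system, upload).
                self.chunks[digest] = piece
                new_chunks += 1
            manifest.append(digest)
        return manifest, new_chunks

    def get(self, manifest) -> bytes:
        """Rebuild a file version from its manifest."""
        return b"".join(self.chunks[d] for d in manifest)


store = ChunkStore(chunk_size=4)
v1 = b"aaaabbbbcccc"
v2 = b"aaaabbbbdddd"  # only the last chunk differs from v1

manifest1, new1 = store.put(v1)
manifest2, new2 = store.put(v2)
print(new1, new2, len(store.chunks))
```

Fixed-size chunks break down when bytes are inserted in the middle of a file, since every later chunk shifts; that limitation is why real systems tend to pick chunk boundaries from the content itself.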
Life would be so easy if @duckdb.org had an LLMs.txt 🤩
llmstxt.org
The /llms.txt file – llms-txt
A proposal to standardise on using an /llms.txt file to provide information to help LLMs use a website at inference time.
November 20, 2024 at 10:57 AM