Lightnews — Scholar-powered news

Caleb Fahlgren

@calebfahlgren.hf.co

You can just ask things 🗣️

"show me messages in the coding category that are in the top 10% of reward model scores"

Download really high quality instructions from the Argilla Llama3.1 405B synthetic dataset 🔥

December 4, 2024 at 8:54 AM
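
For readers wondering what a question like the one above turns into, here is a minimal sketch of a "top 10% of reward scores within the coding category" query. It uses Python's built-in sqlite3 as a stand-in engine, and the table name `messages` and the columns `content`, `category`, and `reward_score` are hypothetical; the real dataset schema and the SQL the console actually generates may differ.

```python
import sqlite3

# Hypothetical schema: messages(content, category, reward_score).
# PERCENT_RANK() needs SQLite 3.25+ (window function support).
TOP_DECILE_SQL = """
WITH ranked AS (
    SELECT content,
           reward_score,
           PERCENT_RANK() OVER (ORDER BY reward_score) AS pct
    FROM messages
    WHERE category = 'coding'
)
SELECT content, reward_score
FROM ranked
WHERE pct >= 0.9
ORDER BY reward_score DESC
"""


def top_decile_coding(rows):
    """Return coding messages whose reward score ranks in the top 10%."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (content TEXT, category TEXT, reward_score REAL)"
    )
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    result = conn.execute(TOP_DECILE_SQL).fetchall()
    conn.close()
    return result


# Ten coding messages scored 0..9, plus one high-scoring row from another category.
sample = [(f"code-{i}", "coding", float(i)) for i in range(10)]
sample.append(("math-0", "math", 100.0))
print(top_decile_coding(sample))
```

The percentile is computed after filtering to the coding category, which is one reading of the question; ranking across all categories first would be an equally valid interpretation.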

Reposted by Caleb Fahlgren

Thomas Wolf

@thomwolf.bsky.social

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: aiworld.eu/embed/model/...
Discussion: huggingface.co/spaces/huggi...

December 4, 2024 at 8:37 AM

Caleb Fahlgren

@calebfahlgren.hf.co

The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨

December 2, 2024 at 12:48 PM

Caleb Fahlgren

@calebfahlgren.hf.co

This is insane! Structured generation in the browser with the new @hf.co SmolLM2-1.7B model

• Tiny 1.7B LLM running at 88 tokens / second ⚡
• Powered by MLC/WebLLM on WebGPU 🔥
• JSON Structured Generation entirely in the browser 🤏

November 29, 2024 at 11:18 AM

Reposted by Caleb Fahlgren

Thomas Wolf

@thomwolf.bsky.social

Releasing SmolVLM, a small 2-billion-parameter Vision+Language Model (VLM) built for on-device/in-browser inference with images/videos.

Outperforms all models at similar GPU RAM usage and token throughput

Blog post: huggingface.co/blog/smolvlm

November 26, 2024 at 4:58 PM

Caleb Fahlgren

@calebfahlgren.hf.co

The OpenLLM Leaderboard just passed 2k evals 🥳

Here's a look at the distribution of average scores for all those models!

Great work by the @huggingface.bsky.social team to do these evals!

November 26, 2024 at 12:55 PM

Caleb Fahlgren

@calebfahlgren.hf.co

Automatically tracking all Ollama requests to a dataset with the new observers python library!

With just a few lines of code, all your requests can be sent to @huggingface.bsky.social datasets for annotation, analysis, and observability 🔭

November 21, 2024 at 8:12 PM

Caleb Fahlgren

@calebfahlgren.hf.co

observers 🔭 - automatically log all OpenAI compatible requests to a dataset 💽

• supports any OpenAI compatible endpoint 💪
• supports @duckdb.org, @huggingface.bsky.social datasets and Argilla as stores

> pip install observers

November 21, 2024 at 8:06 PM

Caleb Fahlgren

@calebfahlgren.hf.co

SmolTalk is out 🗣️

Over 1M high quality instructions used for training SmolLM2, one of the best small language models in the industry.

huggingface.co/datasets/Hug...

HuggingFaceTB/smoltalk · Datasets at Hugging Face

huggingface.co

November 21, 2024 at 2:56 PM

Reposted by Caleb Fahlgren

David Berenstein

@davidberenstein.bsky.social

Observers: A Lightweight SDK for AI Observability

TL;DR:
- Track and record interactions with AI models
- Store observations in multiple backends @huggingface.bsky.social, @duckdb.org or Argilla
- Query and analyse your AI interactions with ease

GitHub:
github.com/cfahlgren1/o...

November 21, 2024 at 10:29 AM

Reposted by Caleb Fahlgren

Simon Willison

@simonwillison.net

Foursquare just open sourced their 100 million place point of interest dataset! Some notes on poking around with it using DuckDB (it's Parquet files on S3) simonwillison.net/2024/Nov/20/...

Foursquare Open Source Places: A new foundational dataset for the geospatial community

I did not expect this! > [...] we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places ("FSQ OS Places"). This base layer …

simonwillison.net

November 20, 2024 at 6:08 AM

Caleb Fahlgren

@calebfahlgren.hf.co

Range requests + Parquet are what make it possible for the Hugging Face SQL Console to query datasets entirely in the browser

Simon Willison @simonwillison.net · Nov 20

HTTP range requests are one of those interesting technologies that have almost universal support without most people even being aware of them - because you can't do streaming video formats with the ability to skip ahead without them

Simon Willison @simonwillison.net · Nov 20

Because it can run queries directly against Parquet files over HTTP without needing to download the whole file - it uses HTTP range requests to fetch just the bits of the file it needs to answer the question

November 21, 2024 at 6:59 AM
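
To make the range-request idea in this thread concrete, here is a minimal sketch of the first two requests a reader could issue against a remote Parquet file. It relies on the Parquet layout, where the file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The helper names and the in-memory `serve_range` stand-in for an HTTP server are hypothetical, and this is not a description of how DuckDB or the SQL Console implement it.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def tail_range() -> dict:
    """Range header asking for only the last 8 bytes of the file."""
    return {"Range": f"bytes=-{TAIL_SIZE}"}


def footer_range(file_size: int, tail: bytes) -> dict:
    """Use the 8-byte tail to build the Range header for the footer metadata."""
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - TAIL_SIZE - footer_len
    end = file_size - TAIL_SIZE - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={start}-{end}"}


def serve_range(blob: bytes, headers: dict) -> bytes:
    """In-memory stand-in for a server that honours single-range requests."""
    spec = headers["Range"].split("=", 1)[1]
    start, _, end = spec.partition("-")
    if start == "":  # suffix form: the last N bytes
        return blob[-int(end):]
    return blob[int(start): int(end) + 1]


# Fake file: magic, 100 bytes of "row group" data, footer, footer length, magic.
footer = b"F" * 20  # stand-in for the Thrift-encoded file metadata
blob = (
    PARQUET_MAGIC + b"\x00" * 100 + footer
    + struct.pack("<I", len(footer)) + PARQUET_MAGIC
)

tail = serve_range(blob, tail_range())
metadata = serve_range(blob, footer_range(len(blob), tail))
print(len(tail) + len(metadata), "bytes fetched out of", len(blob))
```

Once the footer is parsed, a reader knows the byte offsets of each column chunk and can request only the ones a query touches, which is why the whole file never has to be downloaded.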

Reposted by Caleb Fahlgren

archie.md

@archie.sarrewood.com

duckdb-gsheets v0.0.3 is out, courtesy of @a13x.bsky.social

the power is terrifying! duckdb-gsheets.com

November 21, 2024 at 3:51 AM

Reposted by Caleb Fahlgren

jsulz

@jsulz.com

When XetHub joined Hugging Face, we brainstormed how to share our tech with the community.

The magic? Versioning chunks, not files, giving rise to:

🧠 Smarter storage
⏩ Faster uploads
🚀 Efficient downloads

Curious? Read the blog and let us know how it could help your workflows!
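
As a rough illustration of "versioning chunks, not files", here is a minimal sketch of a content-addressed chunk store. It is not the Xet implementation: it uses tiny fixed-size chunks for readability, whereas production systems typically pick chunk boundaries from the content itself so that insertions do not shift every later chunk. The `ChunkStore` class and its methods are hypothetical names.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the demo is easy to follow


class ChunkStore:
    """Stores each unique chunk once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes) -> list:
        """Store a file version and return its manifest (ordered chunk ids)."""
        manifest = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only chunks not seen before take up new storage (or upload time).
            self.chunks.setdefault(digest, chunk)
            manifest.append(digest)
        return manifest

    def get(self, manifest: list) -> bytes:
        """Rebuild a file version from its manifest."""
        return b"".join(self.chunks[d] for d in manifest)


store = ChunkStore()
v1 = store.put(b"aaaabbbbccccdddd")
v2 = store.put(b"aaaabbbbXXXXdddd")  # one chunk edited in place
print(len(store.chunks), "unique chunks stored for two 4-chunk versions")
```

Both versions stay fully recoverable from their manifests, while the second version adds only the one chunk that changed.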