Philipp Schmid
@philschmid.bsky.social
Tech Lead, LLMs at @huggingface 👨🏻‍💻 🤗 AWS ML Hero 🦸🏻 | Cloud & ML enthusiast | 📍Nuremberg | 🇩🇪 https://philschmid.de
Pinned
Hello, my name is Philipp. I am a Technical Lead at @huggingface.bsky.social, leading our partnerships with AWS, Google, Azure, and NVIDIA. 🧑🏻‍💻

I post about AI news, open models, interesting AI paper summaries, blog posts, and guides!

My blog is at www.philschmid.de

Make sure to follow! 🤗
Philschmid
Personal blog of Philipp Schmid, Technical Lead for LLMs at Hugging Face. Learn how to use the latest AI and cloud technologies, from fine-tuning LLMs with RLHF to deploying them in production.
www.philschmid.de
Code and methods are open source in a new library, “learn and search”.
Blog: huggingface.co/spaces/Huggi...

Learn and Search Repo: github.com/huggingface/...
Scaling test-time compute - a Hugging Face Space by HuggingFaceH4
Discover amazing ML apps made by the community
huggingface.co
December 17, 2024 at 7:30 AM
- Introduces DVTS, a new method that improves performance at larger compute budgets by maintaining solution diversity (sketched below)
- Using compute-optimal scaling, a Llama 3.2 3B outperforms a 70B model (22x larger) on mathematical reasoning tasks
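Roughly, DVTS splits the sampling budget into independent subtrees and expands each one greedily by verifier score, so candidates stay diverse instead of collapsing onto one path. A minimal sketch of that idea; the generator and PRM are passed in as hypothetical callables, the real recipe is in the repo linked above:

```python
from typing import Callable

def dvts(prompt: str,
         sample_steps: Callable[[str, int], list[str]],  # hypothetical: k next-step continuations
         verifier_score: Callable[[str], float],         # hypothetical: PRM score for a partial solution
         n_subtrees: int = 4, depth: int = 5, width: int = 4) -> str:
    # Split the budget into independent subtrees to preserve solution diversity,
    # then expand each subtree greedily by verifier score.
    candidates = []
    for _ in range(n_subtrees):
        prefix = prompt
        for _ in range(depth):
            steps = sample_steps(prefix, width)
            prefix = max(steps, key=verifier_score)
        candidates.append(prefix)
    # Return the best final solution across all subtrees.
    return max(candidates, key=verifier_score)
```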
December 17, 2024 at 7:30 AM
- Process Reward Models (PRMs) played a crucial role in the search process by evaluating intermediate solution steps (see the sketch below)
- Different search strategies work better for different problem difficulties: beam search for harder problems, Best-of-N for simpler ones
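As a rough illustration of what a PRM does, here is a sketch that scores every partial-solution prefix so a search can prune bad branches early. The checkpoint name is a placeholder, not the PRM used in the blog post:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; the blog post names the actual PRM used.
tokenizer = AutoTokenizer.from_pretrained("my-org/math-prm")
model = AutoModelForSequenceClassification.from_pretrained("my-org/math-prm")

def score_steps(problem: str, steps: list[str]) -> list[float]:
    """Score each partial-solution prefix, not just the final answer."""
    scores = []
    for i in range(1, len(steps) + 1):
        text = problem + "\n" + "\n".join(steps[:i])
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.append(torch.sigmoid(logits[0, 0]).item())  # P(prefix is on track)
    return scores
```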
December 17, 2024 at 7:30 AM
- Test-time compute scaling offers an alternative to training larger models by allowing smaller models to "think longer"
- Explored Best-of-N sampling, beam search, and Diverse Verifier Tree Search (DVTS); a Best-of-N sketch follows below
- Llama 3.2 1B achieved 55% accuracy on the MATH benchmark using optimal search strategies
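Here is a minimal Best-of-N sketch, assuming a small generator plus a verifier that scores complete solutions (the scorer checkpoint is hypothetical; see the repo for the real setup):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
scorer = pipeline("text-classification", model="my-org/solution-scorer")  # hypothetical verifier

def best_of_n(prompt: str, n: int = 8) -> str:
    # "Think longer": sample N candidate solutions instead of one greedy answer...
    candidates = generator(prompt, num_return_sequences=n, do_sample=True,
                           temperature=0.8, max_new_tokens=512)
    # ...then keep the candidate the verifier scores highest.
    scored = [(scorer(c["generated_text"])[0]["score"], c["generated_text"])
              for c in candidates]
    return max(scored)[1]
```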
December 17, 2024 at 7:30 AM
By scaling test-time compute, smaller models can match or even surpass the performance of larger models. Llama 3.2 3B can outperform Llama 3.1 70B on MATH-500!🤯
December 17, 2024 at 7:30 AM
How we implemented test-time compute scaling for open models to solve complex math problems, like OpenAI o1. 👀 Test-time compute methods use dynamic inference strategies to let LLMs “think longer” on harder problems, e.g. difficult math problems.
December 17, 2024 at 7:30 AM
- 🛠️ Cuts down costs to ~2.29% and time to ~2.36% of human evaluation
- 💰 Costs $30 vs $1,297 for human evaluation
- ⚡ Reduced time to 118.43 minutes vs 86.5 hours
- 🧑‍⚖️ An LLM-as-a-Judge achieved a 60-70% alignment rate with human judgments
- 🥇 The Agent-as-a-Judge achieved a 90% alignment rate with human judgments

huggingface.co/datasets/DEV...
DEVAI-benchmark/DEVAI · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
December 10, 2024 at 9:53 AM
The Agent-as-a-Judge is a graph-based agent with tools to locate, read, retrieve, and evaluate files and information in a code project. It judges the results of other agents, and its quality is measured by comparing its judgments to human evaluations (alignment rate, judge shift). A toy sketch follows after the link.

Github: github.com/metauto-ai/a...
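A toy sketch of that loop, with a generic `llm` callable and illustrative tool names (not Meta's actual API):

```python
from pathlib import Path
from typing import Callable

def locate(root: str, pattern: str = "*.py") -> list[Path]:
    """Tool: find files that may hold evidence for a requirement."""
    return sorted(Path(root).rglob(pattern))

def read(path: Path, max_chars: int = 4000) -> str:
    """Tool: read a bounded chunk of a file."""
    return path.read_text(errors="ignore")[:max_chars]

def judge_requirement(llm: Callable[[str], str], root: str, requirement: str) -> bool:
    """Gather evidence with tools, then ask the LLM for a per-requirement verdict."""
    evidence = "\n\n".join(read(p) for p in locate(root)[:5])
    verdict = llm(f"Requirement: {requirement}\n\nEvidence:\n{evidence}\n\n"
                  "Does the project satisfy the requirement? Answer yes or no.")
    return verdict.strip().lower().startswith("yes")
```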
December 10, 2024 at 9:53 AM
What is better than an LLM as a Judge? Right, an Agent as a Judge! Meta created an Agent-as-a-Judge to evaluate code agents and enable intermediate feedback, alongside DevAI, a new benchmark of 55 realistic development tasks.

Paper: huggingface.co/papers/2410....
Paper page - Agent-as-a-Judge: Evaluate Agents with Agents
Join the discussion on this paper page
huggingface.co
December 10, 2024 at 9:53 AM
Sora UI: sora.com

Kudos to OpenAI for shipping this! The UI/UX looks really thorough! 🚢
Sora
Transform text and images into immersive videos. Animate stories, visualize ideas, and bring your concepts to life.
sora.com
December 9, 2024 at 6:41 PM

OpenAI trained a new Turbo model to make it easier and faster to use. With "storyboards", users get a CapCut/TikTok/Reels-like text-to-video editor that can be used to edit and create new short-form content! Social media will be flooded. 🌊
December 9, 2024 at 6:41 PM
A big day for AI and a sad day for the EU. OpenAI releases Sora, their text-to-video model, with a dedicated UI studio! Sora will be included for all ChatGPT Pro and Plus subscribers at no additional cost. Sora will be available later today, except if you live in the EU or UK. 🤯
December 9, 2024 at 6:41 PM
- ⚠️ Notable limitations, including language mixing, recursive reasoning loops, and safety considerations
- 😍 Released under Apache 2.0 on Hugging Face
- 👀 Full “reasoning” (CoT) available in the demo
November 28, 2024 at 8:01 AM
- 👨‍🔬 QwQ-32B-Preview is an experimental research release
- 🔧 32.5B parameters and 32,768 context length
- 📊 65.2% on GPQA, 50.0% on AIME, 90.6% on MATH-500, and 50.0% on LiveCodeBench
November 28, 2024 at 8:01 AM
First open-weights OpenAI-o1-like reasoning model! QwQ from the Qwen team is a 32B model that beats OpenAI o1-mini, competes with o1-preview, and is available under Apache 2.0 on Hugging Face! 🤯
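It should load like any transformers chat model; a quick sketch (a 32B model needs serious GPU memory in bf16, so check the model card for quantized options):

```python
from transformers import pipeline

# QwQ-32B-Preview loads like a regular chat model (device_map needs accelerate).
pipe = pipeline("text-generation", model="Qwen/QwQ-32B-Preview", device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
out = pipe(messages, max_new_tokens=1024)  # give it room: it reasons step by step
print(out[0]["generated_text"][-1]["content"])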
November 28, 2024 at 8:01 AM
🎥 Surprising video capabilities with 27.14% on CinePile
🔓 Released under Apache 2.0 on @huggingface.bsky.social
📱 Can run efficiently on laptops and edge devices
November 26, 2024 at 4:31 PM
🚀 Smallest SOTA vision language model at only 2B parameters
🛠️ Released in 3 variants: Base, Synthetic, and Instruct
💾 Requires only 5GB GPU RAM and achieves 38.8% on MMMU, 81.6% on DocVQA
⚡ 3.3-4.5x faster prefill and 7.5-16x faster generation vs Qwen2-VL
November 26, 2024 at 4:31 PM
SmolLM can now see! 👀 Meet SmolVLM - a tiny 2B but powerful vision language model that runs on your device! Built on top of SmolLM and released under Apache 2.0. 🚀
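A quick usage sketch with transformers, based on the announced checkpoint name; double-check the model card for the exact snippet:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("photo.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```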
November 26, 2024 at 4:31 PM
Blog: neuralmagic.com/blog/24-spar...
Pruning is not a new technique, but compared to quantization it has been much harder to get good results from it while maintaining performance across tasks. Let's see if Neural Magic can change that.
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference
Discover Sparse Llama: A 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.
neuralmagic.com
November 26, 2024 at 8:24 AM
- 📈 Full recovery on fine-tuning tasks (GSM8K, Evol-CodeAlpaca, Ultrachat-200K)
- ⚡ 1.4-2.1x better multi-query throughput
- 🌱 Pruned using 13B training tokens, 26 hours on 32 H100s
- 🔧 Optimized for NVIDIA Ampere GPUs and newer
November 26, 2024 at 8:24 AM
- 🔄 98.4% of the original accuracy on the Open LLM Leaderboard v1 with 50% fewer parameters, using a 2:4 sparsity pattern
- 🚀 30% higher throughput and 1.8x lower latency, with up to 5.0x speedup when combined with quantization
- 💻 Works with 4-bit quantization (GPTQ) and Sparse-Marlin kernels
November 26, 2024 at 8:24 AM
How far can we push LLM optimizations? Turns out, pretty far! A new study achieves 98% accuracy recovery on key benchmarks while removing 50% of Llama 3.1 8B's parameters with pruning, which strategically removes unnecessary connections in a neural network to make it smaller and faster (toy sketch of the 2:4 pattern below). 👀
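To make the 2:4 pattern concrete, here is a toy magnitude-pruning sketch. This is not Neural Magic's actual method, which also retrains to recover accuracy; it just shows the hardware-friendly pattern NVIDIA's sparse tensor cores accelerate:

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every consecutive group of 4."""
    w = weight.reshape(-1, 4)                        # groups of 4 along the rows
    keep = w.abs().topk(2, dim=-1).indices           # the 2 largest per group survive
    mask = torch.zeros_like(w).scatter(-1, keep, 1.0)
    return (w * mask).reshape(weight.shape)

w = torch.randn(4, 8)
print(prune_2_4(w))  # exactly two non-zeros in each group of 4
```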
November 26, 2024 at 8:24 AM
TIL: @huggingface.bsky.social Transformers has native Tensor Parallelism support for better inference on multiple GPUs! This will enable many benefits and optimizations in the future.🚀

For now, it supports Llama. Which one would you want to see next?
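A sketch of how the new path looks, going by the transformers multi-GPU docs; launch with torchrun so each process drives one GPU:

```python
# Run with: torchrun --nproc-per-node 4 tp_demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# tp_plan="auto" shards the supported layers across all participating GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id, tp_plan="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```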
November 25, 2024 at 3:50 PM