Lightnews — Scholar-powered news

Yotam Perlitz

@yperlitz.bsky.social

How important are LLM evaluations to you?

A) Who cares?
B) Somewhat important (I guess?)
C) I'm an LLM, I evaluate myself.
D) Enough to join the pack

Let's talk about LLM evals here: go.bsky.app/DJpp8cy

November 18, 2024 at 8:50 PM

Yotam Perlitz

@yperlitz.bsky.social

Save yourselves the hours (or days) of running inference on all 64K examples when using HELM.
In arxiv.org/pdf/2308.116... we show that 160 examples 🤯🤯🤯 are enough to get a very good picture. #ComputeIsForTraining

with
@lchoshen.bsky.social and more

November 13, 2024 at 6:40 PM
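
The subsampling idea in the post above can be illustrated with a small simulation. This is only a sketch on synthetic data, not the procedure from the linked paper: the model names, their accuracies, and the uniform random sampling are all assumptions made for the example. It scores three hypothetical models on a full set of 64,000 examples and on a random subset of 160, then compares the resulting rankings.

```python
import random


def accuracy(outcomes):
    """Fraction of examples answered correctly."""
    return sum(outcomes) / len(outcomes)


def rank(scores):
    """Model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


rng = random.Random(0)
N_FULL, N_SUB = 64_000, 160

# Hypothetical models with assumed true accuracies (synthetic, not real results).
true_acc = {"model_a": 0.8, "model_b": 0.5, "model_c": 0.2}

# Simulate a per-example correct/incorrect outcome for every model.
results = {
    name: [rng.random() < p for _ in range(N_FULL)]
    for name, p in true_acc.items()
}

# Draw one random subset of example indices and reuse it for all models.
subset_idx = rng.sample(range(N_FULL), N_SUB)

full_scores = {name: accuracy(r) for name, r in results.items()}
sub_scores = {
    name: accuracy([r[i] for i in subset_idx]) for name, r in results.items()
}

for name in true_acc:
    print(f"{name}: full={full_scores[name]:.3f} subset={sub_scores[name]:.3f}")
print("same ranking:", rank(full_scores) == rank(sub_scores))
```

The models here are deliberately far apart; models with closer scores would need more examples before their order stabilizes.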

Yotam Perlitz

@yperlitz.bsky.social

If you haven't tried it yet:
github.com/yamadashy/re...
can turn your repo into one file,
making it super easy to feed to a chatbot and ask questions

GitHub - yamadashy/repomix: 📦 Repomix (formerly Repopack) is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large...

github.com

November 12, 2024 at 7:50 PM

Yotam Perlitz

@yperlitz.bsky.social

✨ Developed a new benchmark or dataset for language models? ✨
Want the community to trust and adopt it? 🤔
Show that it (dis)agrees with common benchmarks

BenchBench makes it easy. Check it out:
👉 huggingface.co/spaces/ibm/b...

BenchBench Leaderboad - a Hugging Face Space by ibm

Discover amazing ML apps made by the community

huggingface.co

November 12, 2024 at 7:47 PM

Yotam Perlitz

@yperlitz.bsky.social

Seems like it indeed measures what it claims to :)
Kudos to the authors
A faster, automatic (no annotators) alternative to the Chatbot arena https://t.co/WNk3UmXRSq

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

https://t.co/TZlMiQdgWR

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

We've now added the decentralized arena to BenchBench,

check out how it fares against other benchmarks

https://t.co/pjhtr8CPZD

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

Get your benchmark game on: https://t.co/yY0swLQOHZ https://t.co/3qzkcIOd7u https://t.co/5Y7QUz0Ype

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

Me trying to choose the right LLM benchmark without BenchBench:

https://t.co/TZlMiQdgWR https://t.co/DQEttklUGQ

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

Shoutout to @streamlit, our framework of choice! Shoutout to @huggingface for hosting our space 🤗 https://t.co/z8LFw6ZQG7

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

Use the BenchBench Leaderboard to explore and visualize how established benchmarks compare: https://t.co/yY0swLQgSr
Use our Python package to perform your own BAT analysis: https://t.co/iU8favWVT6
And read the paper: https://t.co/RvCp3R6gU5 https://t.co/poHpewZkS3

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

BenchBench can prove your benchmark measures unique skills ❄️(disagreement with existing benchmarks)

Or prove it captures the essence of what others aimed at (agreement), for example, agreeing with @lmsys, but efficiently. https://t.co/KwtHtTRESc

November 19, 2024 at 7:27 PM
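
One common way to quantify the agreement or disagreement described in the post above is a rank correlation between the scores two benchmarks assign to the same models. The sketch below is an illustration under stated assumptions, not the BenchBench package or its API: the `kendall_tau` helper is written here from scratch (Kendall tau-a, assuming no tied scores), and all benchmark scores are made-up numbers.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both benchmarks order this model pair the same way
        elif sign < 0:
            discordant += 1  # the benchmarks disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for the same five models on three benchmarks.
established = [0.90, 0.80, 0.70, 0.60, 0.50]
new_agreeing = [0.85, 0.88, 0.65, 0.55, 0.40]  # swaps only the top two models
new_unique = [0.50, 0.60, 0.70, 0.80, 0.90]    # reverses the order entirely

print("agreeing benchmark tau:", kendall_tau(established, new_agreeing))
print("unique benchmark tau:  ", kendall_tau(established, new_unique))
```

A value near 1 suggests the new benchmark ranks models like the established one, while a value near 0 or below suggests it measures something different.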

Yotam Perlitz

@yperlitz.bsky.social

✨ Developed a new benchmark or dataset for language models? ✨

Want the community to trust and adopt it? 🤔

So, demonstrate its validity by comparing it to established benchmarks!

BenchBench makes it easy. Check it out:
👉 https://t.co/yY0swLQgSr

November 19, 2024 at 7:27 PM

Yotam Perlitz

@yperlitz.bsky.social

Shout-out to the amazing team at IBM behind Unitxt: @ElronBandel, @MatanOrbach, yoavkatz, eladv, @LChoshen, @yotamperlitz & more!

IBM is betting big on it (IBM Research AI VP 👇) https://t.co/BKfK0JriYB

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

HELM just got a great upgrade!
We've integrated with Unitxt for:

Easy dataset addition
2x the datasets
Sharable & reproducible pipelines

Check out the blogpost: https://t.co/UJXwfPKzGN
And the unitxt repo
https://t.co/GeqMCoQhjv

@ElronBandel @YifanMai

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

Everyone knows you never have to use the full test set
We show just how right they were 🤯!

Check out our presentation at @naacl
in Efficient/Low-Resources and Evaluation Methods for NLP (18 June 2024 @ 02:12)

or watch our video here:
https://t.co/pPOpKyLbhT

See you! https://t.co/ocVvmVBBlW

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

It is a great figure
and a great thing you did by sharing all your meta-data!

it has enabled a lot of great work
ours as well :)

https://t.co/9lGi8aW8IG https://t.co/Lz62fTdn7O

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

Bored with all benchmarks ranking models the same?
HOLMES doesn't 💪

Probing LMs for linguistic abilities is a fresh idea, @AndreasWaldis took it to the extreme 🦸

Give it a read!
or check out the leaderboard https://t.co/Byc1Nhp3nV https://t.co/zH0RLddkID

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

I've been working internally with this dataset
and let me tell you...

It's great! https://t.co/MOwn0OyVS3

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

like the color scheme 🏅 https://t.co/sdAosgxypV

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

Using contrastive representation for optimized human evaluation 👁️👁️👁️

Nice! https://t.co/49leLodOAQ

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

Check out the paper for more insights :) https://t.co/7zhb8mGtQ0

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

variance in evaluation has many sources,
this work really does a good job at profiling one of these https://t.co/nAf7zYDSd7

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

these models keep changing 💩
tomorrow this figure will have no meaning https://t.co/OsA2WfiLHn

November 19, 2024 at 7:28 PM

Yotam Perlitz

@yperlitz.bsky.social

this is a nice-to-have link :) https://t.co/DYApcasZen

November 19, 2024 at 7:28 PM
