Nikolay Bogoychev
xapajiamnu.bsky.social
Research Scientist at Meta.
LLMs, neural networks, logographic writing systems.

https://nbogoychev.com
[2/2]
This is part of the Kaggle benchmarking initiative, spearheading the effort for trustworthy and rigorous LLM evaluation!
www.kaggle.com/blog/announc...
Introducing Kaggle Benchmarks | Kaggle
Democratizing trustworthy, rigorous evaluation for the GenAI era
www.kaggle.com
July 23, 2025 at 1:04 PM
[5/5] Paper: arxiv.org/abs/2504.10356
Dataset, code, leaderboard: github.com/facebookrese...
April 15, 2025 at 1:03 PM
[4/5] We release a dev partition which you can download and use. We also have a secret test partition, which will not be released at this point in time to avoid accidental contamination. If you upload your models to Hugging Face, we can evaluate them for you and put you on our leaderboard!
April 15, 2025 at 1:02 PM
[3/5] Our dataset shows that, in general, LLMs answer questions better if they are asked in the language corresponding to the culture where the knowledge originated, showing that there is a lot of growth potential when it comes to cross-lingual knowledge transfer.
April 15, 2025 at 1:01 PM
[2/5] Bulgarians are much more likely to ask an LLM about Tsar Simeon than about Queen Elizabeth.

We also provide translations of our dataset to and from English so that we can measure cross-lingual knowledge transfer.
April 15, 2025 at 1:00 PM
Absolutely. I am just looking for a way to make scaling laws map to actual task metrics. At the moment we have the situation of "look at how nicely the NLL goes down on task X when we scale the model", but no idea what the actual downstream task metric will be or when it'd saturate. Maybe it's impossible.
February 27, 2025 at 12:23 AM
We should get rid of those auxiliary tasks; they shoehorn everybody into NLL for scaling laws...
February 26, 2025 at 6:06 PM
Pretty cool work! Did you also see neurips.cc/virtual/2023... ?

It would be nice if we could also figure out more linear metrics for evaluating popular LLM tasks...
NeurIPS Poster: Are Emergent Abilities of Large Language Models a Mirage? (NeurIPS 2023)
neurips.cc
February 26, 2025 at 7:39 AM