Esports stuff for fun:
https://cthorrez.github.io/riix/riix.html
https://huggingface.co/datasets/EsportsBench/EsportsBench
Should "A ≻ B" mean:
"A is preferred to B (higher rating)"
"B is preferred to A (lower rank number)"?
arxiv.org/pdf/2411.049...
www.tandfonline.com/doi/full/10....
Basically, my rules of thumb are to never use numpy on scalars unless the function simply doesn't exist in base Python, and to try the simpler thing first. ** and pow are general and need to support raising numbers to any power, while num*num is a single multiplication.
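A quick way to check this on your own machine is a small timing sketch like the one below. It is only an illustration: the test value `3.7`, the helper name `time_stmt`, and the repetition counts are arbitrary choices of mine, and the numpy comparison only runs if numpy happens to be installed. Absolute numbers will vary by machine and Python version, so no expected timings are given.

```python
import timeit

x = 3.7  # arbitrary scalar used for the comparison


def time_stmt(stmt, setup="x = 3.7", number=200_000):
    """Return the best-of-3 total time (seconds) for running stmt `number` times."""
    return min(timeit.repeat(stmt, setup=setup, number=number, repeat=3))


# Three ways to square a plain Python float.
timings = {
    "x * x": time_stmt("x * x"),
    "x ** 2": time_stmt("x ** 2"),
    "pow(x, 2)": time_stmt("pow(x, 2)"),
}

# Optional: the same operation through numpy on a scalar.
try:
    import numpy  # noqa: F401  (only checking that it is available)
except ImportError:
    pass
else:
    timings["np.square(x)"] = time_stmt(
        "np.square(x)", setup="import numpy as np; x = 3.7"
    )

# Print fastest first.
for stmt, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{stmt:>14}: {seconds:.4f} s")
```

Taking the minimum over repeats rather than the mean is the usual `timeit` advice, since slower runs mostly reflect noise from other processes.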
Well for the largest Qwen3, the answer is -28 points
Thinking on academic benchmarks seems to help a lot, I wonder what's going wrong in the arena?
Maybe people can sense the hedging and don't like it, or it poisons its own context with overthinking
None of them are to copy the code
I think, if I'm reading it right, 0.363 is the white advantage, which when scaled to Elo space is ~63 points. Sounds significant to me.
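The arithmetic behind that conversion can be sketched as follows, assuming the 0.363 is a coefficient on the natural-log-odds (logistic) scale and that "Elo space" means the usual base-10, 400-point convention. Both are my reading, not something the source confirms, and the names `ELO_PER_NAT`, `logit_to_elo`, and `expected_score` are made up for the illustration.

```python
import math

# Assumed conversion: Elo uses 10 ** (diff / 400), a logistic model uses
# exp(diff), so one unit of natural log-odds equals 400 / ln(10) Elo points.
ELO_PER_NAT = 400 / math.log(10)


def logit_to_elo(advantage):
    """Convert an advantage on the natural-log-odds scale to Elo points."""
    return advantage * ELO_PER_NAT


def expected_score(elo_diff):
    """Standard Elo expected score for a player ahead by elo_diff points."""
    return 1 / (1 + 10 ** (-elo_diff / 400))


white_advantage_elo = logit_to_elo(0.363)
white_expected = expected_score(white_advantage_elo)

print(f"{white_advantage_elo:.1f} Elo")        # 63.1 Elo
print(f"{white_expected:.3f} expected score")  # about 0.590
```

Under those assumptions, a 63-point edge corresponds to an expected score of roughly 0.59 for white between otherwise equal players.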
Google thinks TrueSkill is in Russian!
scholar.google.com/scholar?hl=e...
Is that just my subjective disagreement with the original grader?
www.wolframalpha.com/input?i=%281...