Ayush Thakur
ayushthakur.bsky.social
MLE @ Weights and Biases
Launched a course on evaluating LLM-based applications: wandb.ai/site/courses...

Enjoy. 😄
LLM Apps: Evaluation
Develop techniques for building, optimizing, and scaling AI evaluators with minimal human input. Learn to build reliable evaluation pipelines for LLM applications by combining programmatic checks with...
wandb.ai
January 13, 2025 at 6:17 PM
Back in the day, the WMT14 en-de dataset with 400k training samples was used a lot for NMT tasks. The reason is that German is morphologically richer than the other languages in that benchmark.
November 25, 2024 at 10:55 AM
Have been working on an "LLM system" robustness metric "scorer".

Turns out statistical effect-size measures like Cohen's d and Cohen's h are really good at quantifying robustness.

Cohen's h is especially good when the system's output is binary.
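
For anyone who wants to try this, here is a minimal sketch of the two effect sizes using their textbook definitions. It is not the actual scorer mentioned above: the function names and the idea of comparing scores on original vs. perturbed inputs are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled sample SD).

    Assumes each sample has at least two values and the pooled
    variance is non-zero.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def cohens_h(p1, p2):
    """Effect size between two proportions via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical usage: continuous scores on original vs. perturbed prompts,
# and pass rates (binary outputs) on the same two sets.
original_scores = [0.9, 0.8, 0.85, 0.95]
perturbed_scores = [0.7, 0.75, 0.6, 0.8]
print(cohens_d(original_scores, perturbed_scores))
print(cohens_h(0.90, 0.75))
```

Values near 0 would suggest the perturbation barely changes behavior; larger magnitudes suggest the system is less robust to it.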
November 25, 2024 at 10:51 AM
ML
November 20, 2024 at 6:40 AM