The fundamental problem with data contamination is not necessarily cheating. It is rather the fact that contamination may camouflage true generalization, which is what is needed in real-world applications. (6/N)
November 5, 2025 at 8:44 AM
Beyond detecting contamination, CoDeC is also a useful research tool for better understanding the robustness and generalization properties of models and for studying their sensitivity to over-reliance on memorization patterns. (5/N)
November 5, 2025 at 8:44 AM
CoDeC uses this consistent observation to detect whether a model might have been contaminated with common benchmarks. Through controlled experiments, we show that the method is a reliable detector at the benchmark level and it can be used in early training to analyze accidental contamination. (4/N)
November 5, 2025 at 8:44 AM
In contrast, if the model has already seen the benchmark before (i.e., the benchmark might have been memorized), model confidence does not improve and might even drop. (3/N)
November 5, 2025 at 8:44 AM
More precisely, if you want to solve problem X from a benchmark and you include a few other example problems from the same benchmark in context, models that have never seen the benchmark in training benefit from the in-context examples, which increase the model's confidence. (2/N)
November 5, 2025 at 8:44 AM
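The idea in posts (2/N) and (3/N) can be sketched roughly as follows. This is not the authors' CoDeC implementation, only a minimal illustration under stated assumptions: `score` is a hypothetical callable that returns the model's confidence (for example, an average log-probability) for a target given a context prefix, and the two toy scorers stand in for real models. The sketch measures, per benchmark item, whether adding other items from the same benchmark in context raises confidence, and reports the fraction of items where it does not.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: (context_prefix, target) -> model confidence in target,
# e.g. an average token log-probability. Not part of any real library.
Scorer = Callable[[str, str], float]


def confidence_deltas(score: Scorer, items: Sequence[str], n_context: int = 2) -> List[float]:
    """Confidence with in-context examples minus confidence without, per item."""
    deltas = []
    for i, target in enumerate(items):
        # Use other items from the same benchmark as in-context examples.
        others = [x for j, x in enumerate(items) if j != i][:n_context]
        context = "\n\n".join(others) + "\n\n"
        deltas.append(score(context, target) - score("", target))
    return deltas


def fraction_not_improved(deltas: Sequence[float]) -> float:
    """Share of items whose confidence stayed flat or dropped with context."""
    return sum(d <= 0 for d in deltas) / len(deltas)


# Toy stand-ins for real models (assumptions, for illustration only).
def unseen_model(context: str, target: str) -> float:
    # A model that never saw the benchmark gains confidence from context.
    return -1.0 if context else -2.0


def memorized_model(context: str, target: str) -> float:
    # A model that memorized the benchmark is already confident; context does not help.
    return -0.3 if context else -0.2


benchmark_items = ["Q1 ... A1", "Q2 ... A2", "Q3 ... A3", "Q4 ... A4"]
unseen_fraction = fraction_not_improved(confidence_deltas(unseen_model, benchmark_items))
memorized_fraction = fraction_not_improved(confidence_deltas(memorized_model, benchmark_items))
```

A high fraction would point toward possible contamination at the benchmark level. Any decision threshold would need to be calibrated and is deliberately left out here.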
…the list continues, but the point is that a company that hires the best talent in the field definitely knows how to chart. The problem arises when marketing drives and dominates the science, and it is not a single-company problem today.
August 9, 2025 at 8:22 PM
…coloring new model releases boldly while leaving the older models blank/white so newer models artificially stand out even if they’re not better, not providing worst-case results, not standardizing the max value across charts presented side by side at the same level…
August 9, 2025 at 8:21 PM