@anmorgan.bsky.social
I used SelfCheckGPT to automatically evaluate my outputs using @comet.com's Opik, a free, 100% open-source LLM evaluation framework.

⭐ Check it out and give it a star if you like what you see: github.com/comet-ml/opik (11/11)
GitHub - comet-ml/opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
github.com
March 27, 2025 at 4:15 PM
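For a concrete sense of that last step, a SelfCheckGPT-style consistency score can be registered as a custom Opik metric. The sketch below follows the BaseMetric/ScoreResult pattern from Opik's docs; the difflib similarity is a deliberately crude lexical stand-in for the BERTScore/NLI scoring covered later in the thread, not the scorer used in the article.

```python
# Hedged sketch: a custom Opik metric that scores self-consistency.
# The difflib ratio is a crude lexical stand-in for a real BERTScore/NLI scorer.
from difflib import SequenceMatcher

from opik.evaluation.metrics import base_metric, score_result


class SelfConsistency(base_metric.BaseMetric):
    """Scores how consistent one output is with extra sampled responses."""

    def __init__(self, name: str = "self_consistency"):
        self.name = name

    def score(self, output: str, samples: list[str], **ignored_kwargs):
        # Average similarity between the main output and each resampled response.
        sims = [SequenceMatcher(None, output, s).ratio() for s in samples]
        value = sum(sims) / len(sims) if sims else 0.0
        return score_result.ScoreResult(
            value=value,
            name=self.name,
            reason=f"Mean similarity against {len(samples)} resampled responses",
        )


# Usage: SelfConsistency().score(output=main_answer, samples=[s1, s2, s3])
```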
Learn how to code SelfCheckGPT from scratch, and use it to automatically evaluate your #LLM application in the full-code Colab:

colab.research.google.com/drive/1E5yEq... (10/11)
Google Colab
colab.research.google.com
March 27, 2025 at 4:15 PM
These methods allow SelfCheckGPT to detect hallucinations without any external fact-checking tools—just by analyzing the model’s own response patterns.

Check out the full breakdown in my new article:
🔗 bit.ly/4iMxZbs (9/11)
SelfCheckGPT for LLM Evaluation
SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect hallucinations
bit.ly
March 27, 2025 at 4:15 PM
5⃣ SelfCheckGPT with LLM Prompting:

▪️Ask the LLM itself to evaluate its own responses.
▪️Can the model detect contradictions in its own outputs?
▪️Self-reflection as a consistency check! (8/11)
March 27, 2025 at 4:15 PM
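A rough sketch of this prompting check, assuming the OpenAI Python SDK: ask a model whether each sentence of the main answer is supported by each resampled answer, and count the "No" votes. The prompt wording and model name are illustrative, not the paper's exact template.

```python
# Sketch of SelfCheckGPT-style prompting: count how many resampled answers
# fail to support a sentence from the main answer (higher = more suspect).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "Context: {sample}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No."
)


def prompt_check(sentence: str, samples: list[str], model: str = "gpt-4o-mini") -> float:
    votes = []
    for sample in samples:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(sample=sample, sentence=sentence)}],
            temperature=0,
        )
        answer = reply.choices[0].message.content.strip().lower()
        votes.append(0.0 if answer.startswith("yes") else 1.0)
    return sum(votes) / len(votes) if votes else 0.0
```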
4️⃣ SelfCheckGPT with NLI:

▪️An NLI model (DeBERTa-v3-large) classifies the relationship between each sampled response and the original as entailment, neutral, or contradiction.
▪️The higher the contradiction score, the more likely it's a hallucination. (7/11)
March 27, 2025 at 4:15 PM
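A minimal sketch of the NLI check, assuming a publicly available MNLI-tuned DeBERTa checkpoint from the Hugging Face Hub (the SelfCheckGPT package ships its own fine-tuned NLI model); the checkpoint name here is an example.

```python
# Sketch: average P(contradiction) of a sentence against each resampled response.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # example public MNLI-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Look the contradiction label up by name rather than assuming a fixed index.
CONTRA = next(i for i, lbl in model.config.id2label.items() if "contradict" in lbl.lower())


def contradiction_score(sentence: str, samples: list[str]) -> float:
    """Higher average contradiction probability = more likely hallucination."""
    probs_per_sample = []
    for sample in samples:
        inputs = tokenizer(sample, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        probs_per_sample.append(probs[CONTRA].item())
    return sum(probs_per_sample) / len(probs_per_sample) if probs_per_sample else 0.0
```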
3️⃣ N-gram Probability Analysis:

▪️Train an n-gram model on multiple sampled responses.
▪️Sentences with higher log-probabilities are more reliable.
▪️Low probability = higher chance of hallucination. (6/11)
March 27, 2025 at 4:15 PM
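A toy version of this check using only the standard library: fit a unigram model with add-one smoothing on the resampled responses and score each sentence of the main answer by its average negative log-probability (the paper fits an n-gram LM over all samples).

```python
# Toy unigram variant of the n-gram check; higher score = more suspect sentence.
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def ngram_scores(main_sentences: list[str], samples: list[str]) -> list[float]:
    counts = Counter(tok for s in samples for tok in tokenize(s))
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 leaves probability mass for unseen tokens
    scores = []
    for sentence in main_sentences:
        toks = tokenize(sentence)
        nll = [-math.log((counts[t] + 1) / (total + vocab)) for t in toks]
        scores.append(sum(nll) / len(nll) if nll else 0.0)
    return scores
```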
2️⃣ Question Answering (QA) Check:

▪️Convert generated text into multiple-choice questions.
▪️If the model can’t answer consistently across samples, it suggests low factual reliability. (5/11)
March 27, 2025 at 4:15 PM
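A structural sketch of the QA check, assuming the OpenAI SDK. The paper uses dedicated question-generation and multiple-choice QA models (MQAG); here a single chat model stands in for both, and the exact-match comparison is a simplification.

```python
# Structural sketch of the QA check: generate questions from the main answer,
# answer them against each resampled response, and measure disagreement.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-mini"  # example model name


def chat(prompt: str) -> str:
    reply = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return reply.choices[0].message.content.strip()


def qa_disagreement(main_answer: str, samples: list[str], n_questions: int = 3) -> float:
    """Fraction of (question, sample) pairs whose answer differs from the main answer's."""
    questions = [
        chat(f"Write one short factual question answerable from this text:\n{main_answer}")
        for _ in range(n_questions)
    ]
    disagreements, total = 0, 0
    for q in questions:
        reference = chat(f"Answer briefly, using only this text:\n{main_answer}\n\nQuestion: {q}")
        for s in samples:
            candidate = chat(f"Answer briefly, using only this text:\n{s}\n\nQuestion: {q}")
            total += 1
            disagreements += int(candidate.lower() != reference.lower())  # crude exact match
    return disagreements / total if total else 0.0
```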
1️⃣ BERTScore Comparison:

▪️Compare multiple model-generated responses to a query.
▪️Higher BERTScore similarity = more reliable output.
▪️If responses contradict each other, it’s a red flag. (4/11)
March 27, 2025 at 4:15 PM
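A minimal sketch of this check with the bert-score package (pip install bert-score). The paper scores sentence by sentence; this coarser version scores the whole response against each sample.

```python
# Sketch: average BERTScore F1 between the main answer and each resampled answer.
from bert_score import score


def bertscore_consistency(main_answer: str, samples: list[str]) -> float:
    """Higher mean F1 means the samples agree with the main answer."""
    cands = [main_answer] * len(samples)
    _, _, f1 = score(cands, samples, lang="en", verbose=False)
    return f1.mean().item()
```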
How does it work?

If an #LLM knows a fact, its responses to the same query should be consistent. If not, inconsistencies may signal potential hallucinations.

This paper outlines five key methods to quantify this. 👇
arxiv.org/abs/2303.08896 (3/11)
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-fac...
arxiv.org
March 27, 2025 at 4:15 PM
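All five checks above share the same setup: get the answer you want to check, then re-sample the same prompt several times at a non-zero temperature. A minimal sketch, assuming the OpenAI SDK and an example model name:

```python
# Common setup for every SelfCheckGPT variant: one main answer plus N
# stochastic re-samples of the same prompt to compare it against.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def sample_responses(prompt: str, n: int = 5, model: str = "gpt-4o-mini"):
    def ask(temperature: float) -> str:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return reply.choices[0].message.content

    main = ask(0.0)                         # the answer we want to check
    samples = [ask(1.0) for _ in range(n)]  # stochastic re-samples
    return main, samples
```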
Traditional #AI evaluation methods often require:

❌ Access to model internals
❌ External fact-checking tools or databases
❌ References

But what if you can’t access these? SelfCheckGPT relies purely on self-consistency! (2/11)
March 27, 2025 at 4:15 PM
I used the LLM Jury to automatically evaluate my outputs using @comet.com's Opik, a free, 100% open-source LLM evaluation framework.

⭐️ Check it out and give it a star if you like what you see: (6/6)

github.com/comet-ml/opik
GitHub - comet-ml/opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
github.com
February 24, 2025 at 6:31 PM
Aligning the inputs and outputs to these diverse models is made super simple by using @openrouter.bsky.social, a unified API that gives you access to hundreds of AI models through a single endpoint. (5/6)

Check out the full-code Colab to get started: colab.research.google.com/drive/1Lt-4r...
Google Colab
colab.research.google.com
February 24, 2025 at 6:31 PM
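A sketch of that single-endpoint setup with the OpenAI SDK pointed at OpenRouter. The model slugs below correspond to the judge models named in the next post as they typically appear on OpenRouter; check the live model list before relying on them.

```python
# Sketch: call several judge models through OpenRouter's one OpenAI-compatible endpoint.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

JUDGES = [
    "openai/gpt-4o-mini",
    "mistralai/mistral-small-24b-instruct-2501",
    "cohere/command-r-08-2024",
]


def ask_all(prompt: str) -> dict[str, str]:
    """Send the same judging prompt to every model and collect the replies."""
    return {
        model: client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        for model in JUDGES
    }
```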
In my new article, I code an LLM Jury from scratch using gpt-4o-mini, @mistralai.bsky.social's mistral-small-24b-instruct-2501 and @cohere.com's command-r-08-2024

Then I use it to evaluate the output of @alibabagroup.bsky.social's Qwen2.5-3B-Instruct: (4/6)

www.comet.com/site/blog/ll...
LLM Juries for Evaluation
An LLM Jury consists of multiple LLM judges that independently score a given output, then aggregate their scores through a voting function.
www.comet.com
February 24, 2025 at 6:31 PM
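A minimal jury sketch built on the same OpenRouter client: each judge scores the answer on a 1-5 rubric and the "voting function" is a plain average. The rubric prompt and score range are illustrative, not the article's exact setup.

```python
# Sketch: an LLM Jury that averages 1-5 scores from several judge models.
import os
import re
from statistics import mean

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

JUDGES = [
    "openai/gpt-4o-mini",
    "mistralai/mistral-small-24b-instruct-2501",
    "cohere/command-r-08-2024",
]

RUBRIC = (
    "Rate how well the answer addresses the question on a 1-5 scale.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with a single digit."
)


def jury_score(question: str, answer: str) -> float:
    scores = []
    for model in JUDGES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": RUBRIC.format(question=question, answer=answer)}],
            temperature=0,
        ).choices[0].message.content
        digit = re.search(r"[1-5]", reply)  # tolerate judges that add extra words
        if digit:
            scores.append(int(digit.group()))
    return mean(scores) if scores else 0.0
```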
Research from @cohere.com suggests that a diverse panel of smaller models outperforms a single large judge, reduces bias, and does so at over 7x lower cost.

Plus, multiple smaller models can run in parallel, further improving speed and efficiency. (3/6)

arxiv.org/abs/2404.18796
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properti...
arxiv.org
February 24, 2025 at 6:31 PM