@anmorgan.bsky.social
I used SelfCheckGPT to automatically evaluate my outputs using @comet.com's Opik, a free, 100% open-source LLM evaluation framework.

⭐ Check it out and give it a star if you like what you see: github.com/comet-ml/opik (11/11)
GitHub - comet-ml/opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
github.com
March 27, 2025 at 4:15 PM
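For a concrete sense of that last step, a SelfCheckGPT-style consistency score can be registered as a custom Opik metric. The sketch below follows the BaseMetric/ScoreResult pattern from Opik's docs; the difflib similarity is a deliberately crude lexical stand-in for the BERTScore/NLI scoring covered later in the thread, not the scorer used in the article.

```python
# Hedged sketch: a custom Opik metric that scores self-consistency.
# The difflib ratio is a crude lexical stand-in for a real BERTScore/NLI scorer.
from difflib import SequenceMatcher

from opik.evaluation.metrics import base_metric, score_result


class SelfConsistency(base_metric.BaseMetric):
    """Scores how consistent one output is with extra sampled responses."""

    def __init__(self, name: str = "self_consistency"):
        self.name = name

    def score(self, output: str, samples: list[str], **ignored_kwargs):
        # Average similarity between the main output and each resampled response.
        sims = [SequenceMatcher(None, output, s).ratio() for s in samples]
        value = sum(sims) / len(sims) if sims else 0.0
        return score_result.ScoreResult(
            value=value,
            name=self.name,
            reason=f"Mean similarity against {len(samples)} resampled responses",
        )


# Usage: SelfConsistency().score(output=main_answer, samples=[s1, s2, s3])
```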
Learn how to code SelfCheckGPT from scratch, and use it to automatically evaluate your #LLM application in the full-code Colab:

colab.research.google.com/drive/1E5yEq... (10/11)
Google Colab
colab.research.google.com
March 27, 2025 at 4:15 PM
These methods allow SelfCheckGPT to detect hallucinations without any external fact-checking tools—just by analyzing the model’s own response patterns.

Check out the full breakdown in my new article:
🔗 bit.ly/4iMxZbs (9/11)
SelfCheckGPT for LLM Evaluation
SelfCheckGPT analyzes divergences in output across multiple stochastic LLM runs, leveraging response variability to detect hallucinations
bit.ly
March 27, 2025 at 4:15 PM
5⃣ SelfCheckGPT with LLM Prompting:

▪️Ask the LLM itself to evaluate its own responses.
▪️Can the model detect contradictions in its own outputs?
▪️Self-reflection as a consistency check! (8/11)
March 27, 2025 at 4:15 PM
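A rough sketch of this prompting check, assuming the OpenAI Python SDK: ask a model whether each sentence of the main answer is supported by each resampled answer, and count the "No" votes. The prompt wording and model name are illustrative, not the paper's exact template.

```python
# Sketch of SelfCheckGPT-style prompting: count how many resampled answers
# fail to support a sentence from the main answer (higher = more suspect).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "Context: {sample}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No."
)


def prompt_check(sentence: str, samples: list[str], model: str = "gpt-4o-mini") -> float:
    votes = []
    for sample in samples:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(sample=sample, sentence=sentence)}],
            temperature=0,
        )
        answer = reply.choices[0].message.content.strip().lower()
        votes.append(0.0 if answer.startswith("yes") else 1.0)
    return sum(votes) / len(votes) if votes else 0.0
```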
4️⃣ SelfCheckGPT with NLI:

▪️An NLI model (DeBERTa-v3-large) classifies the relationship between each sampled response and the original as entailment, neutral, or contradiction.
▪️The higher the contradiction score, the more likely it's a hallucination. (7/11)
March 27, 2025 at 4:15 PM
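A minimal sketch of the NLI check, assuming a publicly available MNLI-tuned DeBERTa checkpoint from the Hugging Face Hub (the SelfCheckGPT package ships its own fine-tuned NLI model); the checkpoint name here is an example.

```python
# Sketch: average P(contradiction) of a sentence against each resampled response.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # example public MNLI-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Look the contradiction label up by name rather than assuming a fixed index.
CONTRA = next(i for i, lbl in model.config.id2label.items() if "contradict" in lbl.lower())


def contradiction_score(sentence: str, samples: list[str]) -> float:
    """Higher average contradiction probability = more likely hallucination."""
    probs_per_sample = []
    for sample in samples:
        inputs = tokenizer(sample, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        probs_per_sample.append(probs[CONTRA].item())
    return sum(probs_per_sample) / len(probs_per_sample) if probs_per_sample else 0.0
```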
3️⃣ N-gram Probability Analysis:

▪️Train an n-gram model on multiple sampled responses.
▪️Sentences with higher log-probabilities are more reliable.
▪️Low probability = higher chance of hallucination. (6/11)
March 27, 2025 at 4:15 PM
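A toy version of this check using only the standard library: fit a unigram model with add-one smoothing on the resampled responses and score each sentence of the main answer by its average negative log-probability (the paper fits an n-gram LM over all samples).

```python
# Toy unigram variant of the n-gram check; higher score = more suspect sentence.
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def ngram_scores(main_sentences: list[str], samples: list[str]) -> list[float]:
    counts = Counter(tok for s in samples for tok in tokenize(s))
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 leaves probability mass for unseen tokens
    scores = []
    for sentence in main_sentences:
        toks = tokenize(sentence)
        nll = [-math.log((counts[t] + 1) / (total + vocab)) for t in toks]
        scores.append(sum(nll) / len(nll) if nll else 0.0)
    return scores
```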
2️⃣ Question Answering (QA) Check:

▪️Convert generated text into multiple-choice questions.
▪️If the model can’t answer consistently across samples, it suggests low factual reliability. (5/11)
March 27, 2025 at 4:15 PM
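A structural sketch of the QA check, assuming the OpenAI SDK. The paper uses dedicated question-generation and multiple-choice QA models (MQAG); here a single chat model stands in for both, and the exact-match comparison is a simplification.

```python
# Structural sketch of the QA check: generate questions from the main answer,
# answer them against each resampled response, and measure disagreement.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4o-mini"  # example model name


def chat(prompt: str) -> str:
    reply = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return reply.choices[0].message.content.strip()


def qa_disagreement(main_answer: str, samples: list[str], n_questions: int = 3) -> float:
    """Fraction of (question, sample) pairs whose answer differs from the main answer's."""
    questions = [
        chat(f"Write one short factual question answerable from this text:\n{main_answer}")
        for _ in range(n_questions)
    ]
    disagreements, total = 0, 0
    for q in questions:
        reference = chat(f"Answer briefly, using only this text:\n{main_answer}\n\nQuestion: {q}")
        for s in samples:
            candidate = chat(f"Answer briefly, using only this text:\n{s}\n\nQuestion: {q}")
            total += 1
            disagreements += int(candidate.lower() != reference.lower())  # crude exact match
    return disagreements / total if total else 0.0
```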
1️⃣ BERTScore Comparison:

▪️Compare multiple model-generated responses to a query.
▪️Higher BERTScore similarity = more reliable output.
▪️If responses contradict each other, it’s a red flag. (4/11)
March 27, 2025 at 4:15 PM
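A minimal sketch of this check with the bert-score package (pip install bert-score). The paper scores sentence by sentence; this coarser version scores the whole response against each sample.

```python
# Sketch: average BERTScore F1 between the main answer and each resampled answer.
from bert_score import score


def bertscore_consistency(main_answer: str, samples: list[str]) -> float:
    """Higher mean F1 means the samples agree with the main answer."""
    cands = [main_answer] * len(samples)
    _, _, f1 = score(cands, samples, lang="en", verbose=False)
    return f1.mean().item()
```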
How does it work?

If an #LLM knows a fact, its responses to the same query should be consistent. If not, inconsistencies may signal potential hallucinations.

This paper outlines five key methods to quantify this. 👇
arxiv.org/abs/2303.08896 (3/11)
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-fac...
arxiv.org
March 27, 2025 at 4:15 PM
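All five checks above share the same setup: get the answer you want to check, then re-sample the same prompt several times at a non-zero temperature. A minimal sketch, assuming the OpenAI SDK and an example model name:

```python
# Common setup for every SelfCheckGPT variant: one main answer plus N
# stochastic re-samples of the same prompt to compare it against.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def sample_responses(prompt: str, n: int = 5, model: str = "gpt-4o-mini"):
    def ask(temperature: float) -> str:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return reply.choices[0].message.content

    main = ask(0.0)                         # the answer we want to check
    samples = [ask(1.0) for _ in range(n)]  # stochastic re-samples
    return main, samples
```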
Traditional #AI evaluation methods often require:

❌ Access to model internals
❌ External fact-checking tools or databases
❌ References

But what if you can’t access these? SelfCheckGPT relies purely on self-consistency! (2/11)
March 27, 2025 at 4:15 PM
I used the LLM Jury to automatically evaluate my outputs using @comet.com's Opik, a free, 100% open-source LLM evaluation framework.

⭐️ Check it out and give it a star if you like what you see: (6/6)

github.com/comet-ml/opik
GitHub - comet-ml/opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
github.com
February 24, 2025 at 6:31 PM
Aligning the inputs and outputs to these diverse models is made super simple by using @openrouter.bsky.social, a unified API that gives you access to hundreds of AI models through a single endpoint. (5/6)

Check out the full-code Colab to get started: colab.research.google.com/drive/1Lt-4r...
Google Colab
colab.research.google.com
February 24, 2025 at 6:31 PM
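A sketch of that single-endpoint setup with the OpenAI SDK pointed at OpenRouter. The model slugs below correspond to the judge models named in the next post as they typically appear on OpenRouter; check the live model list before relying on them.

```python
# Sketch: call several judge models through OpenRouter's one OpenAI-compatible endpoint.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

JUDGES = [
    "openai/gpt-4o-mini",
    "mistralai/mistral-small-24b-instruct-2501",
    "cohere/command-r-08-2024",
]


def ask_all(prompt: str) -> dict[str, str]:
    """Send the same judging prompt to every model and collect the replies."""
    return {
        model: client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        for model in JUDGES
    }
```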
In my new article, I code an LLM Jury from scratch using gpt-4o-mini, @mistralai.bsky.social's mistral-small-24b-instruct-2501 and @cohere.com's command-r-08-2024

Then I use it to evaluate the output of @alibabagroup.bsky.social's Qwen2.5-3B-Instruct: (4/6)

www.comet.com/site/blog/ll...
LLM Juries for Evaluation
An LLM Jury consists of multiple LLM judges that independently score a given output, then aggregate their scores through a voting function.
www.comet.com
February 24, 2025 at 6:31 PM
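A minimal jury sketch built on the same OpenRouter client: each judge scores the answer on a 1-5 rubric and the "voting function" is a plain average. The rubric prompt and score range are illustrative, not the article's exact setup.

```python
# Sketch: an LLM Jury that averages 1-5 scores from several judge models.
import os
import re
from statistics import mean

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

JUDGES = [
    "openai/gpt-4o-mini",
    "mistralai/mistral-small-24b-instruct-2501",
    "cohere/command-r-08-2024",
]

RUBRIC = (
    "Rate how well the answer addresses the question on a 1-5 scale.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with a single digit."
)


def jury_score(question: str, answer: str) -> float:
    scores = []
    for model in JUDGES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": RUBRIC.format(question=question, answer=answer)}],
            temperature=0,
        ).choices[0].message.content
        digit = re.search(r"[1-5]", reply)  # tolerate judges that add extra words
        if digit:
            scores.append(int(digit.group()))
    return mean(scores) if scores else 0.0
```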
Research from @cohere.com suggests that a diverse panel of smaller models outperforms a single large judge, reduces bias, and does so at over 7x lower cost.

Plus, multiple smaller models can run in parallel, further improving speed and efficiency. (3/6)

arxiv.org/abs/2404.18796
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properti...
arxiv.org
February 24, 2025 at 6:31 PM