This tech can be used responsibly, with some effort. We at EPPI Reviewer/EPPI Centre are trying to do so, and trying to help others too.
🧵 18/18
1. Ensure our results can be trusted.
2. Produce published examples of responsible use & evaluation of LLMs.
🧵 17/n
#EPPI-Reviewer now supports using an LLM (limited to GPT-4o, for now) to run screening and data extraction tasks.
The LLM can be fed titles and abstracts or the full text.
All functionalities have been designed to *facilitate* (rather than obfuscate) their per-review evaluation.
🧵 16/n
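(A minimal sketch of what a title-and-abstract screening call like this could look like, assuming the OpenAI Python SDK. The prompt, criteria, and screen_record helper are hypothetical illustrations, not EPPI-Reviewer's actual implementation.)

```python
# Hypothetical sketch of LLM title/abstract screening, NOT EPPI-Reviewer's
# actual implementation; the prompt, criteria, and helper name are invented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = (
    "You are screening records for a systematic review.\n"
    "Inclusion criteria (example only): randomised controlled trials of "
    "exercise interventions in adults.\n"
    "Answer INCLUDE, EXCLUDE, or UNSURE, then give one sentence of reasoning."
)

def screen_record(title: str, abstract: str) -> str:
    """Ask GPT-4o for a screening decision on a single record."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep output stable enough to audit
        messages=[
            {"role": "system", "content": CRITERIA},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(screen_record("A walking programme for older adults: an RCT",
                        "We randomised 120 adults aged 65+ to ..."))
```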
🧵 15/n
Effect: every evidence synthesis effort, and every LLM task within it, NEEDS to be evaluated on its own and then in its full context.
🧵 14/n
- Every specific prompt,
- The contents supplied,
- And, even more, the interaction between the two.
What we found early on is that most "hallucinations" happen when the question asked implies an assumption that isn't valid for a given "content".
🧵 13/n
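(A made-up illustration of that prompt/content interaction: the first question presupposes facts a given paper may not contain, inviting a hallucination, while the second states the assumption and offers an exit. Both prompts are invented for this example.)

```python
# Hypothetical prompts, for illustration only.
# The "loaded" question presupposes a multi-arm trial with reported sizes,
# which may simply be false for the paper being processed.
loaded_prompt = "What was the sample size in each arm of the trial?"

# The guarded version checks the assumption first and allows a clean exit.
guarded_prompt = (
    "First, state whether this paper reports a trial with multiple arms. "
    "If it does, report the sample size per arm. "
    "If it does not, or the sizes are not given, answer exactly: NOT REPORTED."
)
```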
Any new feature we ship needs to be matched by evaluation systems that are effective, accessible to regular users, and not too onerous, so users can assess in full how well it works.
This is paramount, because as a general rule, evaluations DO NOT generalise.
🧵 12/n
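(A sketch of the kind of per-review check this implies: comparing LLM screening decisions against human decisions on a double-screened sample. The data layout is an assumption for illustration, not EPPI-Reviewer's evaluation tooling.)

```python
# Hypothetical per-review evaluation: compare LLM decisions against human
# "gold standard" decisions on a sample of records from ONE review.
def evaluate(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """pairs = [(human_included, llm_included), ...] for a single review."""
    tp = sum(h and m for h, m in pairs)          # both say include
    fn = sum(h and not m for h, m in pairs)      # LLM misses an include
    tn = sum(not h and not m for h, m in pairs)  # both say exclude
    fp = sum((not h) and m for h, m in pairs)    # LLM wrongly includes
    return {
        # Recall matters most here: missed studies bias the synthesis.
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

print(evaluate([(True, True), (True, False), (False, False), (False, True)]))
```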
Thus, from 2024 onwards, we developers have been in "marathon-length sprint" mode.
The main focus is to deliver facilities to leverage LLMs in ways that maximise the trustworthiness of results.
🧵 11/n
1. LLMs will be used in Evidence Synthesis, whatever our positions on them may be.
2. Thus, what we want to do is help the field use LLMs responsibly.
🧵 10/n
🧵 9/n
(With many thanks to Lena Schmidt, @kaitlynhair.bsky.social, Fiona Campbell, @clkapp.bsky.social, Alireza Khanteymoori, Dawn Craig and Mark Engelbert.)
The take-home message was: yes, this kind of tech will be used in Evidence Synthesis, a lot.
🧵 8/n
🧵 7/n
🧵 6/n
🧵 5/n
Thus, the 1st question was: can we really use LLMs and get reliable results?
🧵 4/n
🧵 3/n
It's been quite a journey, so perhaps it's worth sharing some thoughts.
🧵 2/n