EPPI Reviewer (@eppi-reviewer.bsky.social)
News and updates for EPPI Reviewer, software for systematic reviews, literature reviews and meta-analysis.
So, if you worry that LLMs (or so-called AI) will inevitably degrade the quality of evidence synthesis work, and thus devalue it, perhaps we can help.
This tech can be used responsibly, with some effort. We at EPPI Reviewer/EPPI Centre are trying to do so, and trying to help others too.
🧵 18/18
February 13, 2025 at 3:26 PM
Meanwhile, staff at the @eppicentre.bsky.social, along with collaborators, are busy designing, running, and reporting on "Studies Within A Review" (SWARs). This is done for two purposes:
1. Ensure our results can be trusted.
2. Produce published examples of responsible use & evaluation of LLMs.
🧵 17/n
February 13, 2025 at 3:26 PM
Wrapping up:
#EPPI-Reviewer now supports using an LLM (limited to GPT-4o, for now) to run screening and data extraction tasks.
The LLM can be fed titles and abstracts or the full text.
All functionalities have been designed to *facilitate* (rather than obfuscate) their per-review evaluation.
🧵 16/n
February 13, 2025 at 3:24 PM
[One under-evaluated, or rather borderline ignored, issue is that of error propagation: the more steps are "automated", the more stages at which errors can occur, and those errors are likely to compound non-linearly, progressively undermining the trustworthiness of results.]
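A back-of-the-envelope illustration of the compounding (a simplified sketch assuming independent per-stage error rates, which if anything understates the problem, since an error at one stage changes what the next stage sees):

```python
# Simplified model: each automated stage handles a record correctly with
# probability p; assuming (unrealistically) independent stages, the chance
# a record survives k stages without any error is p**k.
for p in (0.99, 0.95, 0.90):
    for k in (2, 4, 6):
        print(f"per-stage accuracy {p:.2f}, {k} stages -> end-to-end {p**k:.3f}")
# Real pipelines are usually worse than this multiplicative model, because
# stage errors interact rather than occurring independently.
```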
🧵 15/n
February 13, 2025 at 3:24 PM
For example, the machine might make up an answer to the "how many participants?" question when supplied with an RCT protocol rather than the actual study report.
The effect: every evidence synthesis effort, and every LLM task within it, NEEDS to be evaluated on its own and then in its full context.
🧵 14/n
February 13, 2025 at 3:23 PM
LLMs are necessarily extremely sensitive to:
- Every specific prompt,
- The contents supplied,
- And even more, the interaction between the two.
What we found early on is that most "hallucinations" happen when the question asked implies an assumption that isn't valid for a given "content".
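As a purely illustrative sketch (not EPPI Reviewer's actual prompt or code; the model name, wording, and the openai client shown here are assumptions), one common mitigation is to state the possibly-invalid assumption explicitly and give the model a way out:

```python
from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # expects OPENAI_API_KEY in the environment

document_text = "..."  # the title/abstract or full text actually supplied

prompt = (
    "Using ONLY the text below, answer: how many participants were randomised?\n"
    "If the text does not report this (for instance, it is a protocol rather than\n"
    "a results paper), answer exactly 'Not reported' instead of guessing.\n\n"
    f"TEXT:\n{document_text}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice of model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,   # favour reproducible answers, which also eases evaluation
)
print(response.choices[0].message.content)
```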
🧵 13/n
February 13, 2025 at 3:22 PM
The guiding principles are conceptually simple, though.
Any new feature we ship needs to be matched by systems for evaluating, in full, how well it works: systems that are effective, accessible to regular users, and not too onerous.
This is paramount, because as a general rule, evaluations DO NOT generalise.
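To make "evaluate in full" concrete, a minimal per-review check might compare the LLM's include/exclude decisions with human decisions on a coded sample and report sensitivity and specificity (a hypothetical sketch; the numbers are invented, and EPPI Reviewer's built-in evaluation facilities are more involved than this):

```python
# Hypothetical per-review check of LLM screening decisions against human ones.
# 1 = include, 0 = exclude; values below are made up for illustration.
human = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
llm   = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

tp = sum(h == 1 and m == 1 for h, m in zip(human, llm))
fn = sum(h == 1 and m == 0 for h, m in zip(human, llm))
tn = sum(h == 0 and m == 0 for h, m in zip(human, llm))
fp = sum(h == 0 and m == 1 for h, m in zip(human, llm))

sensitivity = tp / (tp + fn)  # share of true includes the LLM caught
specificity = tn / (tn + fp)  # share of true excludes the LLM caught
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```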
🧵 12/n
February 13, 2025 at 3:21 PM
A side effect is that we need to move fast, but move thoughtfully and effectively too.
Thus, from 2024 onwards, we developers have been in "marathon-length, sprint mode".
The main focus is to deliver facilities to leverage LLMs in ways that maximise the trustworthiness of results.
🧵 11/n
February 13, 2025 at 3:20 PM
We still challenge one another with such questions, but we are also all aware that we have fully committed to a general strategy.
1. LLMs will be used in Evidence Synthesis - no matter what our individual positions are.
2. Thus, what we want to do is to help the field to use LLMs responsibly.
🧵 10/n
February 13, 2025 at 3:19 PM
For us working on #EPPI-Reviewer, this single clear "answer" opened up a multitude of new questions about what we *should* do, how to manage risks without curbing our effectiveness, and, more importantly, what kind of "effects" we wanted to have.
🧵 9/n
February 13, 2025 at 3:18 PM
The results are here: ceur-ws.org/Vol-3832/pap...
(With many thanks to Lena Schmidt, @kaitlynhair.bsky.social, Fiona Campbell, @clkapp.bsky.social, Alireza Khanteymoori, Dawn Craig and Mark Engelbert.)
The take-home message was: yes, this kind of tech will be used in Evidence Synthesis - a lot.
🧵 8/n
February 13, 2025 at 3:17 PM
At the Hackathon, we were able to run a very small but, we think, carefully planned study that we hoped could begin answering our initial question and, concurrently, provide an example of how to run such studies responsibly.
🧵 7/n
February 13, 2025 at 3:17 PM
At the same time, it was very evident that lots of quick-and-dirty evaluations were being run and/or published, often flawed, not only because of small sample sizes but also, more worryingly, because of not-very-thoughtful methods and/or an excess of optimism.
🧵 6/n
February 13, 2025 at 3:16 PM
The only way to find out was to try. So we put together "proof of concept" functionalities and put them to a preliminary test with the invaluable help of external collaborators, gathered for the Evidence Synthesis Hackathon (2023 - www.eshackathon.org).
🧵 5/n
February 13, 2025 at 3:15 PM
This is important, because it can potentially mitigate or even remove the problem of (so-called) hallucinations, and it can be applied to many, if not most, steps of evidence synthesis, broadly defined.
Thus, the first question was: can we really use LLMs and get reliable results?
🧵 4/n
February 13, 2025 at 3:14 PM
Back when GPT was "new", it was very clear to us that Large Language Models (LLMs) could be useful (and thus disruptive) in the field of evidence synthesis, because in our use case one could ask the machine questions, supplied along with the text that should be used to answer them.
🧵 3/n
February 13, 2025 at 3:14 PM
It took us 15 months or more to move from "proof of concept" to releasing a version of EPPI Reviewer that integrates comprehensive LLM-driven functionalities, available to all users (see: eppi.ioe.ac.uk/cms/Default....).
It's been quite a journey, so perhaps it's worth sharing some thoughts.
🧵 2/n
Latest Changes (12/02/2025 - V6.16.0.0) - Forum announcements
February 13, 2025 at 3:13 PM