@pavlinpolicar.bsky.social
One encouraging takeaway from the study is that, at least on this particular task, open-source LLMs prove just as capable as OpenAI's commercial GPT-4o. This means that universities could run their own LLM graders in-house without fear of compromising student privacy. 7/
January 29, 2025 at 7:16 PM
So, are we human TAs obsolete?

Well, not quite.

First, setting up good grading rubrics takes quite a bit of time and effort. Second, LLMs achieved an accuracy of 90%, which still leaves room for improvement. That said, newer models may well perform even better! 6/
In terms of feedback, students actually seem to slightly *prefer* feedback written by LLMs over human-written feedback. While there is some nuance to this result, the conclusion is clear: students are at least as happy with LLM-generated feedback as with TA-written feedback. 5/
In our setup, LLMs determined whether answers satisfied predefined grading criteria that we, the TAs, painstakingly prepared ahead of time. Here, LLMs achieve roughly 90% accuracy. Small LLMs work well on easier questions but are overly generous on harder, open-ended questions. 4/
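The setup above can be sketched in a few lines: each rubric criterion is posed to the model as a yes/no question about a student answer, the verdict is parsed, and accuracy is measured against the TA's labels. This is a minimal illustrative sketch, not the study's actual code; `build_prompt`, `parse_verdict`, and `accuracy` are hypothetical names, and the chat-completion call (GPT-4o, Llama 3, ...) is left as a plug-in function.

```python
def build_prompt(question: str, criterion: str, answer: str) -> str:
    # One rubric criterion per prompt, phrased as a yes/no question.
    return (
        f"Question: {question}\n"
        f"Grading criterion: {criterion}\n"
        f"Student answer: {answer}\n"
        "Does the answer satisfy the criterion? Reply YES or NO."
    )

def parse_verdict(text: str) -> bool:
    # Take the model's first token; anything other than YES counts as NO.
    stripped = text.strip()
    token = stripped.split()[0].upper().rstrip(".,!") if stripped else ""
    return token == "YES"

def accuracy(predicted: list[bool], ta_labels: list[bool]) -> float:
    # Fraction of criteria where the LLM verdict matches the TA label.
    return sum(p == t for p, t in zip(predicted, ta_labels)) / len(ta_labels)
```

In use, `parse_verdict` would wrap whatever completion function is available, e.g. `parse_verdict(complete(build_prompt(q, c, a)))`, and `accuracy` compares those verdicts against the TA's hand labels per criterion.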
We tested several models, including OpenAI's GPT-4o and several open-source Llama 3 models of varying sizes. So, can LLMs grade student assignments?

The short answer is "mostly yes".

There are two aspects to grading student answers: the grade and the feedback. 3/
We wanted to see whether LLMs could grade short text answers as well as (or better than) human TAs. Over the course of the semester, students answered 36 questions of varying difficulty, and their answers were randomly assigned to be graded by either a human TA or an LLM. 2/