On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Speaker:
Kevin Roitero
Event date:
Saturday, 21 February 2026 - 23:15
Location:
Aula B203
Contact:
fsilvestri@diag.uniroma1.it
Abstract
In this talk, we investigate the robustness and reliability of benchmark-based evaluation of Large Language Models by systematically paraphrasing benchmark questions. We show that small wording changes can alter model predictions and reduce accuracy across multiple benchmarks and models, even when overall rankings remain stable. These findings suggest that current benchmarks may overestimate true generalization and motivate paraphrase-aware evaluation practices.
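To make the paraphrase-aware evaluation idea concrete, the following is a minimal sketch of how one might compare a model's behaviour on original versus paraphrased benchmark questions. It assumes a multiple-choice benchmark and a black-box answering function; the names (`Item`, `evaluate`, `answer_fn`) are hypothetical illustrations, not the speaker's actual code or data.

```python
# Hypothetical sketch: measure accuracy and prediction flips under paraphrasing.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str    # original benchmark question
    paraphrase: str  # semantically equivalent rewording
    gold: str        # correct option label, e.g. "B"


def evaluate(items: List[Item], answer_fn: Callable[[str], str]) -> dict:
    """Compare model behaviour on original vs. paraphrased questions."""
    acc_orig = acc_para = flips = 0
    for item in items:
        pred_orig = answer_fn(item.question)
        pred_para = answer_fn(item.paraphrase)
        acc_orig += pred_orig == item.gold
        acc_para += pred_para == item.gold
        flips += pred_orig != pred_para  # prediction changed under paraphrase
    n = len(items)
    return {
        "accuracy_original": acc_orig / n,
        "accuracy_paraphrased": acc_para / n,
        "flip_rate": flips / n,  # fraction of items whose answer changed
    }


if __name__ == "__main__":
    # Toy stand-in for a real LLM: answers "A" only if the question mentions "capital".
    toy_model = lambda q: "A" if "capital" in q.lower() else "B"
    items = [
        Item("What is the capital of France?", "Which city is France's capital?", "A"),
        Item("2 + 2 equals what?", "What is the sum of two and two?", "B"),
    ]
    print(evaluate(items, toy_model))
```

Reporting the flip rate alongside the accuracy drop is useful because predictions can change substantially under paraphrasing even when aggregate accuracy, and hence model rankings, stay roughly the same.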
Bio
Kevin Roitero is a tenure-track Assistant Professor at the University of Udine. His research focuses on AI, NLP, and Information Retrieval, with work published at top venues such as SIGIR, WSDM, WWW, and CIKM, and recognized with multiple awards.