On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Speaker:
Kevin Roitero
Event date:
Saturday, 21 February 2026 - 23:15
Location:
Aula B203
Contact:
fsilvestri@diag.uniroma1.it
Abstract
In this talk, we investigate the robustness and reliability of benchmark-based evaluation of Large Language Models by systematically paraphrasing benchmark questions. We show that small wording changes can alter model predictions and reduce accuracy across multiple benchmarks and models, even when overall rankings remain stable. These findings suggest that current benchmarks may overestimate true generalization and motivate paraphrase-aware evaluation practices.
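To make the paraphrase-aware evaluation idea concrete, the following is a minimal sketch of how one might compare a model's behaviour on original versus paraphrased benchmark questions. It assumes a multiple-choice benchmark and a black-box answering function; the names (`Item`, `evaluate`, `answer_fn`) are hypothetical illustrations, not the speaker's actual code or data.

```python
# Hypothetical sketch: measure accuracy and prediction flips under paraphrasing.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str    # original benchmark question
    paraphrase: str  # semantically equivalent rewording
    gold: str        # correct option label, e.g. "B"


def evaluate(items: List[Item], answer_fn: Callable[[str], str]) -> dict:
    """Compare model behaviour on original vs. paraphrased questions."""
    acc_orig = acc_para = flips = 0
    for item in items:
        pred_orig = answer_fn(item.question)
        pred_para = answer_fn(item.paraphrase)
        acc_orig += pred_orig == item.gold
        acc_para += pred_para == item.gold
        flips += pred_orig != pred_para  # prediction changed under paraphrase
    n = len(items)
    return {
        "accuracy_original": acc_orig / n,
        "accuracy_paraphrased": acc_para / n,
        "flip_rate": flips / n,  # fraction of items whose answer changed
    }


if __name__ == "__main__":
    # Toy stand-in for a real LLM: answers "A" only if the question mentions "capital".
    toy_model = lambda q: "A" if "capital" in q.lower() else "B"
    items = [
        Item("What is the capital of France?", "Which city is France's capital?", "A"),
        Item("2 + 2 equals what?", "What is the sum of two and two?", "B"),
    ]
    print(evaluate(items, toy_model))
```

Reporting the flip rate alongside the accuracy drop is useful because predictions can change substantially under paraphrasing even when aggregate accuracy, and hence model rankings, stay roughly the same.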
Bio
Kevin Roitero is a tenure-track Assistant Professor at the University of Udine. His research focuses on AI, NLP, and Information Retrieval, with work published at top venues such as SIGIR, WSDM, WWW, and CIKM, and recognized with multiple awards.