BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Date iCal//NONSGML kigkonsult.se iCalcreator 2.20.2//
METHOD:PUBLISH
X-WR-CALNAME;VALUE=TEXT:Eventi DIAG
BEGIN:VTIMEZONE
TZID:Europe/Paris
BEGIN:STANDARD
DTSTART:20251026T030000
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20260329T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:calendar.30070.field_data.0@www.diag.uniroma1.it
DTSTAMP:20260411T172331Z
CREATED:20260221T222214Z
DESCRIPTION:Abstract\n\nIn this talk\, we investigate the robustness an
 d reliability of benchmark-based evaluation for Large Language Models b
 y systematically paraphrasing benchmark questions. We show that small w
 ording changes can alter model predictions and reduce accuracy across m
 ultiple benchmarks and models\, even when overall rankings remain stabl
 e. These findings suggest that current benchmarks may overestimate true
  generalization and motivate paraphrase-aware evaluation practices.\n\n
 Bio\n\nKevin Roitero is a tenure-track Assistant Professor at the Unive
 rsity of Udine. His research focuses on AI\, NLP\, and Information Retr
 ieval\, with work published at top venues such as SIGIR\, WSDM\, WWW\,
  and CIKM\, and recognized with multiple awards.
DTSTART;TZID=Europe/Paris:20260221T231500
DTEND;TZID=Europe/Paris:20260221T231500
LAST-MODIFIED:20260222T221919Z
LOCATION:Aula B203
SUMMARY:On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
  - Kevin Roitero
URL;VALUE=URI:http://www.diag.uniroma1.it/node/30070
END:VEVENT
END:VCALENDAR
