Safety Neurons in Large Language Models

Speaker:

Stjepan Picek

Event date:

Friday, 8 May, 2026 - 10:00

Location:

Aula A3, via Ariosto 25, Rome

Contact:

Massimo Mecella

Abstract

Large language models (LLMs) achieve state-of-the-art performance across a wide range of tasks, but their widespread deployment raises urgent concerns around security, privacy, and misuse. Building on recent progress in sparse mechanistic interpretability, particularly findings from vision models, this talk examines the hypothesis that a small set of neurons or features may play a disproportionate role in safety-aligned behavior in LLMs. We begin by presenting methods for identifying such sparse, interpretable substructures and evaluating how inference-time manipulation of these components can degrade safety behavior in both white-box and black-box settings. We then extend this perspective to Mixture-of-Experts (MoE) models, introducing a training-free, lightweight, and architecture-agnostic framework for probing and stress-testing the safety alignment of modern MoE LLMs during inference. Finally, the talk discusses the broader implications and applications of “safety features,” including their role in safety-relevant behavior in code-generation models, and highlights opportunities for more robust alignment and defense.

Bio. Stjepan Picek is a full professor at the University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia. He also holds an associate professor position at Radboud University, Nijmegen, and an adjunct professor position at the University of Bergen, Norway. Before that, he was an assistant professor at TU Delft and a postdoctoral researcher at MIT, USA, and KU Leuven, Belgium. Stjepan completed a PhD in computer science in 2015 at the University of Zagreb, Croatia, and Radboud University, The Netherlands. In 2024, he completed a PhD in mathematics at the University of Paris 8, France. His research interests include security and cryptography, machine learning, and evolutionary computation. To date, Stjepan has given more than 80 invited talks and published more than 200 refereed papers. He is a program committee member and reviewer for a number of conferences and journals and a member of several professional societies. His work has been featured in the mainstream media and on popular technology blogs. He is a member of ELLIS and a Fellow of the Young Academy of Europe.

Stjepan Picek is visiting Sapienza/DIAG as part of the EMAI program to conduct research and teaching activities.

Research group:

Processes, Services and Software Engineering

Keywords:

Artificial Intelligence and Robotics
Security for cyber-physical systems