
HKU IDS Scholar Seminar Series #24:

Unlocking Interpretable Control for Large Language Models and Beyond via Sparse Autoencoders

Speaker

Prof Difan Zou, Assistant Professor, HKU IDS & Department of Computer Science, School of Computing and Data Science, HKU

Date

Mar 20, 2026 (Fri)

Time

11:00am – 12:00 noon

Venue

Tam Wing Fan Innovation Wing Two | Zoom

Mode

Hybrid. Seats for on-site participants are limited. A confirmation email will be sent to participants who have successfully registered.

Abstract

Interpretability is the foundation of trustworthy and controllable large language model (LLM) deployment, yet unlocking actionable insights from black-box models remains a critical challenge. Sparse Autoencoders (SAEs) have emerged as a transformative solution, decomposing dense LLM hidden states into sparse, human-understandable features that reveal the model’s internal mechanisms.
 
In this talk, I will discuss how SAEs enable genuine interpretability by isolating semantically coherent latent features, offering a clear window into which internal representations correspond to specific concepts. I will show core applications of this interpretability: targeted model unlearning that precisely removes unwanted knowledge without sacrificing general capability, controlled model steering that refines behavior through meaningful feature interventions, and extensions to emerging architectures like diffusion language models (DLMs), where SAEs unlock new interpretive and manipulative possibilities. I will also share key considerations for balancing interpretability, utility, and robustness, and outline how SAE-driven insights are turning theoretical interpretability into a practical tool for building more reliable, responsible AI systems. Join us to explore how SAEs are redefining what it means to understand and control LLMs.

Speaker

Prof Difan ZOU

Assistant Professor @ HKU IDS & SCDS

Professor Difan Zou is an Assistant Professor in HKU IDS and the Department of Computer Science, School of Computing and Data Science, at The University of Hong Kong. He received his Ph.D. in Computer Science from the University of California, Los Angeles (UCLA). He received a B.S. degree in Applied Physics from the School of Gifted Young, USTC, and an M.S. degree in Electrical Engineering from USTC. He has published multiple papers at top-tier machine learning conferences, including ICML, NeurIPS, ICLR, and COLT. He is a recipient of the Bloomberg Data Science Ph.D. Fellowship. His research interests are broadly in machine learning, optimization, and learning structured data (e.g., time-series or graph data), with a focus on the theoretical understanding of optimization and generalization in deep learning problems.

For the full biography of Prof. Zou, please refer to: https://datascience.hku.hk/people/difan-zou/

Moderator

Prof Andrew Luo

Assistant Professor @ HKU IDS & PSYC

Professor Andrew Luo is an Assistant Professor at the HKU Musketeers Foundation Institute of Data Science (IDS) and the Department of Psychology, The University of Hong Kong. He received his PhD in Neural Computation & Machine Learning from Carnegie Mellon University (advised by Prof. Michael J. Tarr and Prof. Leila Wehbe) and his BSc in Computer Science from MIT. His research sits at the intersection of computer vision, human visual representations, scene learning, and generative models, with a focus on building machine learning systems that perceive and understand the world in human-like ways, bridging cognitive science and AI.

For the full biography of Prof. Luo, please refer to: https://datascience.hku.hk/people/andrew-luo/

For information, please contact:
Email: datascience@hku.hk