HKU IDS Seminar:

Perception Encoder Meets PerceptionLM: General Visual Understanding from Hidden Features to Open Data

Speaker

Dr Andrea Madotto

Research Scientist, Fundamental AI Research, Meta

Date

May 20, 2025 (Tue)

Time

11:00am – 12:00pm

Venue

IDS Seminar Room, P603, Graduate House   |   Zoom 

Mode

Hybrid. Seats for on-site participants are limited. A confirmation email will be sent to participants who have successfully registered.

Abstract

We present a two-part research effort aimed at advancing image and video understanding through open, scalable, and general-purpose vision-language models. At the core is the Perception Encoder (PE), a state-of-the-art vision encoder trained using a simple yet powerful contrastive vision-language objective. Unlike traditional encoders that rely on task-specific pretraining, PE demonstrates that with careful scaling and robust video data integration, contrastive training alone can yield general visual embeddings that perform strongly across a wide range of tasks. Notably, these high-performing embeddings emerge not at the network’s output but within its intermediate layers. To surface these representations, we introduce two alignment techniques: language alignment for cross-modal tasks and spatial alignment for dense prediction. The resulting PE models achieve state-of-the-art performance on tasks such as zero-shot classification, retrieval, question answering, and spatial reasoning. Building on this foundation, we introduce the Perception Language Model (PLM), which leverages PE within a fully open and reproducible framework for image and video understanding. In contrast to prior work that relies on distillation from closed-source models, PLM is trained without such dependencies, using large-scale synthetic and human-annotated video data. To address key data gaps—particularly in fine-grained temporal reasoning—we release a new dataset containing 2.8M human-labeled video question-answer pairs and grounded captions. Additionally, we propose PLM–VideoBench, a new benchmark suite targeting the challenges of spatio-temporal understanding across “what”, “where”, “when”, and “how” dimensions. By combining transparent model design, high-quality data, and rigorous evaluation, our work enables reproducible, state-of-the-art research in multimodal vision-language learning.
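The contrastive vision-language objective at the core of PE pairs images with their captions and trains both encoders so that matched pairs score higher than mismatched ones. As a rough illustration only (not the PE training code), the symmetric CLIP-style loss over a batch of embeddings can be sketched as follows; the function name, shapes, and temperature value are illustrative assumptions:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of
    image/text embedding pairs (illustrative sketch, not PE code)."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N); matched pairs on the diagonal
    labels = np.arange(len(img))

    def xent(l):
        # cross-entropy pushing each row's diagonal entry to be the largest
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

With perfectly matched embeddings the diagonal dominates and the loss approaches zero; with unrelated embeddings the loss stays near the log of the batch size.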

Speaker

Dr Andrea Madotto

Research Scientist, Fundamental AI Research, Meta

Dr Andrea Madotto is a Research Scientist at Fundamental AI Research (FAIR), Meta, working on multimodal language models and LLM-driven agents (e.g., navigating UIs based on user goals). He received his PhD in Electronic & Computer Engineering from The Hong Kong University of Science and Technology (HKUST). He received an Outstanding Paper Award at ACL 2019 and the Best Paper Award at the NeurIPS 2019 ConvAI workshop. His work has been featured in MIT Technology Review and VentureBeat. He serves as a program committee member and reviewer for machine learning and natural language processing conferences such as ACL, EMNLP, NeurIPS, ICML, ICLR, and AAAI, and for journals such as Natural Language Engineering and Computer Speech & Language, and he co-presented the NeurIPS 2020 tutorial on Deeper Conversational AI.

For the full biography of Dr Madotto, please refer to: https://andreamad8.github.io/

Moderator

Prof Ping LUO

Associate Director (Innovation and Outreach), HKU Musketeers Foundation Institute of Data Science

Professor Ping Luo’s research aims at 1) developing differentiable, meta-, and reinforcement learning algorithms that enable machines and devices to solve complex tasks with greater autonomy, 2) understanding the foundations of deep learning algorithms, and 3) enabling applications in Computer Vision and Artificial Intelligence. He received his PhD in Information Engineering from the Chinese University of Hong Kong (CUHK) in 2014, supervised by Prof. Xiaoou Tang (founder of SenseTime Group Ltd.) and Prof. Xiaogang Wang.

For the full biography of Professor Luo, please refer to: https://datascience.hku.hk/people/ping-luo/

For information, please contact:
Email: datascience@hku.hk