
Program of CPAL 2024

Program at a glance

January 3 January 4 January 5 January 6
Morning Session
Registration
8:00 am – 9:00 am
Opening Remarks
9:15 am – 10:00 am
Rising Star 2
10:00 am – 11:00 am
Rising Star 3
10:00 am – 11:00 am
Coffee Break
Rising Star 1
11:20 am – 12:30 pm
Oral Session 2
11:20 am – 12:30 pm
Oral Session 4
11:20 am – 12:30 pm
Open-Door Roundtable
(Organization Committee)
11:20 am – 12:30 pm
Lunch Break
OPEN TO PUBLIC:
Half-Day Tutorials
(two parallels)
(1) Low-dim in DNNs
(2) Inverse Problems
1:30 pm – 3:40 pm

Room: LG.17 & LG.18, CPD
Oral Session 1
2:30 pm – 3:40 pm
Oral Session 3
2:30 pm – 3:40 pm
Oral Session 5
2:30 pm – 3:40 pm
Coffee Break
Panel Discussion
4:00 pm – 5:00 pm
Oral Session 6
4:00 pm – 5:00 pm
OPEN TO PUBLIC:
Half-Day Tutorials
(two parallels) (cont.)
4:00 pm – 6:30 pm

Room: LG.17 & LG.18, CPD
Cocktail Reception
5:00 pm – 6:30 pm
Spotlight Poster 1
5:00 pm – 6:30 pm
Spotlight Poster 2
5:00 pm – 6:30 pm
Conference Dinner
7:00 pm – 9:00 pm
Tram Party
7:00 pm – 9:00 pm

** The Conference Organizer reserves the right to make changes to the program without prior notice.

Keynote Speakers

Professor Dan Alistarh
Institute of Science and Technology Austria
Professor SueYeon Chung
New York University
Professor Kostas Daniilidis
University of Pennsylvania
Professor Maryam Fazel
University of Washington
Professor Tom Goldstein
University of Maryland
Professor Yingbin Liang
Ohio State University
Professor Dimitris Papailiopoulos
University of Wisconsin-Madison
Professor Stefano Soatto
Amazon
University of California, Los Angeles 
 
Professor Jong Chul Ye
Korea Advanced Institute of Science & Technology (KAIST)

Talk Details

Abstract
A key barrier to the wide deployment of highly-accurate machine learning models, whether for language or vision, is their high computational and memory overhead. Although we possess the mathematical tools for highly-accurate compression of such models, these theoretically-elegant techniques require second-order information of the model’s loss function, which is hard to even approximate efficiently at the scale of billion-parameter models. In this talk, I will describe our work on bridging this computational divide, which enables the accurate second-order pruning and quantization of models at truly massive scale. Compressed using our techniques, models with billions and even trillions of parameters can be executed efficiently on a few GPUs, with significant speedups, and negligible accuracy loss. Based in part on our work, the community has been able to run accurate billion or even trillion-parameter models on computationally-limited devices.
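For orientation, the classical second-order pruning criterion that this line of work scales up is the Optimal Brain Surgeon saliency; the sketch below uses standard OBS notation and is not quoted from the talk:

\[
\Delta\mathcal{L}_q \;\approx\; \frac{w_q^2}{2\,[H^{-1}]_{qq}}, \qquad H = \nabla^2 \mathcal{L}(w),
\]

with the weight of smallest saliency pruned and the remaining weights updated by \(\delta w = -\tfrac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q\). The computational barrier referred to above is forming and inverting (an approximation of) \(H\) at the scale of modern models.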

Bio
Dan Alistarh is a Professor at IST Austria, in Vienna. Previously, he was a Researcher with Microsoft, a Postdoc at MIT CSAIL, and received his PhD from the EPFL. His research is on algorithms for efficient machine learning and high-performance computing, with a focus on scalable DNN inference and training, for which he was awarded an ERC Starting Grant in 2018. In his spare time, he works with the ML research team at Neural Magic, a startup based in Boston, on making compression faster, more accurate and accessible to practitioners.

Abstract
A central goal in neuroscience is to understand how orchestrated computations in the brain arise from the properties of single neurons and networks of such neurons. Answering this question requires theoretical advances that shine a light on the ‘black box’ of representations in neural circuits. In this talk, we will demonstrate theoretical approaches that help describe how cognitive task implementations emerge from the structure in neural populations and from biologically plausible neural networks.
We will introduce a new theory that connects geometric structures that arise from neural population responses (i.e., neural manifolds) to the neural representation’s efficiency in implementing a task. In particular, this theory describes how many neural manifolds can be represented (or ‘packed’) in the neural activity space while they can be linearly decoded by a downstream readout neuron. The intuition from this theory is remarkably simple: like a sphere packing problem in physical space, we can encode many “neural manifolds” into the neural activity space if these manifolds are small and low-dimensional, and vice versa.
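In the notation commonly used for this manifold-capacity theory (a sketch of the definitions, not a statement from the talk), the capacity is the ratio

\[
\alpha \;=\; \frac{P}{N},
\]

the number of object manifolds \(P\) that can be stored per ambient dimension \(N\) while remaining linearly separable with high probability; the critical value of \(\alpha\) grows as the manifolds become smaller and lower-dimensional, which is exactly the packing intuition described above.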
Next, we will describe how such an approach can, in fact, open the ‘black box’ of distributed neuronal circuits in a range of settings, such as experimental neural datasets and artificial neural networks. In particular, our method overcomes the limitations of traditional dimensionality reduction techniques, as it operates directly on the high-dimensional representations. Furthermore, this method allows for simultaneous multi-level analysis, by measuring geometric properties in neural population data and estimating the amount of task information embedded in the same population.
Finally, we will discuss our recent efforts to fully extend this multi-level description of neural populations by (1) understanding how task-implementing neural manifolds emerge across brain regions and during learning, (2) investigating how neural tuning properties shape the representation geometry in early sensory areas, and (3) demonstrating the impressive task performance and neural predictivity achieved by optimizing a deep network to maximize the capacity of neural manifolds. By expanding our mathematical toolkit for analyzing representations underlying complex neuronal networks, we hope to contribute to the long-term challenge of understanding the neuronal basis of tasks and behaviors.

Bio
SueYeon Chung is an Assistant Professor in the Center for Neural Science at NYU, with a joint appointment in the Center for Computational Neuroscience at the Flatiron Institute, an internal research division of the Simons Foundation. She is also an affiliated faculty member at the Center for Data Science and Cognition and Perception Program at NYU. Prior to joining NYU, she was a Postdoctoral Fellow in the Center for Theoretical Neuroscience at Columbia University, and BCS Fellow in Computation at MIT. Before that, she received a Ph.D. in applied physics at Harvard University, and a B.A. in mathematics and physics at Cornell University. She received the Klingenstein-Simons Fellowship Award in Neuroscience in 2023. Her main research interests lie at the intersection between computational neuroscience and deep learning, with a particular focus on understanding and interpreting neural computation in biological and artificial neural networks by employing methods from neural network theory, statistical physics, and high-dimensional statistics.

Abstract
Equivariant representations are crucial in various scientific and engineering domains because they encode the inherent symmetries present in physical and biological systems, thereby providing a more natural and efficient way to model them. In the context of machine learning and perception, equivariant representations ensure that the output of a model changes in a predictable way in response to transformations of its input, such as 2D or 3D rotation or scaling. In this talk, we will show a systematic way of how to achieve equivariance by design and how such an approach can yield parsimony in training data and model capacity. 
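For concreteness, the standard definition underlying the talk: a map \(f\) is equivariant to a group \(G\) acting on inputs and outputs through representations \(\rho_{\mathrm{in}}\) and \(\rho_{\mathrm{out}}\) if

\[
f\big(\rho_{\mathrm{in}}(g)\,x\big) \;=\; \rho_{\mathrm{out}}(g)\, f(x) \qquad \text{for all } g \in G,
\]

so a rotation or scaling of the input produces the corresponding, predictable transformation of the output rather than an arbitrary change.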

Bio
Kostas Daniilidis is the Ruth Yalom Stone Professor of Computer and Information Science at the University of Pennsylvania, where he has been on the faculty since 1998. He is an IEEE Fellow. He was the director of the GRASP laboratory from 2008 to 2013, Associate Dean for Graduate Education from 2012 to 2016, and Faculty Director of Online Learning from 2013 to 2017. He obtained his undergraduate degree in Electrical Engineering from the National Technical University of Athens in 1986, and his PhD (Dr.rer.nat.) in Computer Science from the University of Karlsruhe in 1992, under the supervision of Hans-Hellmut Nagel. He received the Best Conference Paper Award at ICRA 2017. He co-chaired ECCV 2010 and 3DPVT 2006. His most cited works have been on event-based vision, equivariant learning, 3D human pose, and hand-eye calibration.

Abstract
Many behaviors observed in deep neural networks still lack satisfactory explanation; e.g., how does an overparameterized neural network avoid overfitting and generalize to unseen data? Empirical evidence suggests that generalization depends on which zero-loss local minimum is attained during training. The shape of the training loss around a local minimum affects the model’s performance: “Flat” minima—around which the loss grows slowly—appear to generalize well. Clarifying this phenomenon helps explain generalization properties, which still largely remain a mystery.

In this talk we focus on a simple class of overparameterized nonlinear models, those arising in low-rank matrix recovery. We study several key models: matrix sensing, phase retrieval, robust Principal Component Analysis, covariance matrix estimation, and single hidden layer neural networks with quadratic activation. We prove that in these models, flat minima (measured by average curvature) exactly recover the ground truth under standard statistical assumptions, and we prove weak recovery for matrix completion. These results suggest (i) a theoretical basis for favoring methods that bias iterates towards flat solutions, and (ii) the use of the Hessian trace as a regularizer. Since the landscape properties we prove are algorithm-agnostic, a future direction is to pair these findings with the analysis of common training algorithms to better understand the interplay between the loss landscape and algorithmic implicit bias.
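One compact way to write the selection rule studied here (a paraphrase in symbols, not the talk's exact statement): among zero-loss minima, prefer the flattest as measured by average curvature,

\[
\theta^\star \;\in\; \arg\min_{\theta\,:\,L(\theta)=0}\ \operatorname{tr}\,\nabla^2 L(\theta),
\]

and the results above say that, for the listed low-rank recovery models, this \(\theta^\star\) coincides with the ground truth (or achieves weak recovery in the case of matrix completion).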

Bio
Maryam Fazel is the Moorthy Family Professor of Electrical and Computer Engineering at the University of Washington, with adjunct appointments in Computer Science and Engineering, Mathematics, and Statistics. Maryam received her MS and PhD from Stanford University, her BS from Sharif University of Technology in Iran, and was a postdoctoral scholar at Caltech before joining UW. She is a recipient of the NSF Career Award, UWEE Outstanding Teaching Award, and UAI conference Best Student Paper Award with her student. She directs the Institute for Foundations of Data Science (IFDS), a multi-site NSF TRIPODS Institute. She serves on the Editorial board of the MOS-SIAM Book Series on Optimization, is an Associate Editor of the SIAM Journal on Mathematics of Data Science and an Action Editor of Journal of Machine Learning Research. Her current research interests are in the area of optimization in machine learning and control.

Abstract
This talk will have two parts. In the first part, I’ll talk about mathematical perspectives on how to watermark generative models to prevent parameter theft, ways to watermark generative model outputs to enable detection, and ways to perform post-hoc detection of language models without relying on watermarks. I’ll emphasize the important idea of using statistical hypothesis testing and p-values to provide rigorous control of the false-positive rate of detection. In the second part of the talk, I’ll present methods for constructing neural networks that exhibit “slow” thinking abilities akin to human logical reasoning. Rather than learning simple pattern matching rules, these networks have the ability to synthesize algorithmic reasoning processes and solve difficult discrete search and planning problems that cannot be solved by conventional AI systems. Interestingly, these reasoning systems naturally exhibit error correction and robustness properties that make them more difficult to break than their fast thinking counterparts.
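As a concrete illustration of the hypothesis-testing viewpoint, here is a minimal Python sketch of a green-list token detector of the kind used in language-model watermarking; the function name, the default gamma, and the normal approximation are illustrative choices, not the talk's exact recipe:

```python
import math

def watermark_pvalue(green_hits: int, num_tokens: int, gamma: float = 0.25) -> float:
    """One-sided p-value for 'more green-list tokens than chance' under the null
    hypothesis that each token lands in the green list independently with
    probability gamma (i.e., the text is unwatermarked)."""
    mean = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    z = (green_hits - mean) / std                  # standard z-statistic
    return 0.5 * math.erfc(z / math.sqrt(2.0))     # P(Z >= z) under N(0, 1)

# 120 of 200 tokens on the green list with gamma = 0.25 is wildly unlikely by chance:
print(watermark_pvalue(120, 200))  # vanishingly small p-value -> flag as watermarked
```

Rejecting only when the p-value is below a preset threshold is what gives rigorous control of the false-positive rate mentioned above.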

Bio
Tom Goldstein is the Perotto Associate Professor of Computer Science at the University of Maryland. His research lies at the intersection of machine learning and optimization, and targets applications in computer vision and signal processing. He works at the boundary between theory and practice, leveraging mathematical foundations, complex models, and efficient hardware to build practical, high-performance systems. He designs optimization methods for a wide range of platforms, ranging from powerful cluster/cloud computing environments to resource-limited integrated circuits and FPGAs. Before joining the faculty at Maryland, Tom completed his PhD in Mathematics at UCLA, and was a research scientist at Rice University and Stanford University. Tom has been the recipient of several awards, including SIAM’s DiPrima Prize, a DARPA Young Faculty Award, a JP Morgan Faculty Award, an Amazon Research Award, and a Sloan Fellowship.

Abstract
Transformers have recently revolutionized many machine learning domains, and one salient discovery is their remarkable in-context learning capability, where models can capture an unseen task by utilizing task-specific prompts without further parameter fine-tuning. In this talk, I will present our recent work that aims at understanding the in-context learning mechanism of transformers. Our focus is on the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent in order to in-context learn linear function classes. I will first present our characterization of the training convergence of in-context learning for data with balanced and imbalanced features, respectively. I will then discuss the insights that we obtain about attention models and training processes. I will also talk about the analysis techniques that we develop which may be useful for a broader set of problems. I will finally conclude my talk with comments on a few future directions.
This is a joint work with Yu Huang (UPenn) and Yuan Cheng (NUS).
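To make the setting concrete, here is a small NumPy sketch of how in-context prompts for a linear function class are typically constructed in this line of work; the names and the flattening of (x, y) pairs into one sequence are illustrative, not the talk's exact tokenization:

```python
import numpy as np

def linear_icl_instance(d: int = 8, n_examples: int = 16, seed: int = 0):
    """One in-context learning instance: a fresh task vector w, a prompt of
    (x_i, <w, x_i>) pairs followed by a query x, and the target <w, x_query>."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)                      # task sampled per prompt
    xs = rng.normal(size=(n_examples + 1, d))   # last row is the query input
    ys = xs @ w
    prompt = np.concatenate([np.column_stack([xs[:-1], ys[:-1]]).ravel(), xs[-1]])
    return prompt, ys[-1]

prompt, target = linear_icl_instance()
print(prompt.shape, target)  # the model must infer w from the prompt alone, with no weight updates
```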

Bio

Dr. Yingbin Liang is currently a Professor at the Department of Electrical and Computer Engineering at the Ohio State University (OSU). She received the Ph.D. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign in 2005, and served on the faculty of the University of Hawaii and Syracuse University before she joined OSU. Dr. Liang’s research interests include machine learning, optimization, statistical signal processing, information theory, and wireless communications. Dr. Liang received the National Science Foundation CAREER Award in 2009, and the State of Hawaii Governor Innovation Award in 2009. Her paper received the EURASIP Best Paper Award in 2014. She is currently serving as an Associate Editor for IEEE Transactions on Information Theory. She is an IEEE Fellow.

Abstract
Can a language model truly “understand” arithmetic? We explore this by trying to teach small transformers from scratch to perform elementary arithmetic operations, using the next-token prediction objective. We first demonstrate that conventional training data (i.e., “A+B=C”) is not effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions which, in some cases, can be explained through connections to low-rank matrix completion. We then train these small models on chain-of-thought data that includes intermediate steps. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We finally discuss the issue of length generalization: can a model trained on n digits add n+1 digit numbers? Humans don’t need to be taught every digit length of addition to be able to perform it. It turns out that language models aren’t great at length generalization, but we catch glimpses of it in “unstable” scenarios. Surprisingly, the infamous U-shaped overfitting curve makes an appearance!
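To illustrate what "simple formatting changes" and "chain-of-thought data" can look like for addition (an illustrative sketch; the exact formats studied in the talk may differ):

```python
def plain_format(a: int, b: int) -> str:
    return f"{a}+{b}={a + b}"

def reversed_format(a: int, b: int) -> str:
    # Emit the sum least-significant digit first, so the model can produce each
    # output digit from the operand digits (and carry) it has already processed.
    return f"{a}+{b}={str(a + b)[::-1]}"

def chain_of_thought_format(a: int, b: int) -> str:
    # Spell out digit-by-digit intermediate steps instead of only the final answer.
    steps, carry, digits = [], 0, []
    da, db = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        digits.append(str(s % 10))
        steps.append(f"{x}+{y}+{carry}={s}")
        carry = s // 10
    if carry:
        digits.append(str(carry))
    return f"{a}+{b}: " + ", ".join(steps) + " -> " + "".join(reversed(digits))

print(plain_format(57, 86))             # 57+86=143
print(reversed_format(57, 86))          # 57+86=341
print(chain_of_thought_format(57, 86))  # 57+86: 7+6+0=13, 5+8+1=14 -> 143
```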

Bio

Dimitris Papailiopoulos is an Assistant Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison, a faculty fellow of the Grainger Institute for Engineering, and a faculty affiliate at the Wisconsin Institute for Discovery. His research interests span machine learning, information theory, and distributed systems, with a current focus on efficient large-scale training algorithms and coding-theoretic techniques for robust machine learning. Between 2014 and 2016, Dimitris was a postdoctoral researcher at UC Berkeley and a member of the AMPLab. He earned his Ph.D. in ECE from UT Austin in 2014, under the supervision of Alex Dimakis. In 2007 he received his ECE Diploma and in 2009 his M.Sc. degree from the Technical University of Crete, in Greece. Dimitris is a recipient of the NSF CAREER Award (2019), two Sony Faculty Innovation Awards (2019 and 2020), a joint IEEE ComSoc/ITSoc Best Paper Award (2020), an IEEE Signal Processing Society Young Author Best Paper Award (2015), the Vilas Associate Award (2021), the Emil Steiger Distinguished Teaching Award (2021), and the Benjamin Smith Reynolds Award for Excellence in Teaching (2019). In 2018, he co-founded MLSys, a new conference that targets research at the intersection of machine learning and systems. In 2018 and 2020 he was program co-chair for MLSys, and in 2019 he co-chaired the 3rd Midwest Machine Learning Symposium.

Abstract
Large Language Models and Multimodal Foundation Models, despite the simple predictive learning criterion and absence of explicit complexity bias, have shown the ability to capture the structure and “meaning” of data. I will introduce a notion of “meaning” for large language models as equivalence classes of sentences, and describe methods to establish a geometry and topology in the space of meanings, as well as an algebra so meanings can be composed and asymmetric relations such as entailment and implication can be quantified. Meanings as equivalence classes of sentences determined by the trained embeddings can be defined, computed and quantified for pre-trained models, without the need for instruction tuning, reinforcement learning, or prompt engineering. Meanings as trajectories can be shown to align with human assessment through manually annotated benchmarks and can, as the outputs of dynamical systems, be controlled. I will show illustrative examples using both text and imaging modalities.

Bio

Professor Soatto received his Ph.D. in Control and Dynamical Systems from the California Institute of Technology in 1996; he joined UCLA in 2000 after being Assistant and then Associate Professor of Electrical and Biomedical Engineering at Washington University, and Research Associate in Applied Sciences at Harvard University. Between 1995 and 1998 he was also Ricercatore in the Department of Mathematics and Computer Science at the University of Udine, Italy. He received his D.Ing. degree (highest honors) from the University of Padova, Italy, in 1992. His general research interests are in Computer Vision and Nonlinear Estimation and Control Theory. In particular, he is interested in ways for computers to use sensory information (e.g. vision, sound, touch) to interact with humans and the environment. Dr. Soatto is the recipient of the David Marr Prize (with Y. Ma, J. Kosecka and S. Sastry of U.C. Berkeley) for work on Euclidean reconstruction and reprojection up to subgroups. He also received the Siemens Prize with the Outstanding Paper Award from the IEEE Computer Society for his work on optimal structure from motion (with R. Brockett of Harvard). He received the National Science Foundation Career Award and the Okawa Foundation Grant. He is Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and a Member of the Editorial Board of the International Journal of Computer Vision (IJCV) and Foundations and Trends in Computer Graphics and Vision.

Abstract

The recent advent of diffusion models has led to significant progress in solving inverse problems, leveraging these models as effective generative priors. Nonetheless, challenges related to the ill-posed nature of such problems remain, such as 3D extension and overcoming inherent ambiguities in measurements. In this talk, we introduce strategies to address these issues. First, to enable 3D extension using only 2D diffusion models, we propose a novel approach that uses two perpendicular pre-trained 2D diffusion models to guide the solver for the 3D inverse problem. Specifically, by modeling the 3D data distribution as a product of 2D distributions sliced in different directions, our method effectively addresses the curse of dimensionality through image guidance from the perpendicular direction. Second, drawing inspiration from the human ability to resolve visual ambiguities through perceptual biases, we introduce a novel latent diffusion inverse solver that incorporates guidance from text prompts. Specifically, our method applies a textual description of the preconception of the solution during the reverse sampling phase, a description that is dynamically reinforced through null-text optimization for adaptive negation. Our comprehensive experimental results show that our method successfully mitigates ambiguity in latent diffusion inverse solvers, enhancing their effectiveness and accuracy.

Bio

Jong Chul Ye is a Professor at the Kim Jaechul Graduate School of Artificial Intelligence (AI) of Korea Advanced Institute of Science and Technology (KAIST), Korea. He received his B.Sc. and M.Sc. degrees from Seoul National University, Korea, and his PhD from Purdue University. Before joining KAIST, he worked at Philips Research and GE Global Research in New York. He has served as an associate editor of IEEE Trans. on Image Processing and an editorial board member for Magnetic Resonance in Medicine. He is currently an associate editor for IEEE Trans. on Medical Imaging and a Senior Editor of IEEE Signal Processing Magazine. He is an IEEE Fellow, was the Chair of the IEEE SPS Computational Imaging TC, and was an IEEE EMBS Distinguished Lecturer. He was a General Co-Chair (with Mathews Jacob) of the IEEE Symposium on Biomedical Imaging (ISBI) 2020. His research interests are in machine learning for biomedical imaging and computer vision.

Rising Stars Award

The Conference on Parsimony and Learning (CPAL) launches the Rising Stars Award program to highlight exceptional junior researchers at a critical inflection and starting point in their careers: last-year PhD students, postdoctoral scholars, first-year tenure-track faculty, or industry researchers within two years of graduation. The Awardees shall share our commitment to creating a more diverse and inclusive scientific community.

Applicants from a wide variety of fields and backgrounds who are working on the parsimonious, low-dimensional structures that prevail in machine learning, signal processing, optimization, systems, interdisciplinary applications, and beyond are encouraged to apply. Awardees will be presenting at the Conference on-site as well!

More information: https://cpal.cc/rising_stars/
CPAL Rising Stars Awardees: https://cpal.cc/rising_stars_awardees/

List of Awardees

Optimization for statistical learning with low dimensional structure: regularity and conditioning

Abstract
Many statistical machine learning problems, where one aims to recover an underlying low-dimensional signal, are based on optimization. Existing work often overlooked the computational complexity of solving the optimization problem, or required case-specific algorithms and analyses, especially for nonconvex problems. This talk addresses the above two issues from a unified perspective of conditioning. In particular, we show that once the sample size exceeds the intrinsic dimension, (1) a broad class of convex and nonsmooth nonconvex problems are well-conditioned, and (2) this well-conditioning in turn ensures the efficiency of out-of-the-box optimization methods and inspires new algorithms. Lastly, we show that a conditioning notion called flatness leads to accurate recovery in overparametrized models.

Approximately Equivariant Graph Networks

Abstract
Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolutional neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to active symmetries, whereas the permutation equivariance of GNNs corresponds to passive symmetries. In this talk, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We propose approximately equivariant graph networks to implement these symmetries and investigate the symmetry model selection problem. We theoretically and empirically show a bias-variance tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group.

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Abstract
In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape’s curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

Emergent properties of heuristics in machine learning

Abstract
Successful methods in modern machine learning practice are built on solid intuition and theoretical insight by their designers, but are often ultimately heuristic and exhibit unintended emergent behaviors. Sometimes these emergent behaviors are detrimental, but surprisingly, many provide unexpected desirable benefits. By theoretically characterizing these emergent behaviors, we can develop a more robust methods development process, where more and more of these desirable behaviors can be included by design and leveraged in powerful ways. I will discuss several examples of heuristics and emergent behavior: subsampling and sketching in linear regression and their equivalence to ridge regularization; empirical risk minimization and the universality of relative performances under distribution shifts; and adaptivity in dropout and feature learning models which are equivalent to parsimony-promoting sparse or low-rank regularization.

The Future Geometric Analysis of Optimization Problems in Signal Processing and Machine Learning

Abstract
High-dimensional data analysis and estimation appear in many signal processing and machine learning applications. The underlying low-dimensional structure in these high-dimensional data inspires us to develop optimality guarantees as well as optimization-based techniques for the fundamental problems in signal processing and machine learning. In recent years, non-convex optimization has appeared widely in engineering and is often solved by heuristic local algorithms that lack global guarantees. The recent geometric/landscape analysis provides a way to determine whether an iterative algorithm can reach global optimality. The landscape of empirical risk has been widely studied in a series of machine learning problems, including low-rank matrix factorization, matrix sensing, matrix completion, and phase retrieval. A favorable geometry guarantees that many algorithms can avoid saddle points and converge to local minima. In this presentation, I will discuss potential directions for the future geometric analysis of optimization problems in signal processing and machine learning.

Sparsity in Neural Networks: Science and Practice

Abstract
Sparsity has demonstrated remarkable performance in the realm of model compression by selectively eliminating a large portion of model parameters. Nevertheless, conventional methods for discovering strong sparse neural networks often necessitate the training of an over-parameterized dense model, followed by iterative cycles of pruning and re-training. As the size of modern neural networks exponentially increases, the costs of dense pre-training and updates have become increasingly prohibitive. In this talk, I will introduce an approach that enables the training of sparse neural networks from scratch, without the need for any pre-training steps or dense updates. By achieving the property of over-parameterization in time, our approach demonstrates the capacity to achieve performance levels equivalent to fully dense networks while utilizing only a very small fraction of weights. Beyond the advantages in model compression, I will also elucidate a broader spectrum of benefits of sparsity in neural networks, including scalability, robustness, and fairness, and its great potential for building large-scale responsible AI.
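A minimal sketch of the prune-and-regrow mask update at the heart of training sparse networks from scratch (dynamic sparse training); the specific drop/grow criteria and schedules in the talk's method may differ:

```python
import numpy as np

def prune_and_grow(weights, grads, mask, update_frac=0.1):
    """Drop the smallest-magnitude active weights, then activate the same number
    of currently-inactive connections (here: largest gradient magnitude), so the
    parameter count stays fixed while the connectivity keeps exploring."""
    w, g, m = weights.ravel(), grads.ravel(), mask.ravel()
    active = np.flatnonzero(m)
    k = int(update_frac * active.size)
    drop = active[np.argsort(np.abs(w[active]))[:k]]
    m[drop] = False
    inactive = np.flatnonzero(~m)
    grow = inactive[np.argsort(-np.abs(g[inactive]))[:k]]
    m[grow], w[grow] = True, 0.0   # regrown connections restart from zero
    return weights, mask
```

Applying this update throughout training is what lets the sparse model visit many different dense-model parameters "in time" without ever materializing a dense network.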

Simulation-Calibrated Scientific Machine Learning

Abstract
Machine learning (ML) has achieved great success in a variety of applications, suggesting a new way to build flexible, universal, and efficient approximators for complex high-dimensional data. These successes have inspired many researchers to apply ML to other scientific applications such as industrial engineering, scientific computing, and operational research, where similar challenges often occur. However, the luminous success of ML is overshadowed by persistent concerns that the mathematical theory of large-scale machine learning, especially deep learning, is still lacking and the trained ML predictor is always biased. In this talk, I’ll introduce a novel framework of (S)imulation-(Ca)librated (S)cientific (M)achine (L)earning (SCaSML), which can leverage the structure of physical models to achieve the following goals: 1) make unbiased predictions even based on biased machine learning predictors; 2) beat the curse of dimensionality with an estimator that suffers from it. The SCaSML paradigm combines a (possibly) biased machine learning algorithm with a de-biasing step designed using rigorous numerical analysis and stochastic simulation. Theoretically, I’ll try to understand whether the SCaSML algorithms are optimal and what factors (e.g., smoothness, dimension, and boundedness) determine the improvement of the convergence rate. Empirically, I’ll introduce different estimators that enable unbiased and trustworthy estimation for physical quantities with a biased machine learning estimator. Applications include but are not limited to estimating the moment of a function, simulating high-dimensional stochastic processes, uncertainty quantification using bootstrap methods, and randomized linear algebra.
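As a generic illustration of how a biased ML surrogate can be combined with a small number of exact simulations to yield an unbiased estimate (a control-variate style identity E[f] = E[g] + E[f - g]; purely illustrative, not necessarily the SCaSML construction):

```python
import numpy as np

def debiased_estimate(surrogate, simulator, rng, dim=4, n_cheap=100_000, n_costly=200):
    """E[f] = E[g] + E[f - g]: many cheap surrogate evaluations plus a Monte Carlo
    correction of the residual on a few costly simulations. The correction cancels
    the surrogate's bias in expectation; the surrogate keeps the variance low."""
    x_cheap = rng.normal(size=(n_cheap, dim))
    x_costly = rng.normal(size=(n_costly, dim))
    return surrogate(x_cheap).mean() + (simulator(x_costly) - surrogate(x_costly)).mean()

rng = np.random.default_rng(0)
simulator = lambda x: np.sin(x).sum(axis=1) ** 2          # stand-in for an expensive exact model
surrogate = lambda x: np.sin(x).sum(axis=1) ** 2 + 0.3    # stand-in for a biased ML predictor
print(debiased_estimate(surrogate, simulator, rng))       # the +0.3 bias cancels in expectation
```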

Theoretical Foundations of Adversarially Robust Learning

Abstract
Despite extraordinary progress, current machine learning systems have been shown to be brittle against adversarial examples: seemingly innocuous but carefully crafted perturbations of test examples that cause machine learning predictors to misclassify. Can we learn predictors robust to adversarial examples, and how? There has been much empirical interest in this major challenge in machine learning, and in this talk, we will present a theoretical perspective. We will illustrate the need to go beyond traditional approaches and principles, such as empirical (robust) risk minimization, and present new algorithmic ideas with stronger robust learning guarantees.

Sparsity-aware generalization theory for deep neural networks

Abstract
Deep artificial neural networks achieve surprising generalization abilities that remain poorly understood. In this work, we present a new approach to analyzing generalization for deep feed-forward ReLU networks that takes advantage of the degree of sparsity that is achieved in the hidden layer activations. By developing a framework that accounts for this reduced effective model size for each input sample, we are able to show fundamental trade-offs between sparsity and generalization. Importantly, our results make no strong assumptions about the degree of sparsity achieved by the model, and they improve over recent norm-based approaches. We illustrate our results numerically, demonstrating non-vacuous bounds when coupled with data-dependent priors in specific settings, even in over-parametrized models.

The Role of Parsimonious Structures in Data for Trustworthy Machine Learning

Abstract
This talk overviews recent theoretical results in the geometric foundations of adversarially robust machine learning. Modern ML classifiers can fail spectacularly when subject to specially crafted input-perturbations, called adversarial examples. On the other hand, we humans are quite robust for several tasks involving vision. Motivated by this disconnect, in the first part of this talk we will take a deeper dive into the question of when exactly we can avoid adversarial examples. We will see that a key geometric property of the data-distribution — concentration on small-volume subsets of the input space — characterizes whether any robust classifier exists. In particular, this suggests that natural image distributions are concentrated. In the second part of this talk, we will empirically instantiate these results for a few concentrated data-distributions, and discover that utilizing such structure in data leads to classifiers that enjoy better provable robustness guarantees in several regimes. This talk is based on work at NeurIPS ’23, ’20 and TMLR ’23.

On the Sparsity-Promoting Effect of Weight Decay in Deep Learning

Abstract
Deep learning has been wildly successful in practice and most state-of-the-art artificial intelligence systems are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this talk, I present a new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of trained neural networks through the lens of sparsity. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory. This framework explains the effect of weight decay regularization in neural network training, the importance of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.
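One classical identity often used in analyses of this kind (stated here for scalars as a sketch, not the talk's full result): an \(\ell_2\) penalty on a factorized weight acts as an \(\ell_1\) penalty on the product,

\[
\min_{u,v\,:\,uv=w}\ \tfrac{1}{2}\left(u^2+v^2\right) \;=\; |w|,
\]

which is one route by which weight decay on the input and output weights of a ReLU unit promotes sparse, low-complexity solutions.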

Deep Interpretable Generative Learning for Science and Engineering

Abstract
Discriminative and generative AI are two deep learning paradigms that revolutionized prediction and generation of high-quality images from text prompts. Nonetheless, discriminative learning is unable to generate data, and deep generative models struggle with decoding capabilities. Moreover, both approaches are data-hungry and have low interpretability. These drawbacks have posed significant barriers to the adoption of deep learning in applications where a) acquiring supervised data is expensive or infeasible, and b) goals extend beyond data fitting to attain scientific insights. Furthermore, deep learning applications are fairly unexplored in fields with rich mathematical and optimization frameworks such as inverse problems, or those in which interpretability matters. This talk discusses the theory and applications of deep learning in data-limited or unsupervised inverse problems. These include applications in radar sensing, Poisson image denoising, and computational neuroscience.

Speeding up Large-Scale Machine Learning Model Development Using Low-Rank Models and Gradients

Abstract
Large-scale machine learning (ML) models, such as GPT-4 and Llama2, are at the forefront of advances in the field of AI. Nonetheless, developing these large-scale ML models demands substantial computational resources and a deep understanding of distributed ML and systems. In this presentation, I will introduce three frameworks, namely ATOMO, Pufferfish, and Cuttlefish, which use low-rank approximations on model gradients and model weights to significantly expedite ML model training. ATOMO is a general compression framework that has experimentally established that using low-rank gradients, as opposed to sparse ones, can lead to substantially faster distributed training. Pufferfish further bypasses the cost of compression by directly training low-rank models. However, directly training low-rank models usually leads to a loss in accuracy. Pufferfish mitigates this issue by training a full-rank model and then converting to a low-rank model early in the training process. Nonetheless, Pufferfish necessitates extra hyperparameter tuning, such as determining the optimal transition time from full-rank to low-rank. Cuttlefish addresses this issue by automatically estimating and adjusting these hyperparameters during training. I will present extensive experimental results on the distributed training of large-scale ML models, including LLMs, to demonstrate the efficacy of these frameworks.
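A minimal PyTorch-style sketch of the core idea of training low-rank layers directly (as in the Pufferfish/Cuttlefish line of work); the class name, rank, and sizes below are illustrative assumptions, not code from those systems:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Replace a dense out x in weight matrix W with a rank-r factorization V @ U,
    cutting parameters from out*in to roughly r*(out + in)."""
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.U = nn.Linear(in_features, rank, bias=False)   # r x in
        self.V = nn.Linear(rank, out_features, bias=True)   # out x r

    def forward(self, x):
        return self.V(self.U(x))

# A 4096 -> 4096 layer at rank 128: ~1.05M parameters vs. ~16.8M for the dense layer.
layer = LowRankLinear(4096, 4096, rank=128)
print(layer(torch.randn(2, 4096)).shape)   # torch.Size([2, 4096])
```

The accuracy/efficiency tension described above comes from choosing the rank and deciding when (or whether) to switch from full-rank to low-rank training, which is the hyperparameter question Cuttlefish automates.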

Understanding Hierarchical Representations in Deep Networks via Intermediate Features

Abstract
Over the past decade, deep learning has proven to be a highly effective method for learning meaningful features from raw data. This work attempts to unveil the mystery of hierarchical feature learning in deep networks. Specifically, in the context of multi-class classification problems, we explore how deep networks transform input data by investigating the output (i.e., features) of each layer after training. Towards this goal, we first define metrics for within-class compression and between-class discrimination of intermediate features, respectively. Through an analysis of these two metrics, we show that the evolution of features follows a simple and quantitative law from shallow to deep layers: Each layer of linear networks progressively compresses within-class features at a linear rate and discriminates between-class features at a sublinear rate. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep networks. Moreover, our extensive experiments validate our theoretical findings numerically.
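A rough NumPy proxy for the two per-layer quantities described above (within-class compression and between-class discrimination); the paper's actual metrics are defined differently, so treat this only as an illustration of what is tracked layer by layer:

```python
import numpy as np

def layer_metrics(features: np.ndarray, labels: np.ndarray):
    """features: (n_samples, dim) activations of one layer; labels: (n_samples,).
    Returns a crude (within-class variance, mean distance between class means) pair."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within = float(np.mean([features[labels == c].var(axis=0).sum() for c in classes]))
    dists = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    between = float(dists[np.triu_indices(len(classes), k=1)].mean())
    return within, between

# Evaluating such metrics layer by layer after training shows within-class spread
# shrinking with depth while class means stay separated, mirroring the law above.
```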

White-Box Transformers via Sparse Rate Reduction

Abstract
In this talk, I will present the white-box transformer — CRATE (i.e., Coding RAte reduction TransformEr). We contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing lossy coding rate. This leads to a family of white-box transformer architectures which are mathematically interpretable. Our experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers (ViTs). I will also present some recent theoretical and empirical results of CRATE on emergent behavior, language modeling, and auto-encoding.
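For reference, the objective named above can be written (up to constants and notational choices, as a sketch rather than the talk's exact statement) in terms of the lossy coding rate of a token matrix \(Z \in \mathbb{R}^{d \times N}\):

\[
R(Z) \;=\; \tfrac{1}{2}\log\det\!\Big(I + \tfrac{d}{N\epsilon^2}\, Z Z^{\top}\Big),
\qquad
\max_{f:\,Z=f(X)}\ \ R(Z) \;-\; R^{c}(Z; U_{[K]}) \;-\; \lambda \lVert Z \rVert_0,
\]

where \(R^{c}\) measures the coding rate of the tokens against a set of \(K\) subspace bases \(U_{[K]}\) and the sparsity term (relaxed in practice) rewards parsimonious codes; in this reading, self-attention compresses against \(U_{[K]}\) while the subsequent block sparsifies.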

Decoding the Information Bottleneck in Self-Supervised Learning: Pathway to Optimal Representation

Abstract
Deep Neural Networks (DNNs) have excelled in many fields, largely due to their proficiency in supervised learning tasks. However, the dependence on vast labeled data becomes a constraint when such data is scarce. Self-Supervised Learning (SSL), a promising approach, harnesses unlabeled data to derive meaningful representations. Yet, how SSL filters irrelevant information without explicit labels remains unclear. In this talk, we aim to unravel the enigma of SSL using the lens of Information Theory, with a spotlight on the Information Bottleneck principle. This principle, while providing a sound understanding of the balance between compressing and preserving relevant features in supervised learning, presents a puzzle when applied to SSL due to the absence of labels during training. We will delve into the concept of ‘optimal representation’ in SSL, its relationship with data augmentations, optimization methods, and downstream tasks, and how SSL training learns and achieves optimal representations. Our discussion unveils our pioneering discoveries, demonstrating how SSL training naturally leads to the creation of optimal, compact representations that correlate with semantic labels. Remarkably, SSL seems to orchestrate an alignment of learned representations with semantic classes across multiple hierarchical levels, an alignment that intensifies during training and grows more defined deeper into the network. Considering these insights and their implications for class set performance, we conclude our talk by applying our analysis to devise more robust SSL-based information algorithms. These enhancements in transfer learning could lead to more efficient learning systems, particularly in data-scarce environments.
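The Information Bottleneck objective referred to above, in its classical supervised form (shown for orientation; the talk's contribution concerns how an analogue plays out without labels in SSL):

\[
\min_{p(z\mid x)} \ \ I(X;Z) \;-\; \beta\, I(Z;Y),
\]

i.e., compress the representation \(Z\) of the input \(X\) as much as possible while preserving the information it carries about the label \(Y\), with \(\beta\) trading off the two terms.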

Key Dates and Deadlines

All deadlines are 23:59 Anywhere-on-Earth (AOE)

Submission Deadline: August 28th, 2023
Reviews Released, Rebuttal Stage Begins: October 16th, 2023
Rebuttal Stage Ends, Author-Reviewer Discussion Begins: October 27th, 2023
Author-Reviewer Discussion Ends: November 5th, 2023
Final Decisions Released: November 20th, 2023
Camera-Ready Deadline: December 5th, 2023

Submission Deadline: October 10th, 2023
Final Decisions Released: November 20th, 2023
Camera-Ready Deadline: December 5th, 2023

Application Deadline: October 1st, 2023
Decisions Released: October 23rd, 2023

Registration Deadline: December 15th, 2023
Conference at HKU: January 3rd - January 6th, 2024

Upcoming news

Photo gallery

Local Host and Main Sponsor

Platinum Sponsor

Gold Sponsors

 

Silver Sponsors

 

Rising Star Award Sponsor

Yoga Sponsor