msr asia theory lecture series is a forum where we invite researchers around the world to share the latest theoretical advances in big data, artificial intelligence, and related areas. the lecture series are broadcast live over teams and . if you would like to receive the information about the upcoming talks, please send email “subscribe to the lecture series” to .
lectures
10/23/2023: the mathematics of complex streamed data, terry lyons

abstract: complex streams of evolving data are better understood by their effects on nonlinear systems than by their values at times. the question of which nonlinear systems would seem to be context dependent, but it is not. core to rough path theory is a simple universal nonlinear system that captures all the information needed to predict any response to any nonlinear system. this idealized mathematical feature set is known as the signature of the stream. its abstract simplicity opens the possibilities for understanding and working with streams in the same context free way that calculators work with numbers. signaturebased techniques offer simple to apply universal numerical methods that are robust to irregular data and efficient at representing the order of events and complex oscillatory data. specific software can be developed and then applied across many contexts. signatures underpin prize winning contributions in recognizing chinese handwriting, in detecting sepsis, and in generating financial data, and most recently in the ability to score streams as outliers against a corpus of normal streams. this principled outlier technology has emerged as a powerful unifying technique; it identifies radio frequency interference in astronomical data, brain injury from meg data…. the underpinning theoretical contributions span a range from abstract algebra and noncommutative analysis to questions of organization of efficient numerical calculation. see . new hyperbolic partial differential equations have been developed that compute the “signature kernel” trick without ever having to introduce signatures. neural controlled differential equations can directly harness approaches such as the log ode method and consume the control as a rough path.
bio:
professor terry lyons is wallis professor emeritus and professor of mathematics at the university of oxford. he is currently pi of the datasıg program (primarily funded by epsrc), and of the complementary research programme cimdaoxford (under the support of innohk and the hksar). he was a founding member (2007) of, and then director (20112015) of, the oxford man institute of quantitative finance; he was the director of the wales institute of mathematical and computational sciences (wimcs; 20082011). he came to oxford in 2000 having previously been professor of mathematics at imperial college london (19932000), and before that he held the colin maclaurin chair at edinburgh (198593). he was president of the london mathematical society (201315).
professor lyons’s longterm research interests are focused on the mathematics of streamed data and building strong applications from these mathematical insights. his current goal is to use rough path theory to develop innovative and truly generic tools for working with streamed data and make these widely accessible through the python package roughpy. one example of this synergy comes from the signature of a stream. signatures underpin prize winning contributions in recognizing chinese handwriting, in detecting sepsis, and in generating financial data, and most recently in the ability to score streams as outliers against a corpus of normal streams. this principled outlier technology has emerged as a powerful unifying technique; it identifies radio frequency interference in astronomical data, brain injury from meg data…. the underpinning theoretical contributions span a range from abstract algebra and noncommutative analysis to questions of organization of efficient numerical calculation. see
10/18/2023: is rlhf more difficult than standard rl?, chi jin

abstract: reinforcement learning from human feedback (rlhf) learns from preference signals, while standard reinforcement learning (rl) directly learns from reward signals. preferences arguably contain less information than rewards, which makes preferencebased rl seemingly more difficult. this work theoretically proves that, for a wide range of preference models, we can solve preferencebased rl directly using existing algorithms and techniques for rewardbased rl, with small or no extra costs. specifically, (1) for preferences that are drawn from rewardbased probabilistic models, we reduce the problem to robust rewardbased rl that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von neumann winner, we reduce the problem to multiagent rewardbased rl which finds nash equilibria for factored markov games under a restricted set of policies. the latter case can be further reduced to adversarial mdp when preferences only depend on the final state. we instantiate all rewardbased rl subroutines by concrete provable algorithms and apply our theory to a large class of models including tabular mdps and mdps with generic function approximation. we further provide guarantees when kwise comparisons are available.
bio: chi jin is an assistant professor at the electrical and computer engineering department of princeton university. he obtained his phd degree in computer science at university of california, berkeley, advised by michael i. jordan. his research mainly focuses on theoretical machine learning, with special emphasis on nonconvex optimization and reinforcement learning (rl). in nonconvex optimization, he provided the first proof showing that firstorder algorithm (stochastic gradient descent) is capable of escaping saddle points efficiently. in rl, he provided the first efficient learning guarantees for qlearning and leastsquares value iteration algorithms when exploration is necessary. his works also lay the theoretical foundation for rl with function approximation, multiagent rl and partially observable rl
8/24/2023: on the power of foundation models, yang yuan

abstract: with infinitely many highquality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? this question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. in this paper, we show that category theory provides powerful machinery to answer this question. we have proved three results. the first one limits the power of promptbased learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. the second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks for the category defined by pretext task, with fine tuning and enough resources. our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). along the way, we provide a categorical framework for supervised and selfsupervised learning, which might be of independent interest.
bio: yang yuan is now an assistant professor at iiis, tsinghua. he finished his undergraduate study at peking university in 2012. afterwards, he received his phd at cornell university in 2018, advised by professor robert kleinberg. before joining tsinghua, he spent one year at mit institute for foundations of data science (mifods) as a postdoc researcher. he works on ai healthcare, ai theory and applied category theory.
4/27/2023: understanding adam and adamw through proximal updates, scalefreeness, and relaxed smoothness, francesco orabona

abstract: adam and adamw are the most commonly used algorithms for training deep neural networks due to their remarkable performance. however, despite a massive amount of research, it is fair to say that we are still far from understanding the true reasons why they work so well. in this talk, i’ll show you some recent results on unique characteristics of adam and adamw.
first, i’ll show how adamw can be easily understood as an approximation of a proximal update on the squared l2 regularizer. next, i’ll show that, contrary to adam, adamw’s update is “scalefree”, i.e., its update is invariant to componentwise rescaling of the gradients. i’ll show how scalefreeness provides an automatic preconditioning and how it correlates with the better performance of adamw over adam on deep learning experiments. finally, i’ll show the first analysis of a (minor) variant of adam, that has a provably advantage over sgd for functions that satisfy a relaxed smoothness assumption, like the objective functions of transformers.bio: francesco orabona is an associate professor of electrical & computer engineering at boston university. his research interests lie in online learning, optimization, and statistical learning theory. he obtained his ph.d. from the university of genova in 2007. he previously was an assistant professor of computer science at stony brook university, a senior research scientist at yahoo labs, and a research assistant professor at the toyota technological institute at chicago. he received a faculty early career development (career) from nsf in 2021 and a google research award in 2017.
3/10/2023: modeling multiagent game dynamics: approaches to equilibrium computation and incentive analysis, xiaotie deng

abstract: this talk explores various research approaches to modeling the computation of equilibria and analysis of incentives in game dynamics. we discuss computational complexity, sequential and interactive optimization, and equilibrium analysis in multiagent systems.
bio: xiaotie deng is a chair professor at peking university with a ph.d. from stanford university. his research focuses on algorithmic game theory, particularly in the context of the internet and blockchain economics. deng has taught at several universities and is a fellow of the acm, ieee, and csiam. he is a foreign member of academia europaea and received the 2022 test of time award from acm sigecom.
1/12/2023: passive and active multitask representation learning, simon (shaolei) du

abstract: representation learning has been widely used in many applications. in this talk, i will present our work which uncovers when and why representation learning provably improves the sample efficiency, from a statistical learning point of view. furthermore, i will talk about how to actively select the most relevant task to boost the performance.
bio: simon shaolei du is an assistant professor in the paul g. allen school of computer science & engineering at the universityof washington. his research interests are broadly in machine learning, such as deep learning, representation learning, and reinforcement learning. prior to starting as faculty, he was a postdoc at the institute for advanced study of princeton. he completed his ph.d. in machine learning at carnegie mellon university. simon’s research has been recognized by a samsung ai researcher of the year award, an nsf career award, an nvidia pioneer award, a aaai new faculty highlights, and a distinguished dissertation award honorable mention from cmu.
12/22/2022: rewardfree reinforcement learning via sampleefficient representation learning, yingbin liang

abstract: as rewardfree reinforcement learning (rl) becomes a powerful framework for a variety of multiobjective applications, representation learning arises as an effective technique to deal with the curse of dimensionality in rewardfree rl. however, the existing algorithms of representation learning in rewardfree rl still suffers seriously from high sample complexity, although they are polynomially efficient. in this talk, i will first present a novel representation learning algorithm that we propose for rewardfree rl. we show that such an algorithm provably finds nearoptimal policy as well as attaining nearaccurate system identification via rewardfree exploration, with significantly improved sample complexity compared to the bestknown result before. i will then present our characterization of the benefit of representation learning in rewardfree multitask (a.k.a. meta) rl as well as the benefit of employing the learned representation from upstream to downstream tasks. i will conclude my talk with remarks of future directions. the work to be presented was jointly with yuan cheng (ustc), ruiquan huang (psu), dr. songtao feng (osu), prof. jing yang (psu), and prof. hong zhang (ustc).
bio: dr. yingbin liang is currently a professor at the department of electrical and computer engineering at the ohio state university (osu), and a core faculty of the ohio state translational data analytics institute (tdai). she also serves as the deputy director of the aiedge institute at osu. dr. liang received the ph.d. degree in electrical engineering from the university of illinois at urbanachampaign in 2005, and served on the faculty of university of hawaii and syracuse university before she joined osu. dr. liang’s research interests include machine learning, optimization, information theory, and statistical signal processing. dr. liang received the national science foundation career award and the state of hawaii governor innovation award in 2009. she also received eurasip best paper award in 2014.
11/04/2022: playeroptimal stable regret for bandit learning in matching markets, shuai li

abstract: the problem of matching markets has been studied for a long history in the literature due to its wide range of applications. finding a stable matching is a common equilibrium objective in this problem. since market participants are usually uncertain of their preferences, a rich line of recent works study the online setting where oneside participants (players) learn their unknown preferences from iterative interactions with the other side (arms). most previous works in this line are only able to derive theoretical guarantees for playerpessimal stable regret, which is defined compared with the players’ leastpreferred stable matching.
however, under the pessimal stable matching, players only obtain the least reward among all stable matchings. to maximize players’ profits, the playeroptimal stable matching would be the most desirable though basu et al. [2021] successfully bring an upper bound for playeroptimal stable regret, their result can be exponentially large if players’ preference gap is small. whether a polynomial guarantee for this regret exists is a significant but still open problem. in this work, we provide a new algorithm and show that the optimal stable regret of each player can be upper bounded by o(k log t / ∆^2) where k is the number of arms, t is the horizon and ∆ is the players’ minimum preference gap. this result significantly improves previous works which either has a weaker playerpessimal stable matching objective or applies only for markets with special assumptions. when the preferences of participants satisfy some special conditions, our regret upper bound also matches the previously derived lower bound this work is accepted at soda 2023.
bio: shuai li is currently an assistant professor in the john hopcroft center of shanghai jiao tong university. she received phd degree in computer science from the chinese university of hong kong, master degree in mathematics from university of the chinese academy of sciences and bachelor degree in mathematics from zhejiang university. her research interests include machine learning theory, bandit algorithms and reinforcement learning algorithms. she has published 40 papers in top machine learning conferences like icml/neurips/aaai/ijcai/kdd and serves as reviewers in these conferences. she is a recipient of shanghai sailing program 2020 and google phd fellowship 2018.
video
10/13/2022: what should a good deep neural network look like? insights from a layerpeeled model and the law of equiseparation, weijie su

abstract: in this talk, we will investigate the emergence of geometric patterns in welltrained deep learning models by making use of a layerpeeled model and the law of equiseparation. the former is a nonconvex optimization program that models the lastlayer features and weights. we use the model to shed light on the neural collapse phenomenon of papyan, han, and donoho, and to predict a hithertounknown phenomenon that we term minority collapse in imbalanced training. this is based on joint work with cong fang, hangfeng he, and qi long.
the law of equiseparation is a pervasive empirical phenomenon that describes how data are separated according to their class membership from the bottom to the top layer in a welltrained neural network. we will show that, through extensive computational experiments, neural networks improve data separation through layers in a simple exponential manner. this law leads to roughly equal ratios of separation that a single layer is able to improve, thereby showing that all layers are created equal. we will conclude the talk by discussing the implications of this law on the interpretation, robustness, and generalization of deep learning, as well as on the inadequacy of some existing approaches toward demystifying deep learning. this is based on joint work with hangfeng he.
bio: weijie su is an associate professor in the wharton statistics and data science department and, by courtesy, in the department of computer and information science, at the university of pennsylvania. he is a codirector of penn research in machine learning. prior to joining penn, he received his ph.d. from stanford university in 2016 and his bachelor’s degree from peking university in 2011. his research interests span privacypreserving data analysis, deep learning theory, optimization, highdimensional statistics, and mechanism design. he is a recipient of the stanford theodore anderson dissertation award in 2016, an nsf career award in 2019, an alfred sloan research fellowship in 2020, the siam early career prize in data science in 2022, and the ims peter gavin hall prize in 2022.
09/22/2022: on the (non)smoothness of neural network training, jingzhao zhang

abstract: in this talk, we will discuss the following question―why is neural network training nonsmooth from an optimization perspective, and how should we analyze convergence for non smooth problems. we start by showing that the nonsmoothness is essential to standard neural network training procedures, and that network training converges in an unstable manner. we then provide theoretical models for understanding why optimization in neural network is unstable, and how new definitions of convergence can reconcile theory with practice.
bio: jingzhao zhang is an assistant professor at tsinghua, iiis. he graduated from mit eecs under the supervision of prof ali jadbabaie and prof suvrit sra. his research focuses on providing theoretical justifications and analyses to practical largescale optimization algorithms. he is also interested in machine learning applications, especially those involving dynamical system formulations.
08/25/2022: local elasticity of neural networks and its inspired theory, zhun deng

abstract: in this talk, i will briefly review local elasticity of neural networks proposed by he et al. then, based on that, i will introduce a new type of stability notion, which can improve over classical stability notions with respect to generalization behavior in certain situations. specifically, among different notions of stability, uniform stability is arguably the most popular one, which yields exponential generalization bounds. however, uniform stability only considers the worstcase loss change (or socalled sensitivity) by removing a single data point, which is distributionindependent and therefore undesirable. there are many cases that the worstcase sensitivity of the loss is much larger than the average sensitivity taken over the single data point that is removed, especially in some advanced models such as random feature models or neural networks. many previous works try to mitigate the distribution independent issue by proposing weaker notions of stability, however, they either only yield polynomial bounds or the bounds derived do not vanish as sample size goes to infinity. given that, we propose locally elastic stability as a weaker and distributiondependent stability notion, which still yields exponential generalization bounds. we further demonstrate that locally elastic stability implies tighter generalization bounds than those derived based on uniform stability in many situations by revisiting the examples of bounded support vector machines, regularized least square regressions, and stochastic gradient descent.
bio: zhun is a postdoctoral researcher with and at columbia university, and also part of. previously, zhun got his ph.d. in computer science at harvard university, advised by . his research interests lie at the intersection of theoretical computer science, machine learning, and social science. his work aims to make data science more trustworthy, statistically rigorous, and aligned with societal values. here is the website: .
08/04/2022: toward understanding selfsupervised pretraining, tengyu ma

abstract: ai is undergoing a paradigm shift the rise of models that are pretrained with selfsupervised learning and then adapted to a wide range of downstream tasks. despite the unprecedented empirical success, why and how pretrained models work still largely remains a mystery. this talk will discuss recent works on analyzing contrastive learning, a family of popular selfsupervised pretraining methods that learn visual representations/embeddings of images from unlabeled data. we will develop a framework that views contrastive learning as a parametric version of spectral clustering on a socalled population positivepair graph. we will also analyze the adaptability of the representations and provide sample complexity bounds. finally, i will briefly discuss two followup works that study selfsupervised representations’ performance under imbalanced pretraining datasets and for shifting test distributions. joint works with jeff z. haochen, colin wei, kendrick shen, robbie jones, ananya kumar, sang michael xie, adrien gaidon, and percy liang.
bio: tengyu ma is an assistant professor of computer science and statistics at stanford university. he received his ph.d. from princeton university and b.e. from tsinghua university. his research interests include topics in machine learning and algorithms, such as deep learning and its theory, nonconvex optimization, deep reinforcement learning, representation learning, and highdimensional statistics. he is a recipient of the acm doctoral dissertation award honorable mention, the sloan fellowship, and nsf career award.

07/22/2022: unveiling transformers with lego, sebastien bubeck

abstract:
the discovery of the transformer architecture was a paradigm shifting event for deep learning. however, these architectures are arguably even harder to understand than say convolutional neural networks. in this work we propose a synthetic task, called lego, to probe the inner workings of transformers. we obtain some insights on multihead attention, the effect of pretraining, as well as overfitting issues. joint work with yi zhang, arturs backurs, ronen eldan, suriya gunasekar, and tal wagner.
bio: sebastien bubeck manages the machine learning foundations team in msr redmond. he has worked on multiarmed bandits, convex optimization, online algorithms, and adversarial examples, winning best papers at colt (2009 and 2016), alt (2018), and neurips (2018 and 2021). at the moment he is trying to understand transformers.