**The Multifaceted Complexity of Machine Learning**

### April 12-16, 2021

**This workshop will take place online.**

**Organizers**

- Avrim Blum (Toyota Technological Institute at Chicago)
- Olgica Milenkovic (Electrical and Computer Engineering, UIUC)
- Lev Reyzin (Mathematics, UIC)
- Matus Telgarsky (Computer Science, UIUC)
- Rebecca Willett (Statistics and Computer Science, Chicago)

**Description**

Modern machine learning (ML) methods, coupled with new optimization and statistical inference strategies, have demonstrated an unprecedented potential to solve challenging problems in computer vision, natural language processing, healthcare, agriculture, and other application areas. However, foundational understanding regarding how and when certain methods are adequate to use and most effective in solving tasks of interest is still emerging. A central question at the heart of this endeavor is to understand the different facets of the complexity of machine learning tasks. These include sample complexity, computational complexity, Kolmogorov complexity, oracle complexity, memory complexity, model complexity, and the stationarity of the learning problem. This workshop will focus on developing a better understanding of these different types of complexity within machine learning, how they can be jointly leveraged to understand the solvability of learning problems, and fundamental trade-offs among them.

**Confirmed speakers**

- Jayadev Acharya (Cornell University)
- Peter Bartlett (UC Berkeley)
- Kamalika Chaudhuri (UC San Diego)
- Jelena Diakonikolas (University of Wisconsin Madison)
- Vitaly Feldman (Apple)
- Surbhi Goel (Microsoft Research NYC)
- Moritz Hardt (UC Berkeley)
- Daniel Hsu (Columbia University)
- Stefanie Jegelka (MIT)
- Adam Klivans (University of Texas at Austin)
- Andreas Krause (ETH Zurich)
- Aryeh Kontorovich (Ben-Gurion University)
- Samory Kpotufe (Columbia University)
- Po-Ling Loh (University of Wisconsin Madison)
- Andrej Risteski (Carnegie Mellon University)
- Tselil Schramm (Stanford University)
- Gregory Valiant (Stanford University)
- Rachel Ward (University of Texas at Austin)

**Monday, April 12**

All times CST.

8:50-9:00 | Introductory remarks: IMSI Director and workshop organizers |

Session chair: Lev Reyzin (University of Illinois at Chicago)

9:00-9:45 | Daniel Hsu (Columbia University): *Contrastive learning, multi-view redundancy, and linear models*. Contrastive learning is a “self-supervised” approach to representation learning that uses naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. We study contrastive learning in the context of multi-view statistical models. First, we show that whenever the views of the data are approximately redundant in their ability to predict a target function, a low-dimensional embedding obtained via contrastive learning affords a linear predictor with near-optimal predictive accuracy. Second, we show that in the context of topic models, the embedding can be interpreted as a linear transformation of the posterior moments of the hidden topic distribution given the observed words. We also empirically demonstrate that linear classifiers with these representations perform well in document classification tasks with very few labeled examples in a semi-supervised setting. This is joint work with Akshay Krishnamurthy (MSR) and Christopher Tosh (Columbia). |
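A minimal sketch of the multi-view redundancy setting, not the speakers' construction: two noisy views of a shared latent variable, and a low-dimensional linear embedding that scores similar pairs (same index) above dissimilar ones. For brevity the "encoder" is fit directly against the latent; a trained contrastive encoder would be learned from similar/dissimilar pairs alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-view data: both views observe the same latent z plus noise,
# so the views are approximately redundant for predicting functions of z.
n, d, k = 500, 10, 2
z = rng.normal(size=(n, k))
A = rng.normal(size=(k, d))
view1 = z @ A + 0.1 * rng.normal(size=(n, d))
view2 = z @ A + 0.1 * rng.normal(size=(n, d))

# Stand-in encoder: a least-squares linear map from view1 to the latent.
# (Used only to keep the sketch short; contrastive training needs no labels.)
W, *_ = np.linalg.lstsq(view1, z, rcond=None)
e1, e2 = view1 @ W, view2 @ W

# Contrastive property: similar pairs (same index) should score higher
# than dissimilar (randomly mismatched) pairs.
pos = np.mean(np.sum(e1 * e2, axis=1))
neg = np.mean(np.sum(e1 * e2[rng.permutation(n)], axis=1))
```

A linear predictor trained on `e1` would then inherit near-optimal accuracy for targets that both views can predict.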

9:45-10:30 | Discussion and Break |

10:30-11:15 | Aryeh Kontorovich (Ben-Gurion University): *Functions with average smoothness: structure, algorithms, and learning*. We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds — assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the *empirical* covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define. This is joint work with Yair Ashlagi and Lee-Ad Gottlieb. |
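A quick numerical illustration of the gap between the two complexity measures, under a simplified definition of "local slope" (the largest difference quotient at each sample point; the talk's formal definition differs): a function that is flat except for one steep jump has an enormous Lipschitz constant but a much smaller average slope.

```python
import numpy as np

# Flat function with one steep jump: the worst-case (Lipschitz) slope is
# huge, while the slope at a typical point is modest.
x = np.linspace(0.0, 1.0, 201)
f = np.where(x < 0.5, 0.0, 1.0) + 0.01 * x

# Local slope at each sample point: the largest difference quotient
# against all other sample points.
diff = np.abs(f[:, None] - f[None, :])
dist = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(dist, np.inf)
slopes = (diff / dist).max(axis=1)

lipschitz = slopes.max()     # classical worst-case complexity
avg_smooth = slopes.mean()   # average-smoothness complexity
```

Since the mean is far below the maximum, a bound driven by `avg_smooth` can be much sharper than a Lipschitz-based one.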

11:15-12:00 | Discussion and Break |

12:00-1:00 | Lunch Break |

1:00-1:45 | Peter Bartlett (University of California, Berkeley): *Benign overfitting*. Deep learning methodology has revealed some major surprises from the perspective of statistical complexity: even without any explicit effort to control model complexity, these methods find prediction rules that give a near-perfect fit to noisy training data and yet exhibit excellent prediction performance in practice. We investigate this phenomenon of ‘benign overfitting’ in the setting of linear prediction and give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization shows that overparameterization is essential: the number of directions in parameter space that are unimportant for prediction must be large compared to the sample size. We discuss implications for deep networks, for robustness to adversarial examples, and for the rich variety of possible behaviors of excess risk as a function of dimension, and we describe extensions to ridge regression and barriers to analyzing benign overfitting based on model-dependent generalization bounds. Joint work with Phil Long, Gábor Lugosi, and Alex Tsigler. |
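The phenomenon is easy to reproduce in a toy overparameterized regression, sketched below under one illustrative covariance choice (a single strong direction plus many low-variance "unimportant" directions, in the spirit of the characterization above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized regression (d >> n) with a spiked covariance: one
# strong signal direction and many low-variance junk directions.
n, d = 50, 2000
scales = np.full(d, 0.1)
scales[0] = 5.0
X = rng.normal(size=(n, d)) * scales
y = X[:, 0] + 0.5 * rng.normal(size=n)        # noisy labels

# Minimum-norm interpolating solution: w = X^T (X X^T)^{-1} y.
w_hat = X.T @ np.linalg.solve(X @ X.T, y)
train_residual = np.max(np.abs(X @ w_hat - y))  # ~0: perfect fit to noise

# ...yet good prediction on fresh data: the label noise was absorbed by
# the many unimportant directions, which barely affect test predictions.
X_test = rng.normal(size=(2000, d)) * scales
mse = np.mean((X_test @ w_hat - X_test[:, 0]) ** 2)
```

With an isotropic covariance the same estimator would predict poorly, which is exactly the point: benign overfitting depends on having many unimportant directions relative to the sample size.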

1:45-2:30 | Discussion and Break |

2:30-3:15 | Po-Ling Loh (University of Wisconsin Madison): *Robust regression with covariate filtering: Heavy tails and adversarial contamination*. We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step. |
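The two-step recipe (filter covariates, then run a classical robust estimator) can be sketched as follows. The norm-based filter and the contamination model here are deliberately crude stand-ins for the paper's filtering algorithm; Huber regression is solved by standard iteratively reweighted least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean linear data, then adversarially contaminate a fraction of rows
# in both the covariates and the responses.
n, d = 300, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)
X[:15] = 100 * rng.normal(size=(15, d))
y[:15] = 100 * rng.normal(size=15)

# Step 1 (simplified filtering): drop rows whose covariate norm is far
# above the median.
norms = np.linalg.norm(X, axis=1)
keep = norms < 5 * np.median(norms)

# Step 2: classical Huber regression on the surviving rows, solved via
# iteratively reweighted least squares.
def huber_irls(X, y, delta=1.0, iters=50):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        r = y - X @ w
        wt = np.where(np.abs(r) <= delta, 1.0,
                      delta / np.maximum(np.abs(r), 1e-12))  # Huber weights
        Xw = X * wt[:, None]
        w = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return w

w_hat = huber_irls(X[keep], y[keep])
```

Huber regression alone handles response outliers; the filtering step is what protects against corrupted covariates, which can otherwise exert unbounded leverage.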

3:15-4:00 | Discussion and Break |

**Tuesday, April 13**

Session chair: Avrim Blum (Toyota Technological Institute at Chicago)

9:00-9:45 | Rachel Ward (University of Texas at Austin): *Sparse Random Features for Function Approximation*. Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This talk introduces the sparse random feature method that learns parsimonious random feature models utilizing techniques from compressive sensing. We provide uniform bounds on the approximation error for functions in a reproducing kernel Hilbert space depending on the number of samples and the distribution of features. The error bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. We show that the sparse random feature method outperforms shallow networks for well-structured functions and applications to scientific machine learning tasks. |
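One illustrative instance of the idea, not the paper's algorithm: random Fourier features combined with l1-regularized recovery (ISTA) instead of a dense least-squares fit, so that far fewer samples than features suffice when the target depends on only a few features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier features; the target depends on only 5 of the N features
# (coordinate-sparse coefficients), and we have fewer samples than features.
n, d, N = 100, 2, 400
omega = rng.normal(size=(N, d))
b = rng.uniform(0, 2 * np.pi, size=N)
feats = lambda X: np.sqrt(2.0 / N) * np.cos(X @ omega.T + b)

X = rng.uniform(-1, 1, size=(n, d))
c_true = np.zeros(N)
c_true[:5] = 3.0 * rng.normal(size=5)
y = feats(X) @ c_true

# Sparse recovery by ISTA (iterative soft-thresholding) for l1-regularized
# least squares -- a standard compressive-sensing solver.
A = feats(X)
step = 1.0 / np.linalg.norm(A, 2) ** 2
lam = 0.001
c = np.zeros(N)
for _ in range(3000):
    c -= step * (A.T @ (A @ c - y))
    c = np.sign(c) * np.maximum(np.abs(c) - step * lam, 0.0)

# Generalization from n=100 samples despite N=400 features.
X_test = rng.uniform(-1, 1, size=(500, d))
err = np.mean((feats(X_test) @ c - feats(X_test) @ c_true) ** 2)
signal = np.mean((feats(X_test) @ c_true) ** 2)
```

A dense least-squares fit in this underdetermined regime would interpolate the samples but generalize poorly; the sparsity prior is what rescues the data-scarce setting.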

9:45-10:30 | Discussion and Break |

10:30-11:15 | Gregory Valiant (Stanford University): *New Problems and Perspectives on Sampling, Memory, and Learning*. In the first portion of the talk, I will introduce the problem of data “amplification”. Given n independent draws from a distribution, D, to what extent is it possible to output a set of m > n datapoints that are indistinguishable from m i.i.d. draws from D? Curiously, we show that nontrivial amplification is often possible in the regime where n is too small to learn D to any nontrivial accuracy. Next, I will discuss a new framework for modeling learning and estimation in the many settings where the data is non-iid (e.g. due to non-stationarity, correlated sampling, etc.). Surprisingly, this framework provides some new (and improved!) estimators for some basic classical settings. Finally (time permitting), I will discuss linear regression and convex optimization under memory constraints. The emphasis throughout will be on the many open directions for future research provided by these new perspectives and frameworks. |

11:15-12:00 | Discussion and Break |

12:00-1:00 | Lunch Break |

1:00-1:45 | Adam Klivans (University of Texas at Austin): *Computational/Statistical Gaps for Learning Neural Networks*. It has been known for decades that a polynomial-size training sample suffices for learning neural networks. Most theoretical results, however, indicate that these learning tasks are computationally intractable. Where exactly is the dividing line? In this talk we consider arguably the most well-studied scenario, namely when the marginal distribution on inputs is Gaussian, and show unconditionally that gradient descent cannot learn even one-layer neural networks. We then point to a potential way forward and describe the first fixed-parameter tractable algorithm for learning deep ReLU networks. Its running time is a fixed polynomial in the ambient dimension and exponential in only the network’s parameters. This is the first nontrivial positive result for networks of depth more than two. |

1:45-2:30 | Discussion and Break |

2:30-3:15 | Moritz Hardt (University of California, Berkeley): *Performative Prediction*. When predictive models support decisions they can influence the outcome they aim to predict. We call such predictions performative; the prediction influences the target. Performativity is a well-studied phenomenon in policy-making that has so far been neglected in supervised learning. When ignored, performativity surfaces as undesirable distribution shift, routinely addressed with retraining. In this talk, I will describe a risk minimization framework for performative prediction bringing together concepts from statistics, game theory, and causality. A new element is an equilibrium notion called performative stability. Performative stability implies that the predictions are calibrated not against past outcomes, but against the future outcomes that manifest from acting on the prediction. I will then discuss recent results on performative prediction including necessary and sufficient conditions for the convergence of retraining to a performatively stable point of nearly minimal loss. Joint work with Juan C. Perdomo, Tijana Zrnic, and Celestine Mendler-Dünner. |
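A toy version of repeated retraining converging to a performatively stable point. The distribution map here is a hypothetical one-parameter example (a Gaussian whose mean shifts by `eps * theta` when `theta` is deployed); when the induced shift is a contraction, retraining converges to the fixed point `theta* = mu0 / (1 - eps)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deploying parameter theta shifts the data distribution D(theta):
# here a Gaussian with mean mu0 + eps * theta.
mu0, eps = 1.0, 0.5   # eps < 1: the shift is a contraction, so retraining converges

def retrain(theta, n=100_000):
    """Risk minimizer on data drawn from the distribution induced by theta."""
    data = rng.normal(loc=mu0 + eps * theta, scale=1.0, size=n)
    return data.mean()   # squared-loss minimizer = sample mean

theta = 0.0
for _ in range(30):
    theta = retrain(theta)   # repeated risk minimization

theta_stable = mu0 / (1 - eps)   # performatively stable fixed point
```

Note that the stable point (2.0) differs from the naive optimum ignoring performativity (1.0): the deployed prediction is calibrated against the outcomes it itself induces.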

3:15-4:00 | Discussion and Break |

**Wednesday, April 14**

Session chair: Olgica Milenkovic (University of Illinois at Urbana-Champaign)

9:00-9:45 | Jayadev Acharya (Cornell University): *Learning from data under information constraints: Fundamental limits and applications* |

9:45-10:30 | Discussion and Break |

10:30-11:15 | Jelena Diakonikolas (University of Wisconsin Madison): *On Min-Max Optimization and Halpern Iteration*. Min-max optimization constitutes a fundamental class of problems with a range of applications in areas such as game theory, engineering, finance, and training of adversarial models in machine learning. In this talk, I will provide an overview of smooth convex-concave min-max optimization. I will then discuss connections between smooth min-max optimization and the problem of finding a fixed point of a nonexpansive (1-Lipschitz) map. This connection allows us to use variants of Halpern iteration – a classical method for finding fixed points of nonexpansive maps – to solve broad classes of min-max optimization problems. The resulting algorithms attain near-optimal convergence rates for different classes of smooth min-max problems, are parameter-free, and lead to guarantees both in terms of approximating the associated optimality gap and the appropriate notion of stationarity (in unconstrained optimization, an approximate stationary point is a point with small gradient). These are the first results that can simultaneously achieve all of the aforementioned guarantees and, up to logarithmic factors, completely resolve the complexity of smooth min-max optimization. |
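Halpern iteration itself is a few lines: anchor each step to the starting point with a vanishing weight, x_{k+1} = λ_k x_0 + (1 - λ_k) T(x_k). The sketch below uses a planar rotation as the nonexpansive map T, a standard example where plain fixed-point iteration x ← T(x) circles forever (rotations preserve norms) but Halpern iteration converges to the fixed point at the origin.

```python
import numpy as np

theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
T = lambda x: R @ x          # nonexpansive (an isometry); only fixed point is 0

x0 = np.array([1.0, 0.0])
x = x0.copy()
for k in range(5000):
    lam = 1.0 / (k + 2)      # classical Halpern anchoring weights
    x = lam * x0 + (1.0 - lam) * T(x)

plain = x0.copy()
for _ in range(5000):
    plain = T(plain)         # plain iteration: stays on the unit circle
```

The anchoring to `x0` is what breaks the cycling; for nonexpansive maps it yields the O(1/k) residual rates exploited in the talk.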

11:15-12:00 | Discussion and Break |

12:00-1:00 | Lunch Break |

1:00-1:45 | Vitaly Feldman (Apple AI Research): *Chasing the Long Tail: What Neural Networks Memorize and Why*. Deep learning algorithms that achieve state-of-the-art results on image and text recognition tasks tend to fit the entire training dataset (nearly) perfectly, including mislabeled examples and outliers. This propensity to memorize seemingly useless data and the resulting large generalization gap have puzzled many practitioners and are not explained by existing theories of machine learning. We provide a simple conceptual explanation and a theoretical model demonstrating that memorization of outliers and mislabeled examples is necessary for achieving close-to-optimal generalization error when learning from long-tailed data distributions. Image and text data are known to follow such distributions and therefore our results establish a formal link between these empirical phenomena. We then demonstrate the utility of memorization and support our explanation empirically. These results rely on a new technique for efficiently estimating memorization and influence of training data points. Based on a joint work with Chiyuan Zhang. |
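A simplified subsampling-style proxy for the memorization estimate, not the paper's estimator: how much does including example i in training change the model's probability of predicting i's own (possibly noisy) label? A 1-nearest-neighbor "model" on toy 1-D data with one mislabeled outlier makes the effect stark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated clusters, with one point deliberately mislabeled.
X = np.concatenate([rng.normal(-2, 0.3, 20), rng.normal(2, 0.3, 20)])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1                                  # the mislabeled outlier

def predict_1nn(train_idx, i):
    others = np.array([j for j in train_idx if j != i])
    nearest = others[np.argmin(np.abs(X[others] - X[i]))]
    return y[nearest]

def memorization(i, trials=300, keep=0.7):
    """P(model predicts y[i] | i in training) - P(... | i held out)."""
    hit_in, hit_out = [], []
    for _ in range(trials):
        sub = np.flatnonzero(rng.random(len(X)) < keep)
        if i in sub:
            hit_in.append(1.0)            # 1-NN with i in training recalls y[i] exactly
        else:
            hit_out.append(float(predict_1nn(sub, i) == y[i]))
    return np.mean(hit_in) - np.mean(hit_out)

mem_outlier = memorization(0)    # mislabeled point: heavily memorized
mem_typical = memorization(25)   # typical clean point: barely memorized
```

The mislabeled point's label is only predicted when the point itself is in the training set, which is exactly the memorization signature the talk quantifies at scale.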

**Thursday, April 15**

Session chair: Matus Telgarsky (University of Illinois at Urbana-Champaign)

9:00-9:45 | Stefanie Jegelka (MIT): *Learning in Graph Neural Networks*. Graph Neural Networks (GNNs) have become a popular tool for learning representations of graph-structured inputs, with applications in computational chemistry, recommendation, pharmacy, reasoning, and many other areas. In this talk, I will show some recent results on learning with message-passing GNNs. In particular, GNNs possess important invariances and inductive biases that affect learning and generalization. We relate these properties and the choice of the “aggregation function” to predictions within and outside the training distribution. This talk is based on joint work with Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, Vikas Garg and Tommi Jaakkola. |

9:45-10:30 | Discussion and Break |

10:30-11:15 | Andrej Risteski (Carnegie Mellon University): *Sampling Beyond Log-concavity*. Many tasks involving generative models require sampling from distributions parametrized as p(x) = e^{-f(x)}/Z, where Z is the normalizing constant, for some function f whose values and gradients we can query. This mode of access to f is natural — for instance, when sampling from posteriors in latent-variable models. Classical results show that a natural random walk, Langevin diffusion, mixes rapidly when f is convex. Unfortunately, even in simple examples, the applications listed above will entail working with functions f that are nonconvex. Paralleling the move away from convexity in optimization, we initiate a study of instances of relevance in practice where Langevin diffusion (combined with other tools) can provably be shown to mix rapidly: distributions p that are multimodal, as well as distributions p that have a natural manifold structure on their level sets. |
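The discretized random walk in question, the unadjusted Langevin algorithm x ← x - η ∇f(x) + √(2η) ξ, is a few lines of code. The sketch below runs it on a simple nonconvex example, a symmetric double well f(x) = (x² - 1)², whose density is bimodal with modes near ±1; with this mild barrier the chain still crosses between modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unadjusted Langevin algorithm for p(x) ∝ exp(-f(x)), using only
# gradient queries of f(x) = (x^2 - 1)^2 (nonconvex, bimodal density).
grad_f = lambda x: 4.0 * x * (x ** 2 - 1.0)

step, n_steps = 0.01, 200_000
noise = rng.normal(size=n_steps)
scale = np.sqrt(2.0 * step)
xs = np.empty(n_steps)
x = 0.0
for t in range(n_steps):
    x = x - step * grad_f(x) + scale * noise[t]
    xs[t] = x
samples = xs[n_steps // 2:]      # discard burn-in

# The chain should visit both modes rather than getting stuck in one well.
frac_right = (samples > 0.5).mean()
frac_left = (samples < -0.5).mean()
```

For deeper wells the barrier-crossing time blows up exponentially, which is exactly why provable mixing beyond log-concavity requires the extra tools the talk describes.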

11:15-12:00 | Discussion and Break |

12:00-1:00 | Lunch Break |

1:00-1:45 | Tselil Schramm (Stanford University): *Statistical Query Algorithms and Low-Degree Tests Are Almost Equivalent*. In many high-dimensional statistics problems, we observe information-computation tradeoffs: given access to more data, statistical estimation and inference tasks require fewer computational resources. Though this phenomenon is ubiquitous, we lack rigorous evidence that it is inherent. In the current day, to prove that a statistical estimation task is computationally intractable, researchers must prove lower bounds against each type of algorithm, one by one, resulting in a “proliferation of lower bounds”. We dream of a more general theory which explains computational intractability in an algorithm-independent way. In this talk, I will make one small step towards realizing this dream. I will demonstrate general conditions under which two popular frameworks yield the same information-computation tradeoffs for high-dimensional hypothesis testing: the first being statistical queries in the “statistical dimension” framework, and the second being hypothesis testing with low-degree hypothesis tests. Our equivalence theorems capture numerous well-studied high-dimensional learning problems: sparse PCA, tensor PCA, planted clique, and more. Based on joint work with Matthew Brennan, Guy Bresler, Samuel B. Hopkins and Jerry Li. |

1:45-2:30 | Discussion and Break |

2:30-3:15 | Surbhi Goel (Microsoft Research NYC): *Computational Complexity of Learning ReLUs*. A major challenge in the theory of deep learning is to understand the computational complexity of learning basic families of neural networks (NNs). In fact, learning even a single Rectified Linear Unit (ReLU), which is one of the most commonly used activation functions in deep learning, can be challenging. The challenge here arises from the non-convexity of the associated optimization problem. In this talk, I will give an overview of hardness results for the problem of learning ReLUs with noisy labels under distribution independent as well as distribution dependent settings. I will subsequently describe positive results that nearly match these lower bounds. As part of the positive results, I will also describe a simple gradient-based algorithm which approximately learns a ReLU in polynomial time for any well-behaved distribution. |
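A sketch of the benign case, not the talk's specific algorithm: gradient descent on the squared loss for a single realizable ReLU under a Gaussian marginal, with a moment-based initialization (the empirical mean of y·x, which for a standard Gaussian points along the true weight vector). Despite the nonconvexity, the iterates fit the labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single ReLU y = max(0, w*.x) with a Gaussian input marginal (the
# "well-behaved distribution" regime), realizable labels, no noise.
relu = lambda z: np.maximum(z, 0.0)
n, d = 5000, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
y = relu(X @ w_star)

# Moment initialization: E[y x] is positively correlated with w_star,
# avoiding the flat region where no neuron activates.
w = X.T @ y / n
lr = 0.5
for _ in range(500):
    z = X @ w
    w -= lr * (X.T @ ((relu(z) - y) * (z > 0.0))) / n  # grad of mean squared loss

mse = np.mean((relu(X @ w) - y) ** 2)
```

The distribution-independent and noisy variants of this same problem are where the hardness results in the talk kick in.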

3:15-4:00 | Discussion and Break |

**Friday, April 16**

Session chair: Rebecca Willett (University of Chicago)

9:00-9:45 | Samory Kpotufe (Columbia University): *Some Recent Insights on Transfer-Learning*. A common situation in Machine Learning is one where training data is not fully representative of a target population due to bias in the sampling mechanism or due to prohibitive target sampling costs. In such situations, we aim to ‘transfer’ relevant information from the training data (a.k.a. source data) to the target application. How much information is in the source data about the target application? Would some amount of target data improve transfer? These are all practical questions that depend crucially on ‘how far’ the source domain is from the target. However, how to properly measure ‘distance’ between source and target domains remains largely unclear. In this talk we will argue that much of the traditional notions of ‘distance’ (e.g. KL-divergence, extensions of TV such as D_A discrepancy, density-ratios, Wasserstein distance) can yield an over-pessimistic picture of transferability. Instead, we show that some new notions of ‘relative dimension’ between source and target (which we simply term ‘transfer-exponents’) capture a continuum from easy to hard transfer. Transfer-exponents uncover a rich set of situations where transfer is possible even at fast rates; they encode relative benefits of source and target samples, and have interesting implications for related problems such as ‘multi-task or multi-source learning’. In particular, in the case of transfer from multiple sources, we will discuss (if time permits) a strong dichotomy between minimax and adaptive rates: no adaptive procedure exists that can achieve the same rates as minimax (oracle) procedures. |

9:45-10:30 | Discussion and Break |

10:30-11:15 | Andreas Krause (ETH Zurich): *Information-directed Exploration in Bandits and Reinforcement Learning*. The exploration-exploitation dilemma is a central challenge when making decisions under uncertainty. Most common approaches explore by favouring actions with uncertain outcomes. However, aleatoric uncertainty in the outcomes is different from epistemic uncertainty in the estimation task, thus the resulting observations may not necessarily be informative. In this talk, I will present approaches towards efficient information-directed exploration in stochastic multi-armed bandits, Bayesian optimization, reinforcement learning and a rich family of sequential decision problems called partial monitoring. These approaches use information measures for guiding exploration, and their submodularity allows us to establish sublinear regret even in non-parametric settings. I will present the theoretical background, as well as empirical demonstrations on deep reinforcement learning tasks. |

11:15-12:00 | Discussion and Break |

12:00-1:00 | Lunch Break |

1:00-1:45 | Kamalika Chaudhuri (University of California, San Diego): *The Mysteries of Adversarial Robustness for Non-parametric Methods and Neural Networks*. Adversarial examples are small imperceptible perturbations to legitimate test inputs that cause machine learning classifiers to misclassify. While recent work has proposed many attacks and defenses, why exactly they arise still remains a mystery. In this talk, we’ll take a closer look at this question. We will look at non-parametric methods, and define a large sample limit for adversarially robust classification that is analogous to the Bayes optimal. We will then show that adversarial robustness in non-parametric methods is mostly a consequence of the training method. If time permits, we will then look at what these findings mean for neural networks. |