Therefore, we introduce the notion of $(\delta, \epsilon)$-stationarity, a generalization that allows a point to be within distance $\delta$ of an $\epsilon$-stationary point and that reduces to $\epsilon$-stationarity for smooth functions.

Recent work in imitation learning has shown that having an expert controller that is both suitably smooth and stable enables stronger guarantees on the performance of the learned controller.

We use the setting of spatially invariant systems as an idealization for which concrete and detailed results are given.

This results in the convergence of node representations to the top-$k$ eigenspace of the message-passing operator; (c) moreover, we show that the centering step of a normalization layer -- which can be understood as a projection -- alters the graph signal in message-passing in such a way that relevant information can become harder to extract.

Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models.

The focus of this paper is on linear system identification in the setting where it is known that the underlying partially-observed linear dynamical system lies within a finite collection of known candidate models.

Agents can share their learning experience with their peers by taking actions observable to them, with values from a finite feasible set of states.

In this work, we study this model on arbitrary networks, providing an estimator which converges to the inherent beliefs even in consensus situations.

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics.

Social pressure is a key factor affecting the evolution of opinions on networks in many types of settings, pushing people to conform to their neighbors' opinions.

Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations.

Under this notion, we then analyze algorithms that find approximate flat minima efficiently.

This is in clear contrast to the well-established assumption in folklore non-convex optimization, a.k.a.

Oversmoothing is a central challenge of building more powerful Graph Neural Networks (GNNs).

Recent approaches to data-driven MPC have used the simplest form of imitation learning known as behavior cloning to learn controllers that mimic the performance of MPC by online sampling of the trajectories of the closed-loop MPC system.

In this paper, we aim to bridge this gap by analyzing the \emph{local convergence} of general \emph{nonconvex-nonconcave} minimax problems.

We study a class of dynamical networks modeled by linear and time-invariant systems which are described by state-space realizations.

When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^2)$ of GD in the strongly convex and non-convex settings, respectively.

In this paper, we focus on this problem and propose a novel personalized Federated Learning scheme based on Optimal Transport (FedOT) as a learning algorithm that learns the optimal transport maps for transferring data points to a common distribution as well as the prediction model under the applied transport map.

In this paper, we study a linear bandit optimization problem in a federated setting where a large collection of distributed agents collaboratively learn a common linear bandit model.

Successful predictive modeling of epidemics requires an understanding of the implicit feedback control strategies which are implemented by populations to modulate the spread of contagion.

The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models.
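
A mixture of Gaussian curves can be sketched in a few lines. The parametrization below (one amplitude, center, and width per component) and the numbers used are illustrative assumptions; the source does not specify how the curves are parametrized or fitted.

```python
import math


def gaussian_mixture_curve(t, components):
    """Evaluate a sum of Gaussian bumps at time t.

    `components` is a list of (amplitude, center, width) tuples. These
    names are hypothetical and chosen only for this illustration.
    """
    return sum(
        amplitude * math.exp(-((t - center) ** 2) / (2.0 * width ** 2))
        for amplitude, center, width in components
    )


# Two hypothetical epidemic waves with peaks near day 30 and day 90.
waves = [(100.0, 30.0, 7.0), (60.0, 90.0, 10.0)]
daily_cases = [gaussian_mixture_curve(day, waves) for day in range(120)]
```

In this sketch each component contributes one wave, so adding components increases the flexibility of the function class without committing to a mechanistic model.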

Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-H\"older smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension.

We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system.

In this paper, we discuss a distributed control architecture, aimed at networks with linear and time-invariant dynamics, which is amenable to convex formulations for controller design.

This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks.

At each time step, agents broadcast their declared opinions on a social network, which are governed by the agents' inherent opinions and social pressure.

We provide a first-order oracle complexity lower bound for finding stationary points of min-max optimization problems where the objective function is smooth, nonconvex in the minimization variable, and strongly concave in the maximization variable.

We propose matrix norm inequalities that extend the Recht-R\'e (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-R\'e conjecture.

In particular, standard results on optimal convergence rates for stochastic optimization assume either that there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses.

In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament.

Non-Bayesian social learning theory provides a framework that solves this problem in an efficient manner by allowing the agents to sequentially communicate and update their beliefs for each hypothesis over the network.

In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD in important regimes.

We propose a distributed, cubic-regularized Newton method for large-scale convex optimization over networks.

Motivated by optimal transport theory, we design the zero-sum game in GAT-GMM using a random linear generator and a softmax-based quadratic discriminator architecture, which leads to a non-convex concave minimax optimization problem.

In such settings, the training data is often statistically heterogeneous and manifests various distribution shifts across users, which degrades the performance of the learnt model.

We study the oracle complexity of gradient-based methods for stochastic approximation problems.

In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds.

Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized.

Non-Bayesian social learning theory provides a framework that models distributed inference for a group of agents interacting over a social network.

Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor.

We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks.
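
For concreteness, norm-based gradient clipping can be sketched as follows. This is a minimal illustration under assumed choices (a toy quartic objective, a fixed step size, and a fixed clipping threshold); it is not the specific algorithm or setting analyzed in the source.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


def clipped_gradient_descent(grad_fn, x0, step_size, threshold, num_steps):
    """Gradient descent in which every gradient is clipped before the step."""
    x = list(x0)
    for _ in range(num_steps):
        g = clip_by_norm(grad_fn(x), threshold)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


def quartic_grad(x):
    """Gradient of the toy objective f(x) = sum(x_i^4) / 4."""
    return [xi ** 3 for xi in x]


# From x = 10 the raw gradient is 1000, so an unclipped step of size 0.1
# would jump to -90; the clipped step moves by at most 0.1 instead.
x_final = clipped_gradient_descent(
    quartic_grad, [10.0], step_size=0.1, threshold=1.0, num_steps=500
)
```

On this toy objective the gradient grows rapidly away from the minimizer, which is the kind of regime where clipping keeps a fixed step size from overshooting.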

We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity.

In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast.
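
As a rough illustration of projected gradient descent on an equality-constrained QP, the sketch below assumes a diagonal quadratic objective and a single linear equality constraint. Both are simplifying assumptions made here; the source does not state the form of the constraint, and all names are hypothetical.

```python
def project_onto_hyperplane(x, a, c):
    """Euclidean projection of x onto the set {z : a . z = c}."""
    residual = sum(ai * xi for ai, xi in zip(a, x)) - c
    a_norm_sq = sum(ai * ai for ai in a)
    return [xi - residual * ai / a_norm_sq for xi, ai in zip(x, a)]


def projected_gradient_descent(q_diag, b, a, c, step_size, num_steps):
    """Minimize 0.5 * sum(q_i * x_i^2) - b . x subject to a . x = c."""
    x = project_onto_hyperplane([0.0] * len(b), a, c)
    for _ in range(num_steps):
        grad = [qi * xi - bi for qi, xi, bi in zip(q_diag, x, b)]
        stepped = [xi - step_size * gi for xi, gi in zip(x, grad)]
        x = project_onto_hyperplane(stepped, a, c)
    return x


# Toy instance: minimize 0.5 * (x1^2 + 3 * x2^2) subject to x1 + x2 = 2.
# The Lagrange conditions give the minimizer (1.5, 0.5).
x_star = projected_gradient_descent(
    q_diag=[1.0, 3.0], b=[0.0, 0.0], a=[1.0, 1.0], c=2.0,
    step_size=1.0 / 3.0, num_steps=100,
)
```

On this instance the distance to the minimizer shrinks by a factor of 3 per iteration, a small example of the geometric (exponentially fast) convergence the sentence refers to.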

We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function.

The paper shows that communities can be detected by applying a spectral method to the covariance matrix of graph signals.

Simplicial complexes, a mathematical object common in topological data analysis, have emerged as a model for multi-nodal interactions that occur in several complex systems; for example, biological interactions occur between a set of molecules rather than just two, and communication systems can have group messages and not just person-to-person messages.

We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method.

The objective of this paper is to focus on resilient matroid-constrained problems arising in control and sensing but in the presence of sensor and actuator failures.

In this paper, we provide the first scalable algorithm that achieves the following characteristics: system-wide resiliency, i.e., the algorithm is valid for any number of denial-of-service attacks, deletions, or failures; adaptiveness, i.e., at each time step, the algorithm selects system elements based on the history of inflicted attacks, deletions, or failures; and provable approximation performance, i.e., the algorithm guarantees for monotone objective functions a solution close to the optimal.

Networks provide a powerful formalism for modeling complex systems by using a model of pairwise interactions.

Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust.

We study the error landscape of deep linear and nonlinear neural networks with the squared error loss.

We study the computations that Bayesian agents undertake when exchanging opinions over a network.

We formulate this problem as a distributed online optimization where agents communicate with each other to track the minimizer of the global loss.

While such repeated applications of Bayes' rule in networks can become computationally intractable, in this paper, we show that in the canonical cases of directed star, circle or path networks and their combinations, one can derive a class of memoryless update rules that replicate that of a single Bayesian agent but replace the self beliefs with the beliefs of the neighbors.

In each case we rely on an aggregation scheme to combine the observations of all agents; moreover, when the agents receive streams of data over time, we modify the update rules to accommodate the most recent observations.

A network of agents aims to track the minimizer of a global time-varying convex function.

In this paper, we address tracking of a time-varying parameter with unknown dynamics.

To this end, we use a notion of dynamic regret which suits the online, non-stationary nature of the problem.
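
One common notion of dynamic regret compares the learner's cumulative loss with that of the per-round minimizers. The sketch below computes it for online gradient descent tracking a slowly drifting scalar target; the quadratic losses, drift rate, and step size are assumptions made for illustration, and the source may use a different variant of the definition.

```python
def dynamic_regret(losses, decisions, minimizers):
    """Sum over rounds of f_t(x_t) - f_t(x_t_star)."""
    return sum(
        f(x) - f(x_star)
        for f, x, x_star in zip(losses, decisions, minimizers)
    )


# Hypothetical drifting target theta_t and losses f_t(x) = (x - theta_t)^2.
targets = [0.01 * t for t in range(100)]
losses = [lambda x, theta=theta: (x - theta) ** 2 for theta in targets]

# Online gradient descent: commit to x_t, then observe f_t and update.
x, step_size, decisions = 0.0, 0.25, []
for theta in targets:
    decisions.append(x)
    x -= step_size * 2.0 * (x - theta)

regret = dynamic_regret(losses, decisions, targets)
```

Because the comparator moves with the target, the regret here reflects how closely the iterates track the drifting minimizer rather than how they compare with a single fixed point.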

We analyze a model of learning and belief formation in networks in which agents follow Bayes' rule, yet they do not recall their history of past observations and cannot reason about how other agents' beliefs are formed.

Each agent might not be able to distinguish the true state based only on her private observations.

A network of agents attempts to learn some unknown state of the world drawn by nature from a finite set.

Recent literature on online learning has focused on developing adaptive algorithms that take advantage of a regularity of the sequence of observations, yet retain worst-case performance guarantees.

In contrast to the existing literature which focuses on asymptotic learning, we provide a finite-time analysis.

Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state.

When the true state is globally identifiable, and the network is connected, we prove that agents eventually learn the true parameter using a randomized gossip scheme.
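
The communication primitive behind a randomized gossip scheme can be sketched as pairwise averaging over randomly chosen edges. The code below shows only that averaging step on a small connected graph, with hypothetical values; it does not implement the belief update rule of the source.

```python
import random


def randomized_gossip(values, edges, num_rounds, seed=0):
    """Repeatedly pick a random edge and average its two endpoint values."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(num_rounds):
        i, j = rng.choice(edges)
        average = 0.5 * (x[i] + x[j])
        x[i] = x[j] = average
    return x


# Path graph on four agents (connected); the initial values have mean 2.0.
edges = [(0, 1), (1, 2), (2, 3)]
estimates = randomized_gossip([0.0, 1.0, 2.0, 5.0], edges, num_rounds=500)
```

Each exchange preserves the network-wide sum, so on a connected graph the values approach the initial average, which is why connectivity appears as a condition in the statement above.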

