Search Results for author: Taiji Suzuki

Found 96 papers, 8 papers with code

Mechanistic Design and Scaling of Hybrid Architectures

no code implementations26 Mar 2024 Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli

The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation.

Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective

no code implementations22 Mar 2024 Shokichi Takakura, Taiji Suzuki

In this paper, we study the feature learning ability of two-layer neural networks in the mean-field regime through the lens of kernel methods.

How do Transformers perform In-Context Autoregressive Learning?

no code implementations8 Feb 2024 Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré

More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens.

Language Modelling
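
As a rough illustration of the claim above (not the paper's construction with commuting orthogonal matrices and augmented tokens), the following hedged NumPy sketch shows the generic mechanism: a single linear-attention readout can reproduce the prediction of one gradient-descent step on an inner least-squares objective over the context.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 32, 0.1

# In-context linear regression data: y_i = <x_i, w_star> + noise.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.01 * rng.normal(size=n)
x_query = rng.normal(size=d)

# One explicit gradient-descent step from w = 0 on the inner objective
# L(w) = 0.5 * sum_i (<x_i, w> - y_i)^2 gives w_1 = eta * X.T @ y.
w_gd = eta * X.T @ y
pred_gd = x_query @ w_gd

# A single linear self-attention head (no softmax) over augmented tokens
# (x_i, y_i): with identity query/key maps and a value map reading y_i,
# its output at the query token is eta * sum_i y_i * <x_i, x_query>.
attn_scores = X @ x_query
pred_attn = eta * attn_scores @ y

assert np.allclose(pred_gd, pred_attn)  # the two predictions coincide
```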

Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape

no code implementations2 Feb 2024 Juno Kim, Taiji Suzuki

However, existing theoretical studies on how this phenomenon arises are limited to the dynamics of a single layer of attention trained on linear regression tasks.

In-Context Learning

Symmetric Mean-field Langevin Dynamics for Distributional Minimax Problems

no code implementations2 Dec 2023 Juno Kim, Kakei Yamamoto, Kazusato Oko, Zhuoran Yang, Taiji Suzuki

In this paper, we extend mean-field Langevin dynamics to minimax optimization over probability distributions for the first time with symmetric and provably convergent updates.

Scalable Federated Learning for Clients with Different Input Image Sizes and Numbers of Output Categories

no code implementations15 Nov 2023 Shuhei Nitta, Taiji Suzuki, Albert Rodríguez Mulet, Atsushi Yaguchi, Ryusuke Hirai

In this paper, we propose an effective federated learning method named ScalableFL, where the depths and widths of the local models for each client are adjusted according to each client's input image size and number of output categories.

Federated Learning Image Classification +3

Learning Green's Function Efficiently Using Low-Rank Approximations

1 code implementation1 Aug 2023 Kishan Wimalawarne, Taiji Suzuki, Sophie Langer

Learning the Green's function using deep learning models makes it possible to solve different classes of partial differential equations.
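
The paper's estimators are deep models; as a small, hedged aside (not code from the paper or the linked repository), the NumPy snippet below illustrates the structural premise in the title: the Green's function of a simple 1D operator, $-u''$ on $[0,1]$ with zero boundary values, is captured accurately by a low-rank factorization.

```python
import numpy as np

# Discretize the Green's function of -u'' on [0, 1] with zero boundary values:
# G(x, y) = min(x, y) * (1 - max(x, y)).
x = np.linspace(0.0, 1.0, 200)
G = np.minimum.outer(x, x) * (1.0 - np.maximum.outer(x, x))

# Truncated SVD as the low-rank surrogate.
U, s, Vt = np.linalg.svd(G)
r = 5
G_lowrank = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Relative error drops quickly with the rank r.
print(np.linalg.norm(G - G_lowrank) / np.linalg.norm(G))
```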

Graph Neural Networks Provably Benefit from Structural Information: A Feature Learning Perspective

no code implementations24 Jun 2023 Wei Huang, Yuan Cao, Haonan Wang, Xin Cao, Taiji Suzuki

Graph neural networks (GNNs) have pioneered advancements in graph representation learning, exhibiting superior feature learning and performance over multilayer perceptrons (MLPs) when handling graph inputs.

Graph Representation Learning Learning Theory +1

Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction

no code implementations12 Jun 2023 Taiji Suzuki, Denny Wu, Atsushi Nitanda

Despite the generality of our results, we achieve an improved convergence rate in both the SGD and SVRG settings when specialized to the standard Langevin dynamics.

Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input

no code implementations30 May 2023 Shokichi Takakura, Taiji Suzuki

Despite the great success of Transformer networks in various applications such as natural language processing and computer vision, their theoretical aspects are not well understood.

Tight and fast generalization error bound of graph embedding in metric space

no code implementations13 May 2023 Atsushi Suzuki, Atsushi Nitanda, Taiji Suzuki, Jing Wang, Feng Tian, Kenji Yamanishi

However, recent theoretical analyses have shown a much higher upper bound on the generalization error of non-Euclidean graph embeddings than that of Euclidean ones, where a high generalization error indicates that incompleteness and noise in the data can significantly damage learning performance.

Graph Embedding

Primal and Dual Analysis of Entropic Fictitious Play for Finite-sum Problems

no code implementations6 Mar 2023 Atsushi Nitanda, Kazusato Oko, Denny Wu, Nobuhito Takenouchi, Taiji Suzuki

The entropic fictitious play (EFP) is a recently proposed algorithm that minimizes the sum of a convex functional and entropy in the space of measures -- such an objective naturally arises in the optimization of a two-layer neural network in the mean-field regime.

Image Generation

Diffusion Models are Minimax Optimal Distribution Estimators

no code implementations3 Mar 2023 Kazusato Oko, Shunta Akiyama, Taiji Suzuki

While efficient distribution learning is no doubt behind the groundbreaking success of diffusion modeling, its theoretical guarantees are quite limited.

Koopman-based generalization bound: New aspect for full-rank weights

no code implementations12 Feb 2023 Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Atsushi Nitanda, Taiji Suzuki

Our bound is tighter than existing norm-based bounds when the condition numbers of weight matrices are small.

DIFF2: Differential Private Optimization via Gradient Differences for Nonconvex Distributed Learning

no code implementations8 Feb 2023 Tomoya Murata, Taiji Suzuki

In previous work, the best known utility bound is $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ in terms of the squared full gradient norm, which is achieved by Differentially Private Gradient Descent (DP-GD) as an instance, where $n$ is the sample size, $d$ is the problem dimensionality, and $\varepsilon_\mathrm{DP}$ is the differential privacy parameter.
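
For orientation only, here is a hedged sketch of the generic clip-and-perturb step underlying DP-GD (the baseline mentioned above), not the proposed DIFF2 algorithm; the noise scale `noise_std` is left abstract because its calibration to $(\varepsilon_\mathrm{DP}, \delta)$ is precisely what the utility bound quantifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_gd_step(w, per_sample_grads, lr, clip_norm, noise_std):
    """Schematic differentially private gradient step: clip each per-sample
    gradient to `clip_norm`, average, and add Gaussian noise whose standard
    deviation `noise_std` is calibrated offline to the privacy budget."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_mean = clipped.mean(axis=0) + noise_std * rng.normal(size=w.shape)
    return w - lr * noisy_mean
```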

Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs

no code implementations12 Sep 2022 Kishan Wimalawarne, Taiji Suzuki

Additionally, we propose adaptive learning between graph polynomial convolution models and learning directly from the adjacency matrix.

Generalization Bounds Node Classification

Versatile Single-Loop Method for Gradient Estimator: First and Second Order Optimality, and its Application to Federated Learning

no code implementations1 Sep 2022 Kazusato Oko, Shunta Akiyama, Tomoya Murata, Taiji Suzuki

While variance reduction methods have shown great success in solving large-scale optimization problems, many of them suffer from accumulated errors and therefore periodically require full gradient computation.

Federated Learning

Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods

no code implementations30 May 2022 Shunta Akiyama, Taiji Suzuki

While deep learning has outperformed other methods for various tasks, theoretical frameworks that explain its reason have not been fully established.

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

no code implementations3 May 2022 Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang

We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$.
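
The setup above translates almost line-by-line into NumPy; the sketch below (with a placeholder tanh activation, synthetic targets, and an arbitrary step size $\eta$) performs exactly one gradient step on the first-layer weights $\boldsymbol{W}$ under the empirical MSE loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N, eta = 200, 20, 100, 1.0

# f(x) = (1 / sqrt(N)) * a^T sigma(W^T x) with randomly initialized W, a.
W = rng.normal(size=(d, N)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=N)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                      # synthetic placeholder targets

def f(X, W):
    return (np.tanh(X @ W) @ a) / np.sqrt(N)

def grad_W(X, y, W):
    """Gradient of the empirical MSE (1/n) * sum_i (f(x_i) - y_i)^2 w.r.t. W."""
    Z = X @ W                                # pre-activations, shape (n, N)
    r = f(X, W) - y                          # residuals, shape (n,)
    # tanh'(z) = 1 - tanh(z)^2
    return 2.0 / n * X.T @ (r[:, None] * (1.0 - np.tanh(Z) ** 2) * a / np.sqrt(N))

# The single (possibly large) first-layer gradient step described above.
W1 = W - eta * grad_W(X, y, W)
```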

Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization

1 code implementation30 Mar 2022 Yuri Kinoshita, Taiji Suzuki

Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving sampling problems and the non-convex optimization problems that appear in several machine learning applications.
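
For context, the plain (non-variance-reduced) SGLD update that the paper builds on can be written in a few lines; this is a generic sketch, with `stoch_grad` standing for any minibatch gradient estimate and `beta` the inverse temperature.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, stoch_grad, step_size, beta=1.0):
    """One plain SGLD update: a gradient step on the minibatch estimate plus
    injected Gaussian noise of scale sqrt(2 * step_size / beta)."""
    noise = rng.normal(size=theta.shape)
    return theta - step_size * stoch_grad + np.sqrt(2.0 * step_size / beta) * noise
```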

Convergence Error Analysis of Reflected Gradient Langevin Dynamics for Globally Optimizing Non-Convex Constrained Problems

no code implementations19 Mar 2022 Kanji Sato, Akiko Takeda, Reiichiro Kawai, Taiji Suzuki

Gradient Langevin dynamics and a variety of its variants have attracted increasing attention owing to their convergence towards the global optimal solution, initially in the unconstrained convex setting and, more recently, even for non-convex problems with convex constraints.

Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning

no code implementations12 Feb 2022 Tomoya Murata, Taiji Suzuki

In recent centralized nonconvex distributed learning and federated learning, local methods are one of the promising approaches to reduce communication time.

Distributed Optimization Federated Learning

Convex Analysis of the Mean Field Langevin Dynamics

no code implementations25 Jan 2022 Atsushi Nitanda, Denny Wu, Taiji Suzuki

In this work, we give a concise and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous and discrete time settings.

A Scaling Law for Syn-to-Real Transfer: How Much Is Your Pre-training Effective?

no code implementations29 Sep 2021 Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, Kohei Hayashi

Synthetic-to-real transfer learning is a framework in which a synthetically generated dataset is used to pre-train a model to improve its performance on real vision tasks.

Image Generation Transfer Learning

Learnability of convolutional neural networks for infinite dimensional input via mixed and anisotropic smoothness

no code implementations ICLR 2022 Sho Okumoto, Taiji Suzuki

Although the approximation and estimation errors of neural networks are affected by the curse of dimensionality in the existing analyses for typical function spaces such as the Hölder and Besov spaces, we show that, by considering anisotropic smoothness, these errors can avoid an exponential dependency on the dimensionality and instead depend only on the smoothness of the target functions.

speech-recognition Speech Recognition

Particle Stochastic Dual Coordinate Ascent: Exponential convergent algorithm for mean field neural network optimization

no code implementations ICLR 2022 Kazusato Oko, Taiji Suzuki, Atsushi Nitanda, Denny Wu

We introduce Particle-SDCA, a gradient-based optimization algorithm for two-layer neural networks in the mean field regime that achieves exponential convergence rate in regularized empirical risk minimization.

Takeuchi's Information Criteria as Generalization Measures for DNNs Close to NTK Regime

no code implementations29 Sep 2021 Hiroki Naganuma, Taiji Suzuki, Rio Yokota, Masahiro Nomura, Kohta Ishikawa, Ikuro Sato

Generalization measures are intensively studied in the machine learning community for better modeling of generalization gaps.

Hyperparameter Optimization

A Scaling Law for Synthetic-to-Real Transfer: How Much Is Your Pre-training Effective?

1 code implementation25 Aug 2021 Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, Kohei Hayashi

Synthetic-to-real transfer learning is a framework in which a synthetically generated dataset is used to pre-train a model to improve its performance on real vision tasks.

Image Generation Transfer Learning

AutoLL: Automatic Linear Layout of Graphs based on Deep Neural Network

no code implementations5 Aug 2021 Chihiro Watanabe, Taiji Suzuki

However, it is limited to two-mode reordering (i.e., the rows and columns are reordered separately) and cannot be applied in the one-mode setting (i.e., the same node order is used for reordering both rows and columns), owing to the characteristics of its model architecture.

On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting

no code implementations11 Jun 2021 Shunta Akiyama, Taiji Suzuki

Deep learning empirically achieves high performance in many applications, but its training dynamics has not been fully understood theoretically.

Particle Dual Averaging: Optimization of Mean Field Neural Network with Global Convergence Rate Analysis

no code implementations NeurIPS 2021 Atsushi Nitanda, Denny Wu, Taiji Suzuki

An important application of the proposed method is the optimization of neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but a quantitative convergence rate can be challenging to obtain.

Deep Two-Way Matrix Reordering for Relational Data Analysis

no code implementations26 Mar 2021 Chihiro Watanabe, Taiji Suzuki

This denoised mean matrix can be used to visualize the global structure of the reordered observed matrix.

Vocal Bursts Valence Prediction

A Goodness-of-fit Test on the Number of Biclusters in a Relational Data Matrix

no code implementations23 Feb 2021 Chihiro Watanabe, Taiji Suzuki

Biclustering is a method for detecting homogeneous submatrices in a given observed matrix, and it is an effective tool for relational data analysis.

Clustering

Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning

no code implementations5 Feb 2021 Tomoya Murata, Taiji Suzuki

Recently, local SGD has received much attention and has been extensively studied in the distributed learning community to overcome the communication bottleneck problem.

Distributed Optimization Federated Learning

Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis

no code implementations NeurIPS 2021 Atsushi Nitanda, Denny Wu, Taiji Suzuki

An important application of the proposed method is the optimization of neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but a quantitative convergence rate can be challenging to obtain.

Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods

no code implementations ICLR 2021 Taiji Suzuki, Shunta Akiyama

Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature.

Estimation error analysis of deep learning on the regression problem on the variable exponent Besov space

no code implementations23 Sep 2020 Kazuma Tsuji, Taiji Suzuki

In this study, we focus on the adaptivity of deep learning; consequently, we treat the variable exponent Besov space, which has a different smoothness depending on the input location $x$.

speech-recognition Speech Recognition

MSR-DARTS: Minimum Stable Rank of Differentiable Architecture Search

no code implementations19 Sep 2020 Kengo Machida, Kuniaki Uto, Koichi Shinoda, Taiji Suzuki

To overcome this problem, we propose a method called minimum stable rank DARTS (MSR-DARTS) for finding a model with the best generalization error by replacing architecture optimization with a selection process based on the minimum stable rank criterion.

Neural Architecture Search

Quantitative Understanding of VAE as a Non-linearly Scaled Isometric Embedding

no code implementations30 Jul 2020 Akira Nakagawa, Keizo Kato, Taiji Suzuki

According to rate-distortion theory, optimal transform coding is achieved by using an orthonormal transform with a PCA basis, where the transform space is isometric to the input.

Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics

no code implementations NeurIPS 2020 Taiji Suzuki

Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the limit of infinite network width to show global convergence.

Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime

no code implementations ICLR 2021 Atsushi Nitanda, Taiji Suzuki

In this study, we show that the averaged stochastic gradient descent can achieve the minimax optimal convergence rate, with the global convergence guarantee, by exploiting the complexities of the target function and the RKHS associated with the NTK.

Gradient Descent in RKHS with Importance Labeling

no code implementations19 Jun 2020 Tomoya Murata, Taiji Suzuki

In this paper, we study the importance labeling problem, in which we are given many unlabeled data points and select a limited number of them to be labeled; a learning algorithm is then executed on the selected data.

When Does Preconditioning Help or Hurt Generalization?

no code implementations ICLR 2021 Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, Ji Xu

While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question.

regression Second-order methods

Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks

1 code implementation NeurIPS 2020 Kenta Oono, Taiji Suzuki

By combining it with generalization gap bounds in terms of transductive Rademacher complexity, we derive a test error bound for a specific type of multi-scale GNN that decreases with the number of node aggregations under some conditions.

Learning Theory Transductive Learning

Selective Inference for Latent Block Models

no code implementations27 May 2020 Chihiro Watanabe, Taiji Suzuki

In this case, it becomes crucial to consider the selective bias in the block structure, that is, the block structure is selected from all the possible cluster memberships based on some criterion by the clustering algorithm.

Clustering Model Selection

Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint

no code implementations ICLR 2020 Jimmy Ba, Murat Erdogdu, Taiji Suzuki, Denny Wu, Tianzong Zhang

This paper investigates the generalization properties of two-layer neural networks in high dimensions, i.e., when the number of samples $n$, features $d$, and neurons $h$ tend to infinity at the same rate.

Inductive Bias Vocal Bursts Valence Prediction

Meta Cyclical Annealing Schedule: A Simple Approach to Avoiding Meta-Amortization Error

no code implementations4 Mar 2020 Yusuke Hayashi, Taiji Suzuki

To address this challenge, we design a novel meta-regularization objective using {\it cyclical annealing schedule} and {\it maximum mean discrepancy} (MMD) criterion.

Few-Shot Learning
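
As a hedged sketch of the first ingredient only (a generic cyclical annealing schedule; the MMD criterion is not reproduced here), a schedule that ramps a regularization weight up within each cycle might look as follows; `cycle_length` and `ramp_fraction` are illustrative parameters, not the paper's settings.

```python
def cyclical_weight(step, cycle_length, ramp_fraction=0.5, w_max=1.0):
    """Cyclical annealing: within each cycle, ramp the weight linearly from 0
    to w_max over the first `ramp_fraction` of the cycle, then hold it."""
    phase = (step % cycle_length) / cycle_length
    return w_max * min(1.0, phase / ramp_fraction)

# Weight over two cycles of length 10: 0.0, 0.2, ..., 1.0, 1.0, then it repeats.
print([round(cyclical_weight(t, 10), 2) for t in range(20)])
```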

Dimension-free convergence rates for gradient Langevin dynamics in RKHS

no code implementations29 Feb 2020 Boris Muzellec, Kanji Sato, Mathurin Massias, Taiji Suzuki

In this work, we provide a convergence analysis of GLD and SGLD when the optimization space is an infinite dimensional Hilbert space.

Understanding Generalization in Deep Learning via Tensor Methods

no code implementations14 Jan 2020 Jingling Li, Yanchao Sun, Jiahao Su, Taiji Suzuki, Furong Huang

Recently proposed complexity measures have provided insights to understanding the generalizability in neural networks from perspectives of PAC-Bayes, robustness, overparametrization, compression and so on.

Domain Adaptation Regularization for Spectral Pruning

no code implementations26 Dec 2019 Laurent Dillard, Yosuke Shinya, Taiji Suzuki

We also show that our method outperforms an existing compression method studied in the DA setting by a large margin for high compression rates.

Domain Adaptation Model Compression

Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features

no code implementations13 Nov 2019 Shingo Yashima, Atsushi Nitanda, Taiji Suzuki

To address this problem, sketching and stochastic gradient methods are the most commonly used techniques to derive efficient large-scale learning algorithms.

Binary Classification Classification +1

Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space

no code implementations NeurIPS 2021 Taiji Suzuki, Atsushi Nitanda

The results show that deep learning has better dependence on the input dimensionality if the target function possesses anisotropic smoothness, and it achieves an adaptive rate for functions with spatially inhomogeneous smoothness.

Towards Characterizing the High-dimensional Bias of Kernel-based Particle Inference Algorithms

no code implementations Approximate Inference AABI Symposium 2019 Jimmy Ba, Murat A. Erdogdu, Marzyeh Ghassemi, Taiji Suzuki, Shengyang Sun, Denny Wu, Tianzong Zhang

Particle-based inference algorithms are a promising method to efficiently generate samples from an intractable target distribution by iteratively updating a set of particles.

LEMMA

Scalable Deep Neural Networks via Low-Rank Matrix Factorization

no code implementations25 Sep 2019 Atsushi Yaguchi, Taiji Suzuki, Shuhei Nitta, Yukinobu Sakata, Akiyuki Tanizawa

Compressing deep neural networks (DNNs) is important for real-world applications operating on resource-constrained devices.

Image Classification

Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network

no code implementations ICLR 2020 Taiji Suzuki, Hiroshi Abe, Tomoaki Nishimura

However, the compression based bound can be applied only to a compressed network, and it is not applicable to the non-compressed original network.

Learning Theory

Understanding the Effects of Pre-Training for Object Detectors via Eigenspectrum

no code implementations9 Sep 2019 Yosuke Shinya, Edgar Simo-Serra, Taiji Suzuki

Furthermore, we propose a method for automatically determining the widths (the numbers of channels) of object detectors based on the eigenspectrum.

Image Classification Object +2

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

no code implementations26 Jun 2019 Kosuke Haruki, Taiji Suzuki, Yohei Hamakawa, Takeshi Toda, Ryuji Sakai, Masahiro Ozawa, Mitsuhiro Kimura

Large-batch stochastic gradient descent (SGD) is widely used for training in distributed deep learning because of its training-time efficiency; however, extremely large-batch SGD leads to poor generalization and easily converges to sharp minima, which prevents naive large-scale data-parallel SGD (DP-SGD) from converging to good minima.

Goodness-of-fit Test for Latent Block Models

no code implementations10 Jun 2019 Chihiro Watanabe, Taiji Suzuki

Latent block models are used for probabilistic biclustering, which is shown to be an effective method for analyzing various relational data sets.

Accelerated Sparsified SGD with Error Feedback

no code implementations29 May 2019 Tomoya Murata, Taiji Suzuki

Several works have shown that the sparsified stochastic gradient descent (SGD) method with error feedback asymptotically achieves the same rate as (non-sparsified) parallel SGD.

Distributed Optimization
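
The snippet above refers to the standard error-feedback mechanism; a generic (non-accelerated) top-$k$ sparsified step with a residual memory, given as an assumption-level sketch rather than the paper's proposed method, looks roughly like this.

```python
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def error_feedback_step(w, grad, memory, lr, k):
    """Compress (learning-rate-scaled gradient + carried-over error), apply the
    compressed update, and store what was dropped for the next round."""
    corrected = lr * grad + memory
    update = topk(corrected, k)
    memory = corrected - update
    return w - update, memory
```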

Graph Neural Networks Exponentially Lose Expressive Power for Node Classification

1 code implementation ICLR 2020 Kenta Oono, Taiji Suzuki

We show that when the Erdős–Rényi graph is sufficiently dense and large, a broad range of GCNs on it suffers from the "information loss" in the limit of infinite layers with high probability.

Classification General Classification +1

Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems

no code implementations23 May 2019 Atsushi Nitanda, Geoffrey Chinot, Taiji Suzuki

Most studies, with a few exceptions, have focused on regression problems with the squared loss function, and the importance of the positivity of the neural tangent kernel has been pointed out.

General Classification Generalization Bounds

On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces

no code implementations22 May 2019 Satoshi Hayakawa, Taiji Suzuki

Whereas existing theoretical studies of deep learning have been based mainly on mathematical theories of well-known function classes such as the Hölder and Besov classes, we focus on function classes with discontinuity and sparsity, which are those naturally assumed in practice.

Approximation and non-parametric estimation of ResNet-type convolutional neural networks via block-sparse fully-connected neural networks

no code implementations ICLR 2019 Kenta Oono, Taiji Suzuki

We develop new approximation and statistical learning theories of convolutional neural networks (CNNs) via the ResNet-type structure where the channel size, filter size, and width are fixed.

Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks

no code implementations24 Mar 2019 Kenta Oono, Taiji Suzuki

The key idea is that we can replicate the learning ability of fully-connected neural networks (FNNs) by tailored CNNs, as long as the FNNs have block-sparse structures.

Vocal Bursts Type Prediction

Adam Induces Implicit Weight Sparsity in Rectifier Neural Networks

no code implementations19 Dec 2018 Atsushi Yaguchi, Taiji Suzuki, Wataru Asano, Shuhei Nitta, Yukinobu Sakata, Akiyuki Tanizawa

In recent years, deep neural networks (DNNs) have been applied to various machine learning tasks, including image recognition, speech recognition, and machine translation.

Machine Translation speech-recognition +2

Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality

no code implementations ICLR 2019 Taiji Suzuki

In addition to this, it is shown that deep learning can avoid the curse of dimensionality if the target function is in a mixed smooth Besov space.

Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation

no code implementations NeurIPS 2018 Tomoya Murata, Taiji Suzuki

We develop new stochastic gradient methods for efficiently solving sparse linear regression in a partial attribute observation setting, where learners are only allowed to observe a fixed number of actively chosen attributes per example at training and prediction times.

Attribute

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

no code implementations14 Jun 2018 Atsushi Nitanda, Taiji Suzuki

In this paper, we show an exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions.

Binary Classification Classification +1

Cross-domain Recommendation via Deep Domain Adaptation

no code implementations8 Mar 2018 Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, Yukihiro Tagami, Taiji Suzuki

The behavior of users in certain services could be a clue that can be used to infer their preferences and may be used to make recommendations for other services they have never used.

Collaborative Filtering Denoising +2

Functional Gradient Boosting based on Residual Network Perception

no code implementations ICML 2018 Atsushi Nitanda, Taiji Suzuki

Residual Networks (ResNets) have become state-of-the-art models in deep learning and several theoretical studies have been devoted to understanding why ResNet works so well.

Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models

no code implementations7 Jan 2018 Atsushi Nitanda, Taiji Suzuki

In this paper, this phenomenon is explained from the functional gradient method perspective of the gradient layer.

Independently Interpretable Lasso: A New Regularizer for Sparse Regression with Uncorrelated Variables

no code implementations6 Nov 2017 Masaaki Takada, Taiji Suzuki, Hironori Fujisawa

However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features.

regression

Fast learning rate of deep learning via a kernel perspective

no code implementations29 May 2017 Taiji Suzuki

Our point of view is to deal with the ordinary finite dimensional deep neural network as a finite approximation of the infinite dimensional one.

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

no code implementations NeurIPS 2017 Tomoya Murata, Taiji Suzuki

In this paper, we develop a new accelerated stochastic gradient method for efficiently solving the convex regularized empirical risk minimization problem in mini-batch settings.

Learning Sparse Structural Changes in High-dimensional Markov Networks: A Review on Methodologies and Theories

no code implementations6 Jan 2017 Song Liu, Kenji Fukumizu, Taiji Suzuki

Recent years have seen increasing popularity of learning the sparse changes in Markov networks.

Minimax Optimal Alternating Minimization for Kernel Nonparametric Tensor Learning

no code implementations NeurIPS 2016 Taiji Suzuki, Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, Yukihiro Tagami

We investigate the statistical performance and computational efficiency of the alternating minimization procedure for nonparametric tensor learning.

Computational Efficiency

Stochastic dual averaging methods using variance reduction techniques for regularized empirical risk minimization problems

no code implementations8 Mar 2016 Tomoya Murata, Taiji Suzuki

We consider a composite convex minimization problem associated with regularized empirical risk minimization, which often arises in machine learning.

BIG-bench Machine Learning

Structure Learning of Partitioned Markov Networks

no code implementations2 Apr 2015 Song Liu, Taiji Suzuki, Masashi Sugiyama, Kenji Fukumizu

We learn the structure of a Markov Network between two groups of random variables from joint observations.

Time Series Time Series Analysis

Spectral norm of random tensors

no code implementations7 Jul 2014 Ryota Tomioka, Taiji Suzuki

We show that the spectral norm of a random $n_1\times n_2\times \cdots \times n_K$ tensor (or higher-order array) scales as $O\left(\sqrt{(\sum_{k=1}^{K}n_k)\log(K)}\right)$ under some sub-Gaussian assumption on the entries.
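
A quick numerical sanity check of the matrix case $K = 2$ (not from the paper): the spectral norm of an $n_1\times n_2$ standard Gaussian matrix concentrates around $\sqrt{n_1}+\sqrt{n_2}$, consistent with the $\sqrt{n_1+n_2}$ scaling of the bound up to constant factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spectral norm of Gaussian random matrices versus sqrt(n1) + sqrt(n2).
for n1, n2 in [(100, 100), (400, 100), (1600, 400)]:
    A = rng.normal(size=(n1, n2))
    print(n1, n2, round(np.linalg.norm(A, 2), 1), round(np.sqrt(n1) + np.sqrt(n2), 1))
```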

Stochastic Dual Coordinate Ascent with Alternating Direction Multiplier Method

no code implementations4 Nov 2013 Taiji Suzuki

We propose a new stochastic dual coordinate ascent technique that can be applied to a wide range of regularized learning problems.

Convex Tensor Decomposition via Structured Schatten Norm Regularization

no code implementations NeurIPS 2013 Ryota Tomioka, Taiji Suzuki

We discuss structured Schatten norms for tensor decomposition that includes two recently proposed norms ("overlapped" and "latent") for convex-optimization-based tensor decomposition, and connect tensor decomposition with wider literature on structured sparsity.

Tensor Decomposition

Density-Difference Estimation

no code implementations NeurIPS 2012 Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi

A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference.

Change Point Detection

Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness

no code implementations2 Mar 2012 Taiji Suzuki, Masashi Sugiyama

If the ground truth is smooth, we show a faster convergence rate for the elastic-net regularization with less conditions than $\ell_1$-regularization; otherwise, a faster convergence rate for the $\ell_1$-regularization is shown.

Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

no code implementations NeurIPS 2011 Taiji Suzuki

Finally, we show that, when the complexities of candidate reproducing kernel Hilbert spaces are inhomogeneous, dense-type regularization shows a better learning rate than sparse ℓ1 regularization.

Vocal Bursts Type Prediction

Relative Density-Ratio Estimation for Robust Distribution Comparison

no code implementations NeurIPS 2011 Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test.

Density Ratio Estimation Outlier Detection +1

Condition Number Analysis of Kernel-based Density Ratio Estimation

1 code implementation15 Dec 2009 Takafumi Kanamori, Taiji Suzuki, Masashi Sugiyama

We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties.

Density Ratio Estimation feature selection +1
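
A minimal sketch of the kernel least-squares density-ratio estimator discussed above (a uLSIF-style fit with Gaussian kernels); the bandwidth `sigma`, regularization `lam`, and choice of `centers` are placeholders, and this is not the linked repository's implementation.

```python
import numpy as np

def kernel_ls_density_ratio(X_num, X_den, centers, sigma, lam):
    """Fit r(x) = sum_l alpha_l * k(x, c_l) by regularized least squares so
    that r approximates the ratio p_num / p_den; k is a Gaussian kernel."""
    def K(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))
    H = K(X_den, centers).T @ K(X_den, centers) / len(X_den)  # E_den[phi phi^T]
    h = K(X_num, centers).mean(axis=0)                        # E_num[phi]
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda X: K(X, centers) @ alpha
```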
