To empower NAM with feature selection and improve generalization, we propose sparse neural additive models (SNAM) that employ group sparsity regularization (e.g., Group LASSO), where each feature is learned by a sub-network whose trainable parameters are clustered as a group.
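A minimal sketch of the idea in PyTorch, assuming a regression setup; the sub-network sizes and penalty weight below are illustrative choices, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SNAM(nn.Module):
    """Sparse neural additive model: one small sub-network per input feature.

    The group-lasso penalty treats each sub-network's parameters as one group,
    so whole features can be suppressed (feature selection).
    """
    def __init__(self, num_features, hidden=32):
        super().__init__()
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (batch, num_features)
        # Each feature contributes an additive term produced by its own sub-network.
        contributions = [net(x[:, j:j + 1]) for j, net in enumerate(self.subnets)]
        return torch.cat(contributions, dim=1).sum(dim=1, keepdim=True) + self.bias

    def group_lasso_penalty(self):
        # One group = all trainable parameters of a single feature's sub-network.
        return sum(
            torch.sqrt(sum(p.pow(2).sum() for p in net.parameters()))
            for net in self.subnets
        )

# Usage: add the group penalty to the data-fitting loss (lambda = 1e-3 is arbitrary here).
model = SNAM(num_features=10)
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = F.mse_loss(model(x), y) + 1e-3 * model.group_lasso_penalty()
loss.backward()
```

In practice the group-lasso term is often handled with a proximal or thresholding step so that entire sub-networks are driven exactly to zero; adding it as a plain penalty, as above, only shrinks them.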
Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior.
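One way to read this: a reference prior is the prior that maximizes the mutual information between the parameters and the data to be observed, so that this data moves the posterior as much as possible. Schematically, in our notation rather than the paper's,

```latex
\pi^{\ast} \;=\; \operatorname*{arg\,max}_{\pi}\; I(\theta;\,\mathcal{D})
\;=\; \operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{\mathcal{D}}\Big[\,\mathrm{KL}\big(p(\theta \mid \mathcal{D})\,\|\,\pi(\theta)\big)\Big],
```

where, for pretraining, the prior is computed using labeled source-task data and \mathcal{D} stands for the target-task data.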
We call this 'retrospective learning'.
This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges.
Heterogeneity in medical data, e.g., from data collected at different sites and with different protocols in a clinical study, is a fundamental hurdle for accurate prediction using machine learning models, as such models often fail to generalize well.
This paper argues that continual learning methods can benefit by splitting the capacity of the learner across multiple models.
This approach ranks first on the Coarse-CIFAR100 continual learning benchmark.
We empirically demonstrate that our approach can predict the rope state accurately up to ten steps into the future and that our algorithm can find the optimal action given an initial state and a goal state.
We can also tackle situations where ground-truth labels on the target data are unavailable: we show how one can use auxiliary tasks for adaptation, employing covariates such as age, gender, and race that are easy to obtain yet correlated with the main task.
Current Reinforcement Learning (RL) algorithms rely on too many experiments to learn good actions, which limits their applicability in real-world settings where exploration can be too expensive.
Our method can handle an arbitrary number of pursuers and targets; we show results for tasks consisting of up to 1000 pursuers tracking 1000 targets.
Using tools in information geometry, the distance is defined to be the length of the shortest weight trajectory on a Riemannian manifold as a classifier is fitted on an interpolated task.
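Schematically (notation ours), with the Fisher Information Matrix g(w) acting as the metric on weight space, the distance from a source task s to a target task t is

```latex
d(s \to t) \;=\; \min_{w(\cdot)} \int_{0}^{1} \sqrt{\dot{w}(\tau)^{\top}\, g\big(w(\tau)\big)\, \dot{w}(\tau)}\;\mathrm{d}\tau,
```

where w(\tau) is the weight trajectory of the classifier as it is fitted on a task interpolated between s (at \tau = 0) and t (at \tau = 1).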
Autonomous navigation in crowded, complex urban environments requires interacting with other agents on the road.
This paper prescribes a suite of techniques for off-policy Reinforcement Learning (RL) that simplify the training process and reduce the sample complexity.
Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators.
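As a rough illustration of the kind of ensemble such systems assemble, here is a stacked model over trees, a small network, and a nearest-neighbor estimator; scikit-learn is used only as a stand-in, not as the tooling behind the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Toy data; an AutoML system would search over many such base models and combiners.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("trees", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("net", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),  # combines out-of-fold predictions of the base models
)
ensemble.fit(X, y)
```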
Autonomous race cars require perception, estimation, planning, and control modules which work together asynchronously while driving at the limit of a vehicle's handling capability.
We present TraDE, a self-attention-based architecture for auto-regressive density estimation with continuous- and discrete-valued data.
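Underneath, this is the usual autoregressive factorization, with each conditional parameterized by causally masked self-attention:

```latex
p(x_1, \ldots, x_d) \;=\; \prod_{i=1}^{d} p\big(x_i \,\big|\, x_1, \ldots, x_{i-1}\big).
```

Continuous coordinates can then use, for instance, a mixture-of-Gaussians output head and discrete ones a categorical head; we show this only as one common parameterization, not necessarily the exact heads used in the paper.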
This paper employs a formal connection of machine learning with thermodynamics to characterize the quality of learnt representations for transfer learning.
Our findings challenge common fine-tuning practices and encourage deep learning practitioners to rethink the hyperparameters used for fine-tuning.
Adversarial Training is a training procedure that aims to produce models robust to worst-case perturbations around predefined points.
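Concretely, the standard worst-case objective is the min-max problem

```latex
\min_{\theta}\; \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \le \epsilon} \ell\big(f_\theta(x + \delta),\, y\big) \right],
```

where the inner maximization searches for the worst perturbation \delta within a norm ball of radius \epsilon around each input x (an \ell_\infty ball is a common choice) and is typically approximated with projected gradient ascent.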
When fine-tuned transductively, this outperforms the current state-of-the-art on standard datasets such as Mini-ImageNet, Tiered-ImageNet, CIFAR-FS and FC-100 with the same hyper-parameters.
Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is effective in reducing the sample complexity of state-of-the-art algorithms.
So SGD does perform variational inference, but for a different loss than the one used to compute the gradients.
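A schematic statement of the result, in our notation: with learning rate \eta and batch size b, the stationary distribution of SGD looks like

```latex
\rho^{\mathrm{ss}}(x) \;\propto\; \exp\!\big(-\Phi(x)/T\big), \qquad T \propto \eta/b,
```

where \Phi is an implicit potential that coincides with the training loss f (the one whose minibatch gradients SGD actually uses) only when the gradient noise is isotropic; in general \Phi \neq f, which is the sense in which the variational inference is performed for a different loss.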
We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters.
In this paper we establish a connection between non-convex optimization methods for training deep neural networks and nonlinear partial differential equations (PDEs).
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape.
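Up to constants and sign conventions, the modified objective is the "local entropy" of the loss f, a log-Gaussian smoothing that scores wide, flat valleys better than sharp ones:

```latex
f_\gamma(x) \;=\; -\log \int \exp\!\Big(-f(x') \;-\; \tfrac{\gamma}{2}\,\|x - x'\|_2^2\Big)\,\mathrm{d}x'.
```

Entropy-SGD minimizes f_\gamma instead of f, estimating its gradient with an inner Langevin-dynamics (SGLD) loop; the PDE viewpoint above interprets this smoothing as the solution of a viscous Hamilton-Jacobi equation, which is the connection to nonlinear PDEs.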
Specifically, we show that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima, all the way down to a trivial landscape with a unique minimum.