Pre-training (PT) followed by fine-tuning (FT) is an effective method for training neural networks, and has led to significant performance improvements in many domains.
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables.
Standard first-order stochastic optimization algorithms base their updates solely on the average mini-batch gradient, and it has been shown that tracking additional quantities such as the curvature can reduce sensitivity to common hyperparameters.
Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty.
Differential equations parameterized by neural networks become expensive to solve numerically as training progresses.
Standard variational lower bounds used to train latent variable models produce biased estimates of most quantities of interest.
Explanations of time series models are useful for high-stakes applications like healthcare, but have received little attention in the machine learning literature.
We estimate the Stein discrepancy between the data density $p(x)$ and the model density $q(x)$ defined by a vector function of the data.
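For reference, the quantity takes the usual critic form (notation assumed here, with f a vector-valued critic function and the supremum taken over a suitable function class):

\mathrm{SD}(p, q) \;=\; \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim p}\!\left[ f(x)^{\top} \nabla_x \log q(x) + \operatorname{tr}\!\big(\nabla_x f(x)\big) \right]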
The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations.
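In the usual notation, with dynamics dz/dt = f(z(t), t, \theta) and adjoint state a(t) = \partial L / \partial z(t), the method integrates (a standard statement of the adjoint equations, not quoted from the paper):

\frac{da}{dt} = -\,a(t)^{\top} \frac{\partial f}{\partial z}, \qquad \frac{dL}{d\theta} = -\int_{t_1}^{t_0} a(t)^{\top} \frac{\partial f}{\partial \theta}\, dt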
In this setting, the standard class probabilities can be easily computed, along with unnormalized values of p(x) and p(x|y).
We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations.
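The identity being exploited, stated in standard form (L_T the training loss, L_V the validation loss, w the weights, \lambda the hyperparameters; in practice the inverse Hessian is replaced by an approximation such as a truncated Neumann series):

\frac{dL_V}{d\lambda} \;=\; \frac{\partial L_V}{\partial \lambda} \;-\; \frac{\partial L_V}{\partial w} \left[\frac{\partial^2 L_T}{\partial w\, \partial w^{\top}}\right]^{-1} \frac{\partial^2 L_T}{\partial w\, \partial \lambda^{\top}}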
Our model generates graphs one block of nodes and associated edges at a time.
We propose a method to automatically compute the importance of features at every observation in a time series, by simulating counterfactual trajectories given previous observations.
Time series with non-uniform intervals occur in many applications, and are difficult to model using standard recurrent neural networks (RNNs).
Flow-based generative models parameterize probability distributions through an invertible transformation and can be trained by maximum likelihood.
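The maximum-likelihood objective rests on the change-of-variables formula; for an invertible map z = f(x) with base density p_z (standard notation, not specific to any one model):

\log p_x(x) \;=\; \log p_z\big(f(x)\big) \;+\; \log \left| \det \frac{\partial f(x)}{\partial x} \right|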
Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems.
We show that standard ResNet architectures can be made invertible, allowing the same model to be used for classification, density estimation, and generation.
A popular matrix completion algorithm is matrix factorization, where ratings are predicted by combining learned user and item parameter vectors.
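For concreteness, a minimal numpy sketch of that prediction step (the sizes and the names U, V, and predict_rating are illustrative placeholders, not taken from the paper):

import numpy as np

n_users, n_items, k = 100, 50, 8               # assumed sizes for illustration
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # learned user parameter vectors
V = rng.normal(scale=0.1, size=(n_items, k))   # learned item parameter vectors

def predict_rating(user_id, item_id):
    # predicted rating is the inner product of the two parameter vectors
    return U[user_id] @ V[item_id]

# training would fit U and V by minimizing squared error on observed ratings
print(predict_rating(3, 7))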
The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures.
Many deep learning algorithms can be easily fooled by simple adversarial examples.
We can rephrase this question to ask: which parts of the image, if they were not seen by the classifier, would most change its decision?
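A rough sketch of that counterfactual question as an occlusion pass (the classifier interface, patch size, and baseline fill value are assumptions; published saliency methods typically learn the mask rather than sliding a fixed patch):

import numpy as np

def occlusion_importance(model, image, patch=8, baseline=0.0):
    # score each region by how much hiding it changes the top-class probability
    probs = model(image)                       # assumed: returns class probabilities
    top = int(np.argmax(probs))
    h, w = image.shape[:2]
    heatmap = np.zeros(((h + patch - 1) // patch, (w + patch - 1) // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = baseline   # the classifier "does not see" this region
            heatmap[i // patch, j // patch] = probs[top] - model(masked)[top]
    return heatmap   # large values mark regions whose removal most changes the decision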
Recommender systems can be formulated as a matrix completion problem, predicting ratings from user and item parameter vectors.
Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network.
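A minimal PyTorch sketch of the idea, with fixed-step Euler integration standing in for the adaptive solvers used in practice (ODEFunc and all sizes here are illustrative):

import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    # neural network that parameterizes the derivative dh/dt of the hidden state
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):
        return self.net(h)

def odeint_euler(func, h0, t0=0.0, t1=1.0, steps=20):
    # integrate dh/dt = func(t, h) from t0 to t1 with a fixed-step Euler scheme
    h, dt = h0, (t1 - t0) / steps
    for k in range(steps):
        h = h + dt * func(t0 + k * dt, h)
    return h

h0 = torch.randn(32, 16)           # batch of initial hidden states
h1 = odeint_euler(ODEFunc(), h0)   # the "output layer" is the ODE solution at t1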
Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters.
We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables.
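The decomposition referred to, in the standard notation with encoder q(z|x) and aggregate posterior q(z); the middle total-correlation term is the one identified:

\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\right] \;=\; \underbrace{I_q(x; z)}_{\text{index-code MI}} \;+\; \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\textstyle\prod_j q(z_j)\Big)}_{\text{total correlation}} \;+\; \underbrace{\textstyle\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}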
Furthermore, we show that the parameters used to increase the expressiveness of the approximation play a role in generalizing inference rather than simply increasing the complexity of the approximation.
Variational Bayesian neural nets combine the flexibility of deep learning with Bayesian uncertainty estimation.
Gradient-based optimization is the foundation of deep learning and reinforcement learning.
The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound.
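That bound, in its usual form with K importance samples drawn from q(z|x) (K = 1 recovers the standard evidence lower bound):

\mathcal{L}_K \;=\; \mathbb{E}_{z_1, \dots, z_K \sim q(z \mid x)}\!\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k \mid x)}\right] \;\le\; \log p(x)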
We propose a simple and general variant of the standard reparameterized gradient estimator for the variational evidence lower bound.
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation.
Reaction prediction remains one of the major challenges for organic chemistry, and is a prerequisite for efficient synthetic planning.
We propose a general modeling and inference framework that composes probabilistic graphical models with deep learning methods and combines their respective strengths.
We introduce a convolutional neural network that operates directly on graphs.
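A simplified message-passing sketch of such a graph convolution (numpy; a single shared weight matrix per layer and a linear readout are stand-ins for the degree-specific weights and softmax readout of the actual model):

import numpy as np

def graph_conv_layer(node_feats, adjacency, W):
    # each atom aggregates its neighbours' features, then applies a shared dense layer
    messages = node_feats + adjacency @ node_feats
    return np.tanh(messages @ W)

def fingerprint(node_feats, adjacency, conv_weights, readout_weight):
    # pool per-layer atom features into a fixed-length molecular fingerprint
    fp = np.zeros(readout_weight.shape[1])
    h = node_feats
    for W in conv_weights:
        h = graph_conv_layer(h, adjacency, W)
        fp += h.sum(axis=0) @ readout_weight
    return fp

# toy usage: 5 atoms, 4 input features, 2 conv layers, 16-dimensional fingerprint
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.3).astype(float)
A = np.maximum(A, A.T)
X = rng.normal(size=(5, 4))
Ws = [rng.normal(size=(4, 4)) for _ in range(2)]
print(fingerprint(X, A, Ws, rng.normal(size=(4, 16))).shape)   # (16,)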
Ranked #2 on Drug Discovery on HIV dataset
By tracking the change in entropy over this sequence of transformations during optimization, we form a scalable, unbiased estimate of the variational lower bound on the log marginal likelihood.
In practical Bayesian optimization, we must often search over structures with differing numbers of parameters.
Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance.
This paper presents the beginnings of an automatic statistician, focusing on regression problems.
Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art.