We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially boosts zero-shot performance on unseen tasks.
Ranked #1 on Question Answering on OBQA
Most ML approaches focus on generalization performance on unseen data that are similar to the training data (In-Distribution, or IND).
One important challenge of applying deep learning to electronic health records (EHR) is the complexity of their multimodal structure.
The paradigm of 'pretraining' from a set of relevant auxiliary tasks and then 'finetuning' on a target task has been successfully applied in many different domains.
Given a quantum circuit, a quantum computer can sample the output distribution exponentially faster in the number of bits than classical computers.
Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network.
When training the parameters of a linear dynamical model, the gradient descent algorithm is likely to fail to converge if the squared-error loss is used as the training loss function.
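As a concrete illustration of the setup this snippet refers to, here is a minimal sketch of the squared-error training loss for a linear dynamical model. The state-space form x_{t+1} = A x_t + B u_t, y_t = C x_t and all names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

# Minimal sketch (illustrative notation): a linear dynamical model
# x_{t+1} = A x_t + B u_t with output y_t = C x_t, trained by unrolling
# and accumulating the squared output error. Gradient descent on this
# loss w.r.t. (A, B, C) is the procedure the snippet says can fail to converge.
def squared_error_loss(A, B, C, u, y_obs, x0):
    x, loss = x0, 0.0
    for u_t, y_t in zip(u, y_obs):
        loss += float(np.sum((C @ x - y_t) ** 2))  # squared output error at step t
        x = A @ x + B @ u_t                        # advance the hidden state
    return loss
```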
(2) The update of the flow model approximately minimizes the Jensen-Shannon divergence between the flow model and the data distribution.
Ranked #4 on Image Generation on CelebA-HQ 64x64
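For reference, the Jensen-Shannon divergence named in the snippet above, between the data distribution and the flow model (standard definition; the symbols p_data and p_theta are illustrative):

```latex
\mathrm{JS}\bigl(p_{\mathrm{data}} \,\|\, p_\theta\bigr)
  = \tfrac{1}{2}\,\mathrm{KL}\bigl(p_{\mathrm{data}} \,\|\, m\bigr)
  + \tfrac{1}{2}\,\mathrm{KL}\bigl(p_\theta \,\|\, m\bigr),
\qquad m = \tfrac{1}{2}\bigl(p_{\mathrm{data}} + p_\theta\bigr)
```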
Time series data are prevalent in electronic health records, mostly in the form of physiological parameters such as vital signs and lab tests.
The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring that sensitive data be shared or stored in a central repository.
In this paper, we investigate the learning biases that affect the efficacy and compositionality of emergent languages.
Clinical notes in electronic health records contain highly heterogeneous writing styles, including non-standard terminology or abbreviations.
A recent study showed that using the graphical structure underlying EHR data (e.g., the relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure prediction.
We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
1 code implementation • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov
The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated sequestered as test data.
Ranked #7 on Question Answering on Natural Questions (long)
In this paper, we present Smart Compose, a novel system for generating interactive, real-time suggestions in Gmail that assists users in writing mails by reducing repetitive typing.
This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length.
Ranked #3 on Music Modeling on JSB Chorales
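To make the quadratic memory cost concrete: in a naive implementation of relative attention, every query-key pair (i, j) looks up an embedding of the distance i - j, materializing an (L, L, D) tensor. A minimal numpy sketch follows; all shapes and names are illustrative, not the paper's code.

```python
import numpy as np

# Naive relative attention logits: materializes an (L, L, D) tensor of
# per-pair relative-position embeddings -- the quadratic-in-L intermediate
# memory cost the snippet above refers to. Shapes are illustrative only.
L, D = 256, 64                            # sequence length, head dimension
q = np.random.randn(L, D)                 # queries
rel_emb = np.random.randn(2 * L - 1, D)   # one embedding per distance in [-(L-1), L-1]

dist = np.arange(L)[:, None] - np.arange(L)[None, :]  # (L, L) distance matrix
r = rel_emb[dist + (L - 1)]                           # (L, L, D): the memory bottleneck
s_rel = np.einsum('id,ijd->ij', q, r)                 # relative logits: q_i . r_ij
```

The paper's improved algorithm avoids materializing this (L, L, D) tensor, bringing the intermediate memory requirement down to linear in the sequence length.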
no code implementations • 20 Aug 2018 • Samuel S. Schoenholz, Sean Hackett, Laura Deming, Eugene Melamud, Navdeep Jaitly, Fiona McAllister, Jonathon O'Brien, George Dahl, Bryson Bennett, Andrew M. Dai, Daphne Koller
As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain.
Ideally, we could incorporate our prior knowledge of this hierarchical structure into unsupervised learning algorithms that work on text data.
Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge.
no code implementations • 24 Jan 2018 • Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Peter J. Liu, Xiaobing Liu, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Gavin E. Duggan, Gerardo Flores, Michaela Hardt, Jamie Irvine, Quoc Le, Kurt Litsch, Jake Marcus, Alexander Mossin, Justin Tansuwan, De Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L. Volchenboum, Katherine Chou, Michael Pearson, Srinivasan Madabushi, Nigam H. Shah, Atul J. Butte, Michael Howell, Claire Cui, Greg Corrado, Jeff Dean
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality.
Additionally, these models are typically trained via maximum likelihood and teacher forcing.
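For readers unfamiliar with the term, a minimal sketch of maximum-likelihood training with teacher forcing; `model` and all shapes are hypothetical.

```python
import numpy as np

# Teacher forcing: at every step the model is conditioned on the ground-truth
# previous tokens (not its own samples), and training maximizes the likelihood
# of the true next token. `model` is a hypothetical function returning
# next-token probabilities, shape (T-1, vocab), one row per prefix.
def teacher_forcing_nll(model, target_tokens):
    target_tokens = np.asarray(target_tokens)
    probs = model(target_tokens[:-1])        # p(x_t | x_1..x_{t-1}) for t = 2..T
    steps = np.arange(len(target_tokens) - 1)
    return -np.log(probs[steps, target_tokens[1:]]).sum()  # negative log-likelihood
```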
One challenge in applying such techniques to building goal-oriented conversation models is that maximum likelihood-based models are not optimized toward accomplishing goals.
Neural autoregressive and seq2seq models that generate text by sampling words sequentially, with each word conditioned on the previous words, are state-of-the-art for several machine translation and summarization benchmarks.
Ranked #4 on Multivariate Time Series Imputation on PEMS-SF
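A minimal sketch of the sequential sampling described above, where each word is drawn conditioned on the words generated so far; `model`, the token ids, and the stopping rule are all illustrative assumptions.

```python
import numpy as np

# Ancestral (sequential) sampling: each next word is drawn from the model's
# distribution conditioned on all previously generated words. `model` is a
# hypothetical callable mapping a token-id prefix to a probability vector.
def sample_sequence(model, bos_id, eos_id, max_len=50, seed=0):
    rng = np.random.default_rng(seed)
    tokens = [bos_id]
    for _ in range(max_len):
        probs = model(tokens)                     # p(next word | words so far)
        nxt = int(rng.choice(len(probs), p=probs))
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens
```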
Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost.
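The game described above is usually written as the minimax objective below (the standard GAN formulation; in practice, as the snippet notes, each player minimizes its own cost, e.g. the generator often uses a non-saturating variant):

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```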
We also show that our method performs better than competing algorithms by Welinder and Perona (2010), and by Mnih and Hinton (2012).
Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting.
Ranked #15 on Sentiment Analysis on IMDb
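In one standard formulation (symbols are illustrative: l is the supervised loss, epsilon the perturbation budget, theta-hat the current parameters held fixed), the two kinds of perturbation mentioned above are:

```latex
% Adversarial training: worst-case input perturbation under the supervised loss
r_{\mathrm{adv}} = \arg\max_{\|r\| \le \epsilon} \; \ell\bigl(y,\, f_\theta(x + r)\bigr)

% Virtual adversarial training: label-free, perturbs against the model's own
% predictions, which is what extends the idea to the semi-supervised setting
r_{\mathrm{vadv}} = \arg\max_{\|r\| \le \epsilon} \;
  \mathrm{KL}\bigl[\, p(\cdot \mid x; \hat{\theta}) \,\big\|\, p(\cdot \mid x + r; \theta) \,\bigr]
```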
The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation.
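Concretely, the word-at-a-time generation described above corresponds to the autoregressive factorization below, which has no global sentence variable; a model with an explicit global representation would instead condition every word on a shared latent code z (the second form is a generic latent-variable sketch, not a specific paper's model):

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\bigl(x_t \mid x_1, \dots, x_{t-1}\bigr)
\qquad \text{vs.} \qquad
p(x_1, \dots, x_T) = \int p(z) \prod_{t=1}^{T} p\bigl(x_t \mid x_{<t},\, z\bigr)\, dz
```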
In our experiments, we find that long short-term memory recurrent networks, after being pretrained with the two approaches, are more stable and generalize better.
Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations for pieces of text.
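For orientation, a usage sketch of Paragraph Vectors via gensim's Doc2Vec implementation; the toy corpus and hyperparameters are illustrative only, not recommendations.

```python
# Paragraph Vectors via gensim's Doc2Vec (toy corpus and hyperparameters
# are for illustration only).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "cat", "sat", "down"], tags=["doc0"]),
    TaggedDocument(words=["the", "dog", "ran", "away"], tags=["doc1"]),
]
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)
vec = model.infer_vector(["a", "cat", "ran"])  # vector for an unseen paragraph
```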
However, until now, Hierarchical Dirichlet Process (HDP) mixtures have not seen significant use in supervised problems with grouped data, since a straightforward application of the HDP to the grouped data results in learnt clusters that are not predictive of the responses.