Diffusion models, which are based on iterative denoising, have recently been proposed and applied to various generation tasks such as image generation.
Despite the emergence of principled methods for domain adaptation under label shift, their sensitivity to shifts in class-conditional distributions remains underexplored.
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer.
Ranked #2 on Science Question Answering on ScienceQA
We discover the following design patterns: (i) group layers in a spindle pattern; (ii) allocate the number of trainable parameters to layers uniformly; (iii) tune all the groups; (iv) assign proper tuning strategies to different groups.
However, in real-world applications, it is common for the training sets to have long-tailed distributions.
Formally, this task is broken down into two subtasks: (i) Mixture Proportion Estimation (MPE), which determines the fraction of positive examples in the unlabeled data; and (ii) PU-learning, which, given such an estimate, learns the desired positive-versus-negative classifier.
We design automated supervised learning systems for data tables that contain not only numeric and categorical columns but also text fields.
Goal-oriented reinforcement learning algorithms are often good at exploration, not exploitation, while episodic algorithms excel at exploitation, not exploration.
In reinforcement learning, a map with states and transitions built based on historical trajectories is often helpful in exploration and exploitation.
We present TraDE, a self-attention-based architecture for auto-regressive density estimation with continuous and discrete valued data.
Tiering is an essential technique for building large-scale information retrieval systems.
A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable.
We provide both theoretical and empirical evidence for the soundness of our approach through a necessary and sufficient decomposition of exchangeable time series into a global and a local part.
29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
Many graph analytics problems can be solved via iterative algorithms where the solutions are often characterized by a set of steady-state conditions.
Faced with distribution shift between training and test set, we wish to detect and quantify the shift, and to correct our classifiers without test set labels.
Knowledge bases (KBs), both automatically and manually constructed, are often incomplete: many valid facts can be inferred from the KB by synthesizing existing information.
In this work, we propose Canopy, a sampler based on Cover Trees that is exact, has guaranteed runtime logarithmic in the number of atoms, and is provably polynomial in the inherent dimensionality of the underlying parameter space.
In this paper, we consider a formulation of multitask learning that learns the relationships both between tasks and between features, represented through a task covariance and a feature covariance matrix, respectively.
In this context, the MMD may be used in two roles: first, as a discriminator, either directly on the samples, or on features of the samples.
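As a rough illustration of the quantity a discriminator of this kind would compute, the sketch below estimates the squared maximum mean discrepancy (MMD) between two sample sets. The Gaussian RBF kernel, the bandwidth, the biased (V-statistic) estimator, and all function names are assumptions made for this example; the snippet above does not say which kernel or estimator the paper uses.

```python
import math
import random

def rbf(u, v, bandwidth=1.0):
    # Gaussian RBF kernel between two points (assumed kernel choice).
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    # Average kernel value over all pairs (x, y).
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2_biased(xs, ys, bandwidth=1.0):
    # Biased estimate of MMD^2: E k(x, x') + E k(y, y') - 2 E k(x, y).
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )

def gaussian_sample(rng, n, loc):
    # Hypothetical helper: n two-dimensional points from N(loc, 1) per coordinate.
    return [(rng.gauss(loc, 1.0), rng.gauss(loc, 1.0)) for _ in range(n)]

rng = random.Random(0)
# Two samples from the same distribution: the estimate should be close to zero.
same = mmd2_biased(gaussian_sample(rng, 100, 0.0), gaussian_sample(rng, 100, 0.0))
# Samples from distributions with different means: the estimate should be larger.
shifted = mmd2_biased(gaussian_sample(rng, 100, 0.0), gaussian_sample(rng, 100, 2.0))
```

Here `shifted` should come out noticeably larger than `same`, which is what lets the statistic act as a two-sample discriminator. Applying it to learned features rather than raw samples would only change what is passed in as `xs` and `ys`.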
It is well known that the DANE algorithm does not match the communication complexity lower bounds.
Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.
Knowing which words have been attended to in previous time steps while generating a translation is a rich source of information for predicting what words will be attended to in the future.
This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.
Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively.
Ranked #5 on Visual Question Answering (VQA) on VQA v1 test-std
We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.
We consider the problem of Bayesian learning on sensitive datasets and present two simple but somewhat surprising results that connect Bayesian learning to "differential privacy", a cryptographic approach to protect individual-level privacy while permitting database-level utility.
The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters.
Ranked #54 on Image Classification on MNIST
We study a novel spline-like basis, which we name the "falling factorial basis", bearing many similarities to the classic truncated power basis.
Although nonlinear variants of PCA and CCA have been proposed, these are computationally prohibitive at large scale.
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation.
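One common way hashing is used for dimensionality reduction is the signed feature-hashing trick, sketched below for bag-of-words input. This is a minimal illustration under stated assumptions, not necessarily the construction the paper analyzes: the MD5 hash, the bucket count, and the function name `hashed_features` are all choices made for the example.

```python
import hashlib

def hashed_features(tokens, num_buckets=16):
    """Map a list of string tokens to a fixed-length vector by hashing."""
    vec = [0.0] * num_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first four bytes choose the bucket (assumed scheme).
        index = int.from_bytes(digest[:4], "little") % num_buckets
        # A further byte chooses a sign, so colliding tokens tend to cancel
        # rather than always accumulate.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec

doc = ["hashing", "is", "an", "effective", "strategy"]
features = hashed_features(doc, num_buckets=16)
```

The output length depends only on `num_buckets`, not on the vocabulary size, so no dictionary of tokens needs to be stored. The map is also linear in token counts: hashing the concatenation of two documents gives the sum of their hashed vectors.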