no code implementations • 21 Sep 2023 • Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications.
On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage.
2 code implementations • 9 Jun 2023 • Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences.
We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing.
When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs -- complex queries that describe content elements (e. g., book characters or events), information beyond the document text (e. g., descriptions of book covers), or personal context (e. g., when they read a book).
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis.
The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding.
1 code implementation • 13 Mar 2023 • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.
Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device.
In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.
Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning.
To achieve this, we design a novel action space that allows flexible editing of the initial prompts covering a wide set of commonly-used components like instructions, few-shot exemplars, and verbalizers.
This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters.
It is expensive to collect training data for every possible domain that a vision model may encounter when deployed.
Representation learning often plays a critical role in reinforcement learning by managing the curse of dimensionality.
no code implementations • 16 Jul 2022 • Vainavi Viswanath, Kaushik Shivakumar, Justin Kerr, Brijen Thananjeyan, Ellen Novoseller, Jeffrey Ichnowski, Alejandro Escontrela, Michael Laskey, Joseph E. Gonzalez, Ken Goldberg
Cables are ubiquitous in many settings and it is often useful to untangle them.
We demonstrate that it is possible to fine-tune both ResNet-18 and BERT within the memory constraints of a Cortex-M class embedded device while outperforming current edge training methods in energy efficiency.
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
In this type of attack, the goal of the attacker is to use poisoned updates to implant so-called backdoors into the learned model such that, at test time, the model's outputs can be fixed to a given target for certain inputs.
Large-scale semantic image annotation is a significant challenge for learning-based perception systems in robotics.
Our analyses consider (I) hundreds of Transformers trained in different settings, in which we systematically vary the amount of data, the model size and the optimization hyperparameters, (II) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including GPT2, BERT, etc., and (III) a total of 28 existing and novel generalization metrics.
1 code implementation • 28 Jan 2022 • Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations.
Inspired by recent computer vision results that find position-invariant attention performant in learning long-range relationships, our method, which we call GraphTrans, applies a permutation-invariant Transformer module after a standard GNN module.
Traditional computer vision models are trained to predict a fixed set of predefined categories.
Overparameterization is shown to result in poor test accuracy on rare subgroups under a variety of settings where subgroup information is known.
Meta-reinforcement learning (meta-RL) has proven to be a successful framework for leveraging experience from prior tasks to rapidly learn new related tasks, however, current meta-RL approaches struggle to learn in sparse reward environments.
We analyze NovelD thoroughly in MiniGrid and found that empirically it helps the agent explore the environment more uniformly with a focus on exploring beyond the boundary.
Grounding enables the model to retain syntax information from the input in thereby significantly improving generalization over complex inputs.
Goal-conditioned reinforcement learning (RL) can solve tasks in a wide range of domains, including navigation and manipulation, but learning to reach distant goals remains a central challenge to the field.
Viewing neural network models in terms of their loss landscapes has a long history in the statistical mechanics approach to learning, and in recent years it has received attention within machine learning proper.
First-order methods for quadratic optimization such as OSQP are widely used for large-scale machine learning and embedded optimal control, where many related problems must be rapidly solved.
We then present a new algorithm, Latent Space Safe Sets (LS3), which uses this representation for long-horizon tasks with sparse rewards.
no code implementations • 29 Jun 2021 • Priya Sundaresan, Jennifer Grannen, Brijen Thananjeyan, Ashwin Balakrishna, Jeffrey Ichnowski, Ellen Novoseller, Minho Hwang, Michael Laskey, Joseph E. Gonzalez, Ken Goldberg
We present two algorithms that enhance robust cable untangling, LOKI and SPiDERMan, which operate alongside HULK, a high-level planner from prior work.
Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function.
no code implementations • 14 Jun 2021 • Robert Avram, Jeffrey E. Olgin, Alvin Wan, Zeeshan Ahmed, Louis Verreault-Julien, Sean Abreau, Derek Wan, Joseph E. Gonzalez, Derek Y. So, Krishan Soni, Geoffrey H. Tison
Our results demonstrate that multiple purpose-built neural networks can function in sequence to accomplish the complex series of tasks required for automated analysis of real-world angiograms.
In this work, the decision-maker is given a deadline of $T$ rounds, where, on each round, it can adaptively choose which arms to pull and how many times to pull them; this distinguishes the number of decisions made (i. e., time or number of rounds) from the number of samples acquired (cost).
no code implementations • 4 Jun 2021 • Vainavi Viswanath, Jennifer Grannen, Priya Sundaresan, Brijen Thananjeyan, Ashwin Balakrishna, Ellen Novoseller, Jeffrey Ichnowski, Michael Laskey, Joseph E. Gonzalez, Ken Goldberg
Disentangling two or more cables requires many steps to remove crossings between and within cables.
In particular, we show that the "dot-product attention" that is the core of the Transformer's operation can be characterized as a kernel learning method on a pair of Banach spaces.
On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss.
Humans have a remarkable ability to make decisions by accurately reasoning about future events, including the future behaviors and states of mind of other agents.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image text pairs, 133x smaller than CLIP.
Building reliable object detectors that are robust to domain shifts, such as various changes in context, viewpoint, and object appearances, is critical for real-world applications.
A recent trend in computer vision is to replace convolutions with transformers.
Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use.
In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR.
Researchers and practitioners in the field of reinforcement learning (RL) frequently leverage parallel computation, which has led to a plethora of new algorithms and systems in the last few years.
no code implementations • 10 Nov 2020 • Jennifer Grannen, Priya Sundaresan, Brijen Thananjeyan, Jeffrey Ichnowski, Ashwin Balakrishna, Minho Hwang, Vainavi Viswanath, Michael Laskey, Joseph E. Gonzalez, Ken Goldberg
HULK successfully untangles a cable from a dense initial configuration containing up to two overhand and figure-eight knots in 97. 9% of 378 simulation experiments with an average of 12. 1 actions per trial.
Second, we present an algorithm for a fixed deadline setting, where we are given a time deadline and need to maximize the probability of finding the best arm.
Safety remains a central obstacle preventing widespread use of RL in the real world: learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting exploration.
We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance.
In this work, we propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play.
We explore learning pixelwise correspondences between images of deformable objects in different configurations.
The goal of continual learning (CL) is to learn a sequence of tasks without suffering from the phenomenon of catastrophic forgetting.
1 code implementation • 1 Sep 2020 • Sicheng Zhao, Xiangyu Yue, Shanghang Zhang, Bo Li, Han Zhao, Bichen Wu, Ravi Krishna, Joseph E. Gonzalez, Alberto L. Sangiovanni-Vincentelli, Sanjit A. Seshia, Kurt Keutzer
To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain.
Replacing BN with the recently-proposed Group Normalization (GN) can reduce gradient diversity and improve test accuracy.
Recent work learns contextual representations of source code by reconstructing tokens from their context.
Ranked #1 on Method name prediction on CodeSearchNet
Using these observations, we show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms.
Bird's-eye-view (BEV) is a powerful and widely adopted representation for road scenes that captures surrounding objects and their spatial locations, along with overall context in the scene.
Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches.
To address this, prior work combines neural networks with decision trees.
To address this, we present Neural Architecture-Recipe Search (NARS) to search both (a) architectures and (b) their corresponding training recipes, simultaneously.
Ranked #5 on Neural Architecture Search on ImageNet
We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint.
To that end, we first define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism.
We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands.
Ranked #68 on Neural Architecture Search on ImageNet
Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use.
no code implementations • 28 Mar 2020 • Aditya Ganapathi, Priya Sundaresan, Brijen Thananjeyan, Ashwin Balakrishna, Daniel Seita, Jennifer Grannen, Minho Hwang, Ryan Hoque, Joseph E. Gonzalez, Nawid Jamali, Katsu Yamane, Soshi Iba, Ken Goldberg
Robotic fabric manipulation is challenging due to the infinite dimensional configuration space, self-occlusion, and complex dynamics of fabrics.
Such a simple approach outperforms the meta-learning methods by roughly 2~20 points on current benchmarks and sometimes even doubles the accuracy of the prior methods.
Ranked #15 on Few-Shot Object Detection on MS-COCO (30-shot)
Sample-based learning model predictive control (LMPC) strategies have recently attracted attention due to their desirable theoretical properties and their good empirical performance on robotic tasks.
We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation.
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference.
Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech.
Sound Audio and Speech Processing
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).
We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies.
Many organizations wish to collaboratively train machine learning models on their combined datasets for a common benefit (e. g., better medical research, or fraud detection).
Existing on-policy imitation learning algorithms, such as DAgger, assume access to a fixed supervisor.
In this work, we propose a task-aware feature generation (TFG) framework for compositional learning, which generates features of novel visual concepts by transferring knowledge from previously seen concepts.
Reinforcement learning (RL) for robotics is challenging due to the difficulty in hand-engineering a dense cost function, which can lead to unintended behavior, and dynamical uncertainty, which makes exploration and constraint satisfaction challenging.
However, neural classifiers are often extremely brittle when confronted with domain shift---changes in the input distribution that occur over time.
We show that TAFE-Net is highly effective in generalizing to new tasks or concepts and evaluate the TAFE-Net on a range of benchmarks in zero-shot and few-shot learning.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
We model rate selection as a stochastic multi-armed bandit (MAB) problem, where a finite set of transmission rates are modeled as independent bandit arms.
no code implementations • 9 Feb 2019 • Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson
Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud.
Current trends in Machine Learning~(ML) inference on hardware accelerated devices (e. g., GPUs, TPUs) point to alarmingly low utilization.
Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you go manner.
Distributed, Parallel, and Cluster Computing Databases
The dominant cost in production machine learning workloads is not training individual models but serving predictions from increasingly complex prediction pipelines spanning multiple models, machine learning frameworks, and parallel hardware accelerators.
Distributed, Parallel, and Cluster Computing
We develop a multi-task convolutional neural network (CNN) to classify multiple diagnoses from 12-lead electrocardiograms (ECGs) using a dataset comprised of over 40, 000 ECGs, with labels derived from cardiologist clinical interpretations.
Models optimized for accuracy on single images are often prohibitively slow to run on each frame in a video.
Driven by advances in computer vision and the falling costs of camera hardware, organizations are deploying video cameras en masse for the spatial monitoring of their physical premises.
We show that this interface meets the requirements for a broad range of hyperparameter search algorithms, allows straightforward scaling of search to large clusters, and simplifies algorithm implementation.
Larger networks generally have greater representational power at the cost of increased computational complexity.
Convolutional networks optimized for accuracy on challenging, dense prediction tasks are prohibitively slow to run on each frame in a video.
By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
Designing neural networks for continuous-time stochastic processes is challenging, especially when observations are made irregularly.
Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation.
no code implementations • 15 Dec 2017 • Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph E. Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, Pieter Abbeel
With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production.
While deeper convolutional networks are needed to achieve maximum accuracy in visual perception tasks, for many inputs shallower networks are sufficient.
We explore how Deep Neural Networks can represent meta-policies that switch among a set of previously learned policies, specifically in settings where the dynamics of a new scenario are composed of a mixture of previously learned dynamics and where the state observation is possibly corrupted by sensing noise.
Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions.
In this paper, we introduce Clipper, a general-purpose low-latency prediction serving system.
Linear causal analysis is central to a wide range of important application spanning finance, the physical sciences, and engineering.
In this work, we present Velox, a new component of the Berkeley Data Analytics Stack.
Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging.
Research on distributed machine learning algorithms has focused primarily on one of two extremes - algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints.