Search Results for author: Tim Kraska

Found 36 papers, 11 papers with code

PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design

no code implementations 8 Mar 2024 Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, Tim Kraska

Retrieval-augmented generation (RAG) can enhance the generation quality of large language models (LLMs) by incorporating external token databases.

Retrieval

Extract-Transform-Load for Video Streams

1 code implementation 7 Oct 2023 Ferdinand Kossmann, Ziniu Wu, Eugenie Lai, Nesime Tatbul, Lei Cao, Tim Kraska, Samuel Madden

We find that no current system sufficiently fulfills both needs and therefore propose Skyscraper, a system tailored to V-ETL.

Self-Driving Cars

SEED: Domain-Specific Data Curation With Large Language Models

no code implementations 1 Oct 2023 Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, Michael Cafarella

SEED uses these generated modules to process most of the data records and dynamically decides when the LLM should step in to directly process some individual records, possibly using the data-access modules to retrieve relevant information from the data sources to assist the LLM in solving the task.

Code Generation Imputation +1

FactorJoin: A New Cardinality Estimation Framework for Join Queries

no code implementations 11 Dec 2022 Ziniu Wu, Parimarjan Negi, Mohammad Alizadeh, Tim Kraska, Samuel Madden

Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of the join queries.

Attribute

LSI: A Learned Secondary Index Structure

1 code implementation 11 May 2022 Andreas Kipf, Dominik Horn, Pascal Pfeil, Ryan Marcus, Tim Kraska

LSI works by building a learned index over a permutation vector, which allows binary search to be performed on the unsorted base data using random access.
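
The permutation-vector idea in this snippet can be sketched as follows. This is a minimal illustration, not the LSI implementation: it leaves out the learned model entirely and runs a plain binary search over the whole permutation vector, and the names `build_permutation` and `lookup` are hypothetical.

```python
from typing import List, Optional, Sequence


def build_permutation(base: Sequence[int]) -> List[int]:
    """Return row ids of `base` ordered by key; `base` itself is not reordered."""
    return sorted(range(len(base)), key=base.__getitem__)


def lookup(base: Sequence[int], perm: Sequence[int], key: int) -> Optional[int]:
    """Binary-search the permutation vector, reading keys from `base` by random access.

    Returns the row id of a matching key in the unsorted base data, or None.
    """
    lo, hi = 0, len(perm)
    while lo < hi:
        mid = (lo + hi) // 2
        if base[perm[mid]] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(perm) and base[perm[lo]] == key:
        return perm[lo]
    return None


base = [42, 7, 19, 3, 88]          # unsorted base data
perm = build_permutation(base)     # row ids in key order
row = lookup(base, perm, 19)       # row id of key 19 in `base`
```

In a learned variant, a model would presumably narrow the `lo`/`hi` window before the search starts; that step is omitted here.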

Bounding the Last Mile: Efficient Learned String Indexing

no code implementations 29 Nov 2021 Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska

RSS achieves this by using the minimal string prefix needed to sufficiently distinguish the data, unlike most learned approaches, which index the entire string.

Towards Practical Learned Indexing

1 code implementation 11 Aug 2021 Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska

Latest research proposes to replace existing index structures with learned models.

Partitioned Learned Bloom Filters

no code implementations ICLR 2021 Kapil Vaidya, Eric Knorr, Michael Mitzenmacher, Tim Kraska

Bloom filters are space-efficient probabilistic data structures that are used to test whether an element is a member of a set, and may return false positives.
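
For readers unfamiliar with the data structure, here is a minimal sketch of a classic Bloom filter. It is not the partitioned learned variant from the paper. The class name, the sizing parameters, and the use of SHA-256 with double hashing are all assumptions made for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Classic Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
for word in ["alex", "tsunami", "cortex"]:
    bf.add(word)
```

Every inserted item is always reported as possibly present. An item that was never inserted is usually rejected, but it can be reported as present when all of its bit positions were set by other items.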

Learned Indexes for a Google-scale Disk-based Database

no code implementations 23 Dec 2020 Hussam Abu-Libdeh, Deniz Altınbüken, Alex Beutel, Ed H. Chi, Lyric Doshi, Tim Kraska, Xiaozhou Li, Andy Ly, Christopher Olston

There is great excitement about learned index structures, but understandable skepticism about the practicality of a new method uprooting decades of research on B-Trees.

Cortex: Harnessing Correlations to Boost Query Performance

no code implementations 12 Dec 2020 Vikram Nathan, Jialin Ding, Tim Kraska, Mohammad Alizadeh

Unlike prior work, Cortex can adapt itself to any existing primary index, whether single or multi-dimensional, to harness a broad variety of correlations, such as those that exist between more than two attributes or have a large number of outliers.

Attribute

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

no code implementations 23 Jun 2020 Jialin Ding, Vikram Nathan, Mohammad Alizadeh, Tim Kraska

Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse.

Partitioned Learned Bloom Filter

no code implementations 5 Jun 2020 Kapil Vaidya, Eric Knorr, Tim Kraska, Michael Mitzenmacher

Bloom filters are space-efficient probabilistic data structures that are used to test whether an element is a member of a set, and may return false positives.

RadixSpline: A Single-Pass Learned Index

no code implementations 30 Apr 2020 Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance.

Context-Aware Parse Trees

no code implementations 24 Mar 2020 Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Paul Petersen, Jesmin Jahan Tithi, Tim Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich

The simplified parse tree (SPT) presented in Aroma, a state-of-the-art code recommendation system, is a tree-structured representation used to infer code semantics by capturing program structure rather than program syntax.

ARDA: Automatic Relational Data Augmentation for Machine Learning

1 code implementation 21 Mar 2020 Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join.

BIG-bench Machine Learning Data Augmentation +2

Learning Multi-dimensional Indexes

no code implementations 3 Dec 2019 Vikram Nathan, Jialin Ding, Mohammad Alizadeh, Tim Kraska

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines.

SOSD: A Benchmark for Learned Indexes

1 code implementation 29 Nov 2019 Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

A groundswell of recent work has focused on improving data management systems with learned components.

Benchmarking Management

LISA: Towards Learned DNA Sequence Search

no code implementations 10 Oct 2019 Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md, Tim Kraska

Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics.

Sherlock: A Deep Learning Approach to Semantic Data Type Detection

2 code implementations 25 May 2019 Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery.

Column Type Annotation Vocal Bursts Type Prediction +1

ALEX: An Updatable Adaptive Learned Index

no code implementations 21 May 2019 Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yi-Nan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska

The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory footprint.

VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository

1 code implementation 12 May 2019 Kevin Hu, Neil Gaikwad, Michiel Bakker, Madelon Hulsebos, Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satyanarayan, Çağatay Demiralp

Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs.

Benchmarking

Unknown Examples & Machine Learning Model Generalization

no code implementations 24 Aug 2018 Yeounoh Chung, Peter J. Haas, Eli Upfal, Tim Kraska

Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data.

BIG-bench Machine Learning Selection bias

Automated Data Slicing for Model Validation: A Big Data - AI Integration Approach

no code implementations 16 Jul 2018 Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, Steven Euijong Whang

As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models.

Clustering Fairness +1

Smallify: Learning Network Size while Training

no code implementations 10 Jun 2018 Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, Samuel Madden

As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy.

IDEBench: A Benchmark for Interactive Data Exploration

1 code implementation 7 Apr 2018 Philipp Eichmann, Carsten Binnig, Tim Kraska, Emanuel Zgraggen

Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios.

Databases

A-Tree: A Bounded Approximate Index Structure

no code implementations 30 Jan 2018 Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska

At the core of our index is a tunable error parameter that allows a DBA to balance lookup performance and space consumption.

Databases

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

no code implementations 13 Jan 2018 Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska

Given the limited GPU DRAM, SuperNeurons not only provisions the memory necessary for training, but also dynamically allocates memory for convolution workspaces to achieve high performance.

Management Scheduling

The Case for Learned Index Structures

8 code implementations 4 Dec 2017 Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis

Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not.

Management Position
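
The "index as a model" view in the snippet above can be illustrated with a small sketch. It assumes a non-empty sorted list of distinct integer keys and fits a single least-squares line from key to position, then corrects the prediction with a bounded local search. The class `LinearLearnedIndex` and its methods are hypothetical names, and this is not the recursive model index proposed in the paper.

```python
import bisect
from typing import List, Optional


class LinearLearnedIndex:
    """One linear model mapping key -> position, plus a worst-case error bound."""

    def __init__(self, sorted_keys: List[int]) -> None:
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Largest distance between predicted and true position over all keys.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key: int) -> int:
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key: int) -> Optional[int]:
        n = len(self.keys)
        guess = self._predict(key)
        # Search only the window guaranteed to contain any stored key.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


keys = [2, 3, 5, 7, 11, 13, 17, 19]
index = LinearLearnedIndex(keys)
position = index.lookup(11)  # position of key 11 in the sorted array
```

The error bound is what makes the lookup exact: the model only has to be approximately right, because the final search covers every position it could be off by.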

TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries

no code implementations 31 Jan 2015 Evan R. Sparks, Ameet Talwalkar, Michael J. Franklin, Michael I. Jordan, Tim Kraska

The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces.

MLI: An API for Distributed Machine Learning

no code implementations 21 Oct 2013 Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph Gonzalez, Michael J. Franklin, Michael I. Jordan, Tim Kraska

MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing.

BIG-bench Machine Learning
