Search Results for author: Daniel Lemire

Found 33 papers, 30 papers with code

Validating UTF-8 In Less Than One Instruction Per Byte

4 code implementations · 6 Oct 2020 · John Keiser, Daniel Lemire

The majority of text is stored in UTF-8, which must be validated on ingestion.

Databases

Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters

9 code implementations · 17 Dec 2019 · Thomas Mueller Graf, Daniel Lemire

We find that xor filters can be faster than Bloom and cuckoo filters while using less memory.

Data Structures and Algorithms

Parsing Gigabytes of JSON per Second

6 code implementations · 22 Feb 2019 · Geoff Langdale, Daniel Lemire

We are thus motivated to make JSON parsing as fast as possible.

Databases, Performance

Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries

2 code implementations · 5 Feb 2019 · Daniel Lemire, Owen Kaser, Nathan Kurz

Currently, the remainder of the division by a constant is computed from the quotient by a multiplication and a subtraction.

Mathematical Software, Performance
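The quotient-based computation mentioned in the abstract above can be sketched as follows, next to a direct computation that skips the quotient. This is a minimal illustration for 32-bit unsigned operands, not the authors' code: the names `magic`, `remainder_via_quotient` and `remainder_direct` are hypothetical, and Python integers stand in for fixed-width machine words.

```python
# Illustrative sketch; assumes 0 <= n < 2**32 and 1 <= d < 2**32.
MASK64 = (1 << 64) - 1


def magic(d: int) -> int:
    """Precompute ceil(2**64 / d) once for a fixed divisor d."""
    return MASK64 // d + 1


def remainder_via_quotient(n: int, d: int, m: int) -> int:
    """Conventional route: quotient first, then one multiplication and one subtraction."""
    q = (m * n) >> 64          # quotient n // d obtained from the precomputed reciprocal
    return n - q * d           # remainder recovered from the quotient


def remainder_direct(n: int, d: int, m: int) -> int:
    """Direct route: take the low 64 bits of m * n, then scale by d."""
    low = (m * n) & MASK64     # fractional part of n / d in 64-bit fixed point
    return (low * d) >> 64     # multiplying the fraction by d yields the remainder


m = magic(7)
print(remainder_via_quotient(100, 7, m), remainder_direct(100, 7, m))
```

Both functions reuse the same precomputed constant, so the per-divisor setup cost is paid once when the divisor is known ahead of time.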

Fast Random Integer Generation in an Interval

3 code implementations · 28 May 2018 · Daniel Lemire

We review an unbiased function to generate ranged integers from a source of random words that avoids integer divisions with high probability.

Data Structures and Algorithms
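A rough sketch of how ranged integers can be drawn from 64-bit random words while only rarely dividing is shown below. It follows the multiply-and-reject idea under stated assumptions: `bounded` and `next_word` are hypothetical names, Python integers emulate 64-bit words, and the `%` operator is reached only on the uncommon slow path.

```python
import random

MASK64 = (1 << 64) - 1


def bounded(s, next_word):
    """Return an unbiased integer in [0, s), assuming 1 <= s < 2**64.

    next_word is any callable returning uniformly random 64-bit integers.
    """
    m = next_word() * s            # 128-bit product
    low = m & MASK64               # low 64 bits decide whether to reject
    if low < s:                    # rare when s is much smaller than 2**64
        threshold = (1 << 64) % s  # the only division, on the slow path
        while low < threshold:     # reject the few words that would cause bias
            m = next_word() * s
            low = m & MASK64
    return m >> 64                 # high 64 bits lie in [0, s)


rng = random.Random(2018)          # seeded only to make the example repeatable
rolls = [bounded(6, lambda: rng.getrandbits(64)) for _ in range(10)]
print(rolls)
```

The fast path costs one multiplication and one comparison; the rejection loop exists only to remove the small bias that a plain multiply-and-shift would leave.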

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

1 code implementation · 28 Feb 2018 · Edmon Begoli, Jesús Camacho Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD.

Databases

Stream VByte: Faster Byte-Oriented Integer Compression

6 code implementations · 25 Sep 2017 · Daniel Lemire, Nathan Kurz, Christoph Rupp

To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data.

Roaring Bitmaps: Implementation of an Optimized Software Library

15 code implementations · 22 Sep 2017 · Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai

Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries.

Databases

Faster Base64 Encoding and Decoding Using AVX2 Instructions

1 code implementation · 30 Mar 2017 · Wojciech Muła, Daniel Lemire

Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files.

Mathematical Software

Faster Population Counts Using AVX2 Instructions

3 code implementations · 23 Nov 2016 · Wojciech Muła, Nathan Kurz, Daniel Lemire

Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors).

Data Structures and Algorithms

Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions

no code implementations · 16 Nov 2016 · Daniel Lemire, Christoph Rupp

In particular, a differentially coded SIMD binary-packing technique (BP128) can offer a superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten).

Databases

Regular and almost universal hashing: an efficient implementation

1 code implementation · 30 Sep 2016 · Dmytro Ivanchykhin, Sergey Ignatchenko, Daniel Lemire

Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low given that we pick hash functions at random.

Data Structures and Algorithms, Cryptography and Security

Consistently faster and smaller compressed bitmaps with Roaring

13 code implementations · 21 Mar 2016 · Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser

To better handle these cases, we build a new Roaring hybrid that combines uncompressed bitmaps, packed arrays and RLE compressed segments.

Databases

Faster 64-bit universal hashing using carry-less multiplications

2 code implementations · 11 Mar 2015 · Daniel Lemire, Owen Kaser

Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors.

Data Structures and Algorithms, Performance

Vectorized VByte Decoding

2 code implementations · 20 Feb 2015 · Jeff Plaisance, Nathan Kurz, Daniel Lemire

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes.
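To make the variable-length byte representation concrete, here is a small scalar sketch. It assumes one common convention (7 payload bits per byte, least significant group first, high bit set on every byte except the last of an integer); other VByte variants flip the meaning of the high bit. The names `vbyte_encode` and `vbyte_decode` are hypothetical, and the sketch does not attempt the vectorized decoding the paper is about.

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit set: more bytes follow
            v >>= 7
        out.append(v)                      # high bit clear: last byte of this integer
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift  # place the 7-bit group
        if byte & 0x80:
            shift += 7                     # continuation: next group is 7 bits higher
        else:
            values.append(current)         # integer complete
            current, shift = 0, 0
    return values


encoded = vbyte_encode([5, 300, 70000])
print(len(encoded), vbyte_decode(encoded))
```

Small integers take a single byte, which is why the scheme pays off on arrays dominated by small values such as differences between sorted identifiers.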

A General SIMD-based Approach to Accelerating Compression Algorithms

1 code implementation · 6 Feb 2015 · Wayne Xin Zhao, Xu-Dong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen

Compression algorithms are important for data oriented tasks, especially in the era of Big Data.

Measuring academic influence: Not all citations are equal

no code implementations · 26 Jan 2015 · Xiaodan Zhu, Peter Turney, Daniel Lemire, André Vellino

Unlike the conventional h-index, it weights citations by how many times a reference is mentioned.

feature selection

Bloofi: Multidimensional Bloom Filters

2 code implementations · 8 Jan 2015 · Adina Crainiceanu, Daniel Lemire

This problem cannot be solved by just constructing a Bloom filter on the union of all the sets.

Databases, Data Structures and Algorithms

Better bitmap performance with Roaring bitmaps

14 code implementations · 26 Feb 2014 · Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin

On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections).

Databases

Compressed bitmap indexes: beyond unions and intersections

4 code implementations · 18 Feb 2014 · Owen Kaser, Daniel Lemire

Compressed bitmap indexes are used to speed up simple aggregate queries in databases.

Databases, Data Structures and Algorithms

SIMD Compression and the Intersection of Sorted Integers

4 code implementations · 24 Jan 2014 · Daniel Lemire, Leonid Boytsov, Nathan Kurz

We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes.

Information Retrieval, Databases, Performance

Decoding billions of integers per second through vectorization

2 code implementations · 10 Sep 2012 · Daniel Lemire, Leonid Boytsov

In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers.

Reordering Rows for Better Compression: Beyond the Lexicographic Order

3 code implementations · 9 Jul 2012 · Daniel Lemire, Owen Kaser, Eduardo Gutarra

For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression.

Databases, H.4.0

Strongly universal string hashing is fast

4 code implementations · 22 Feb 2012 · Owen Kaser, Daniel Lemire

Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set.

Databases, Data Structures and Algorithms

The universality of iterated hashing over variable-length strings

2 code implementations · 10 Aug 2010 · Daniel Lemire

Iterated hash functions process strings recursively, one character at a time.

Databases, Data Structures and Algorithms
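As an illustration of what processing a string "one character at a time" looks like, the sketch below uses a plain polynomial hash. It is only one example of an iterated hash function and is not taken from the paper; the name `iterated_hash`, the base 31 and the modulus 2**61 - 1 are arbitrary assumptions chosen for the example.

```python
MODULUS = (1 << 61) - 1  # assumed prime modulus, chosen for illustration
BASE = 31                # assumed multiplier, chosen for illustration


def iterated_hash(text, base=BASE, modulus=MODULUS):
    """Fold each character into a running hash value."""
    h = 0
    for ch in text:
        # The new state depends only on the previous state and the next character.
        h = (h * base + ord(ch)) % modulus
    return h


print(iterated_hash("ab"))  # 97 * 31 + 98 = 3105
```

Because each step needs only the previous state, the hash of a string can be extended by one character in constant time.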

Reordering Columns for Smaller Indexes

2 code implementations · 7 Sep 2009 · Daniel Lemire, Owen Kaser

Column-oriented indexes, such as projection or bitmap indexes, are compressed by run-length encoding to reduce storage and increase speed.

Databases

Sorting improves word-aligned bitmap indexes

6 code implementations · 23 Jan 2009 · Daniel Lemire, Owen Kaser, Kamel Aouiche

Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage.

Databases

Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

no code implementations · 13 Jul 2007 · Owen Kaser, Daniel Lemire

We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed.

Recursive n-gram hashing is pairwise independent, at best

2 code implementations · 31 May 2007 · Daniel Lemire, Owen Kaser

We prove that recursive hash families cannot be more than pairwise independent.

Slope One Predictors for Online Rating-Based Collaborative Filtering

1 code implementation · 24 Feb 2007 · Daniel Lemire, Anna Maclachlan

Rating-based collaborative filtering is the process of predicting how a user would rate a given item from other user ratings.

Collaborative Filtering
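The prediction task described above can be sketched with a weighted Slope One style predictor: average the rating differences between item pairs over users who rated both, then apply those differences to the target user's known ratings. This is a minimal sketch, not the authors' implementation; the function name `slope_one_predict` and the toy ratings are made up for illustration.

```python
def slope_one_predict(ratings, user, target):
    """Predict ratings[user][target] from average pairwise item differences.

    ratings maps each user to a dict of {item: rating}.
    Returns None when no co-rated item pair is available.
    """
    numerator, denominator = 0.0, 0
    for item, rating in ratings[user].items():
        if item == target:
            continue
        # Differences (target - item) over every user who rated both items.
        diffs = [r[target] - r[item] for r in ratings.values()
                 if target in r and item in r]
        if not diffs:
            continue
        deviation = sum(diffs) / len(diffs)
        # Weight each item's vote by the number of users supporting it.
        numerator += (deviation + rating) * len(diffs)
        denominator += len(diffs)
    return numerator / denominator if denominator else None


toy = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
print(slope_one_predict(toy, "lucy", "A"))  # (2.5 * 2 + 8 * 1) / 3, about 4.33
```

Weighting by the number of co-rating users keeps an item pair seen only once from counting as much as a pair supported by many users.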

Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element

1 code implementation · 9 Oct 2006 · Daniel Lemire

The running maximum-minimum (max-min) filter computes the maxima and minima over running windows of size w. This filter has numerous applications in signal processing and time series analysis.

Data Structures and Algorithms, F.2.1
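For reference, a running max-min filter over windows of size w can be sketched with two monotonic deques. This is the standard textbook approach and makes no attempt to match the three-comparisons-per-element bound from the paper's title; the name `running_max_min` is hypothetical.

```python
from collections import deque


def running_max_min(a, w):
    """Return (maxima, minima) over every full window of w consecutive elements."""
    max_q, min_q = deque(), deque()   # indices; values decreasing / increasing
    maxima, minima = [], []
    for i, x in enumerate(a):
        while max_q and a[max_q[-1]] <= x:
            max_q.pop()               # x dominates older, smaller candidates
        max_q.append(i)
        while min_q and a[min_q[-1]] >= x:
            min_q.pop()               # x dominates older, larger candidates
        min_q.append(i)
        if max_q[0] <= i - w:
            max_q.popleft()           # front index slid out of the window
        if min_q[0] <= i - w:
            min_q.popleft()
        if i >= w - 1:                # first full window ends at index w - 1
            maxima.append(a[max_q[0]])
            minima.append(a[min_q[0]])
    return maxima, minima


print(running_max_min([1, 3, 2, 5, 4, 0], 3))
# ([3, 5, 5, 5], [1, 2, 2, 0])
```

Each index enters and leaves each deque at most once, so the total work is linear in the input length regardless of w.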

A Better Alternative to Piecewise Linear Time Series Segmentation

1 code implementation · 24 May 2006 · Daniel Lemire

We propose an adaptive time series model where the polynomial degree of each interval varies (constant, linear and so on).

Time Series, Time Series Analysis
