4 code implementations • 6 Oct 2020 • John Keiser, Daniel Lemire
The majority of text is stored in UTF-8, which must be validated on ingestion.
Databases
9 code implementations • 17 Dec 2019 • Thomas Mueller Graf, Daniel Lemire
We find that xor filters can be faster than Bloom and cuckoo filters while using less memory.
Data Structures and Algorithms
6 code implementations • 22 Feb 2019 • Geoff Langdale, Daniel Lemire
We are thus motivated to make JSON parsing as fast as possible.
Databases Performance
2 code implementations • 5 Feb 2019 • Daniel Lemire, Owen Kaser, Nathan Kurz
Currently, the remainder of the division by a constant is computed from the quotient by a multiplication and a subtraction.
Mathematical Software Performance
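The conventional approach described in this excerpt can be sketched in a few lines. For contrast, the second function shows a direct computation that skips the quotient. The helper names, the constant `magic`, and the 32-bit input width are assumptions of this sketch and are not stated in the listing.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wraparound


def remainder_via_quotient(n, d):
    """Conventional route: compute the quotient, then multiply and subtract."""
    q = n // d
    return n - q * d


def precompute_magic(d):
    """Per-divisor constant (assumed form): floor((2**64 - 1) / d) + 1."""
    return MASK64 // d + 1


def remainder_direct(n, d, magic):
    """Direct route for 32-bit unsigned n and d: no quotient is formed."""
    low_bits = (magic * n) & MASK64  # keep only the low 64 bits
    return (low_bits * d) >> 64      # high 64 bits of a 128-bit product
```

The constant depends only on the divisor, so it would be computed once and reused across many remainders.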
3 code implementations • 28 May 2018 • Daniel Lemire
We review an unbiased function to generate ranged integers from a source of random words that avoids integer divisions with high probability.
Data Structures and Algorithms
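One way to generate ranged integers while usually avoiding a division is a multiply-and-shift with an occasional rejection step. The following is a minimal sketch under stated assumptions: the function name `bounded_random`, the `next_word` callback, and the default 32-bit word size are hypothetical choices made for illustration.

```python
def bounded_random(s, next_word, bits=32):
    """Return an unbiased integer in [0, s), given a source of random words.

    next_word is a hypothetical callback returning a uniform integer in
    [0, 2**bits). The modulo below runs only when the low bits fall under s.
    """
    if not 1 <= s <= (1 << bits):
        raise ValueError("s must be in [1, 2**bits]")
    mask = (1 << bits) - 1
    m = next_word() * s
    low = m & mask
    if low < s:                        # rare when s is much smaller than 2**bits
        threshold = (1 << bits) % s    # the only division-like operation
        while low < threshold:         # reject words that would add bias
            m = next_word() * s
            low = m & mask
    return m >> bits                   # high bits of the product
```

For example, `bounded_random(6, lambda: rng.getrandbits(32))` with `rng = random.Random()` would simulate a die roll numbered 0 to 5.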
1 code implementation • 28 Feb 2018 • Edmon Begoli, Jesús Camacho Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD.
Databases
6 code implementations • 25 Sep 2017 • Daniel Lemire, Nathan Kurz, Christoph Rupp
To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data.
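The idea of separating the control stream from the encoded data can be illustrated with a scalar sketch. It assumes 32-bit unsigned values, one control byte per group of four integers, and two bits per integer storing its byte length minus one. The function names and the exact bit layout are assumptions of this sketch, and no SIMD instructions are used.

```python
def streamvbyte_encode(values):
    """Encode 32-bit unsigned integers into (control, data) byte streams."""
    control = bytearray()
    data = bytearray()
    for start in range(0, len(values), 4):
        key = 0
        for slot, v in enumerate(values[start:start + 4]):
            if not 0 <= v < 2**32:
                raise ValueError("values must fit in 32 bits")
            nbytes = max(1, (v.bit_length() + 7) // 8)  # 1 to 4 bytes
            key |= (nbytes - 1) << (2 * slot)           # 2-bit length code
            data += v.to_bytes(nbytes, "little")
        control.append(key)
    return bytes(control), bytes(data)


def streamvbyte_decode(control, data, count):
    """Decode `count` integers from the two streams."""
    out = []
    pos = 0
    for i in range(count):
        key = control[i // 4]
        nbytes = ((key >> (2 * (i % 4))) & 3) + 1
        out.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return out
```

Because all the length information lives in the control stream, a decoder can read the lengths of four integers from one byte before touching the data bytes.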
15 code implementations • 22 Sep 2017 • Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai
Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries.
Databases
1 code implementation • 30 Mar 2017 • Wojciech Muła, Daniel Lemire
Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files.
Mathematical Software
3 code implementations • 23 Nov 2016 • Wojciech Muła, Nathan Kurz, Daniel Lemire
Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors).
Data Structures and Algorithms
no code implementations • 16 Nov 2016 • Daniel Lemire, Christoph Rupp
In particular, a differentially coded SIMD binary-packing technique (BP128) can offer a superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten).
Databases
1 code implementation • 30 Sep 2016 • Dmytro Ivanchykhin, Sergey Ignatchenko, Daniel Lemire
Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low when the hash function is picked at random.
Data Structures and Algorithms Cryptography and Security
13 code implementations • 21 Mar 2016 • Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser
To better handle these cases, we build a new Roaring hybrid that combines uncompressed bitmaps, packed arrays and RLE compressed segments.
Databases
2 code implementations • 11 Mar 2015 • Daniel Lemire, Owen Kaser
Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors.
Data Structures and Algorithms Performance
2 code implementations • 20 Feb 2015 • Jeff Plaisance, Nathan Kurz, Daniel Lemire
We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes.
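A minimal sketch of a VByte-style codec follows. It assumes the common convention of seven payload bits per byte, least significant group first, with the high bit set on every byte except the last. Conventions differ between implementations, so this layout and the function names are assumptions of the sketch.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as a variable-length byte sequence."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # 7 payload bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte, continuation bit clear
    return bytes(out)


def vbyte_decode(buf, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:                # continuation bit clear: last byte
            return value, pos
        shift += 7
```

Small integers take one byte, while larger ones grow by one byte for every seven additional bits.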
1 code implementation • 6 Feb 2015 • Wayne Xin Zhao, Xu-Dong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen
Compression algorithms are important for data oriented tasks, especially in the era of Big Data.
no code implementations • 26 Jan 2015 • Xiaodan Zhu, Peter Turney, Daniel Lemire, André Vellino
Unlike the conventional h-index, it weights citations by how many times a reference is mentioned.
2 code implementations • 8 Jan 2015 • Adina Crainiceanu, Daniel Lemire
This problem cannot be solved by just constructing a Bloom filter on the union of all the sets.
Databases Data Structures and Algorithms
14 code implementations • 26 Feb 2014 • Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin
On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections).
Databases
4 code implementations • 18 Feb 2014 • Owen Kaser, Daniel Lemire
Compressed bitmap indexes are used to speed up simple aggregate queries in databases.
Databases Data Structures and Algorithms
4 code implementations • 24 Jan 2014 • Daniel Lemire, Leonid Boytsov, Nathan Kurz
We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes.
Information Retrieval Databases Performance
2 code implementations • 10 Sep 2012 • Daniel Lemire, Leonid Boytsov
In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers.
3 code implementations • 9 Jul 2012 • Daniel Lemire, Owen Kaser, Eduardo Gutarra
For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression.
Databases H.4.0
4 code implementations • 22 Feb 2012 • Owen Kaser, Daniel Lemire
Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set.
Databases Data Structures and Algorithms
2 code implementations • 10 Aug 2010 • Daniel Lemire
Iterated hash functions process strings recursively, one character at a time.
Databases Data Structures and Algorithms
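As an illustration of a hash function that processes a string one character at a time, here is a minimal sketch of a polynomial iterated hash. The multiplier 31, the 32-bit width, and the function name are assumptions chosen for the example and are not taken from the listing.

```python
def iterated_hash(text, base=31, bits=32):
    """Fold each character into the running hash: h = (h * base + c) mod 2**bits."""
    mask = (1 << bits) - 1
    h = 0
    for ch in text:
        h = (h * base + ord(ch)) & mask
    return h
```

The hash of a string extended by one character depends only on the previous hash value and the new character, which is what makes the computation recursive.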
2 code implementations • 7 Sep 2009 • Daniel Lemire, Owen Kaser
Column-oriented indexes, such as projection or bitmap indexes, are compressed by run-length encoding to reduce storage and increase speed.
Databases
6 code implementations • 23 Jan 2009 • Daniel Lemire, Owen Kaser, Kamel Aouiche
Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage.
Databases
1 code implementation • 20 Nov 2008 • Daniel Lemire
We find that LB Improved-based search is faster.
no code implementations • 13 Jul 2007 • Owen Kaser, Daniel Lemire
We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed.
2 code implementations • 31 May 2007 • Daniel Lemire, Owen Kaser
We prove that recursive hash families cannot be more than pairwise independent.
1 code implementation • 24 Feb 2007 • Daniel Lemire, Anna Maclachlan
Rating-based collaborative filtering is the process of predicting how a user would rate a given item from other user ratings.
1 code implementation • 9 Oct 2006 • Daniel Lemire
The running maximum-minimum (max-min) filter computes the maxima and minima over running windows of size w. This filter has numerous applications in signal processing and time series analysis.
Data Structures and Algorithms F.2.1
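A running max-min filter over windows of size w can be sketched with two monotonic double-ended queues, which is one standard way to do it. The function name and the choice to return two lists are assumptions of this sketch, which is not presented as the paper's algorithm.

```python
from collections import deque


def running_max_min(values, w):
    """Return (maxima, minima) for every full window of size w over a list."""
    if w < 1:
        raise ValueError("w must be at least 1")
    maxima, minima = [], []
    upper = deque()  # indices whose values are strictly decreasing
    lower = deque()  # indices whose values are strictly increasing
    for i, v in enumerate(values):
        while upper and values[upper[-1]] <= v:
            upper.pop()                # dominated candidates for the maximum
        upper.append(i)
        while lower and values[lower[-1]] >= v:
            lower.pop()                # dominated candidates for the minimum
        lower.append(i)
        if upper[0] <= i - w:
            upper.popleft()            # front index left the window
        if lower[0] <= i - w:
            lower.popleft()
        if i >= w - 1:
            maxima.append(values[upper[0]])
            minima.append(values[lower[0]])
    return maxima, minima
```

Each index enters and leaves each queue at most once, so the total work is linear in the input length regardless of w.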
1 code implementation • 24 May 2006 • Daniel Lemire
We propose an adaptive time series model where the polynomial degree of each interval varies (constant, linear and so on).