Ultrapacking: A Layer Over Bit-Packing

Table of Contents

1. Introduction
2. Ultrapacking
   2.1. Choosing \(k\)
3. Performance
4. Conclusion
5. References

1. Introduction

When dealing with integer arrays, applications often perform some form of lightweight compression. Columnar database engines, for instance, typically compress integer columns to both reduce storage space and increase effective memory bandwidth (i.e. scan more values per byte).

Lightweight compression is a very well-studied problem, with numerous variants on the four classic techniques: bitpacking, dictionary, delta, and frame-of-reference. Some research shows how well they compose, as in BtrBlocks [2]. Other research provides variants on these techniques that expose opportunities for vectorization or otherwise increase performance [1].

In this post, I want to explore a layer on top of bitpacking that provides additional space savings. Bitpacking works by storing values in the minimum number of bits possible: if we know that every element fits in the range \([0, n)\), and the largest value \(n - 1\) is representable in \(b\) bits, we can store a sequence of \(b\)-bit blocks contiguously in memory.
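
For concreteness, here is what scalar bitpacking might look like (a minimal sketch of my own, not the implementation benchmarked later): each value's low \(B\) bits are appended to a little-endian bit stream with shifts and masks.

#include <cstddef>
#include <cstdint>

// A minimal sketch of scalar bitpacking: append each value's
// low B bits to a little-endian bit stream.
template <size_t B>
void Bitpack(const uint32_t *input, size_t n, uint8_t *output)
{
    uint64_t acc = 0;    // pending bits, least-significant bits first
    size_t acc_bits = 0; // number of valid bits in acc
    for(size_t i = 0; i < n; i++)
    {
        acc |= static_cast<uint64_t>(input[i]) << acc_bits;
        acc_bits += B;
        while(acc_bits >= 8)
        {
            *output++ = static_cast<uint8_t>(acc);
            acc >>= 8;
            acc_bits -= 8;
        }
    }
    if(acc_bits > 0)
        *output++ = static_cast<uint8_t>(acc); // flush remaining bits
}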

But consider the case of \(n = 5\). With ordinary bitpacking, we need \(b = 3\) bits, despite \(3\) bits being able to hold \(8\) values, wasting \(\frac{3}{8}\) of the available code space. The fundamental issue is that bitpacking uses \(b = \lceil \log_2{n} \rceil\) bits per value. For small \(b\), this ceiling operation can waste a nontrivial amount of space (\(37.5\%\) in the case of \(n = 5\)). To address this problem, I propose a scheme called ultrapacking, which groups values together to use nearly \(\log_2{n}\) bits per value. The first plot below shows space requirements for bitpacking and ultrapacking as a function of \(n\). The one below that shows the percent space savings. Notice that the space savings drop to \(0\) when \(n\) is a power of \(2\), since in that case \(\lceil \log_2{n} \rceil = \log_2{n}\).
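
The plotted quantities are easy to reproduce. Here is a quick sketch (my own, not the script behind the plots) that treats the ultrapacked rate as the ideal \(\log_2{n}\):

#include <cmath>
#include <cstdio>

// Compare bitpacking's ceil(log2 n) bits per value against the ideal
// log2 n that ultrapacking approaches, and print the percent savings.
int main()
{
    for(int n = 2; n <= 16; n++)
    {
        double ideal = std::log2(static_cast<double>(n));
        double bitpacked = std::ceil(ideal);
        double savings = 100.0 * (1.0 - ideal / bitpacked);
        std::printf("n=%2d bitpack=%1.0f ideal=%.2f savings=%4.1f%%\n",
                    n, bitpacked, ideal, savings);
    }
    return 0;
}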

2. Ultrapacking

As a motivating example, suppose we have \(n = 10\) and we want to compress a sequence of three values, \([9, 0, 5]\). One way we could represent this sequence is with the number \(905\), which we obtain by treating the elements as digits and concatenating them. The maximum possible number in this scheme is \(999\), which takes \(10\) bits. In other words, any \(3\) values can be stored in \(10\) bits. So given a stream of these values, we can pack them into \(10\)-bit chunks and then bitpack those.

This gives an intuition for the compression scheme in its general form: given \(n\), take \(k\) elements from your input stream, \(a_1, \dots, a_k\), to form a bundle. Then, compute the number represented by concatenating the bundle into a base-\(n\) number \((a_1 a_2 \dots a_k)_n\). The computer will naturally store this number in binary using \(b = \lceil k \log_2{n} \rceil\) bits, which we can bitpack as normal.

Of course, since memory is byte-addressable, we can't just output \(b\) bits (\(10\) in our example). Instead, we must always output a multiple of \(8\) bits by packing multiple bundles into a group. The number of bits in a group is \(\mathrm{LCM}(b, 8)\), so a group contains \(\frac{\mathrm{LCM}(b, 8)}{b}\) bundles and occupies \(\frac{\mathrm{LCM}(b, 8)}{8}\) bytes.
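
As a sketch, the group parameters fall out directly from \(b\) (this uses C++17's std::lcm; the generated code below simply hardcodes the results):

#include <cmath>
#include <cstddef>
#include <numeric>

struct GroupParams
{
    size_t bundle_bits;       // b = ceil(k * log2(n))
    size_t bundles_per_group; // LCM(b, 8) / b
    size_t bytes_per_group;   // LCM(b, 8) / 8
};

GroupParams ComputeGroupParams(size_t n, size_t k)
{
    size_t b = static_cast<size_t>(
        std::ceil(static_cast<double>(k) * std::log2(static_cast<double>(n))));
    size_t group_bits = std::lcm(b, size_t{8});
    return {b, group_bits / b, group_bits / 8};
}

// ComputeGroupParams(10, 3) yields b = 10, 4 bundles per group, and
// 5 bytes per group, matching the hardcoded constants below.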

uint64_t Compress_Bundle_N_10_K_3(const uint32_t **input)
{
    constexpr size_t N = 10;
    constexpr size_t K = 3;
    uint64_t bundle = 0;
    // Concatenate K values as digits of a base-N number, first value most significant.
    for(size_t i = 0; i < K; i++)
    {
        bundle *= N;
        bundle += (*input)[i];
    }
    (*input) += K;
    return bundle;
}

void Compress_N_10_K_3(const uint32_t *input, size_t n, uint8_t *output)
{
    const uint32_t *end = input + n;
    constexpr uint64_t group_length = 4; // LCM(10, 8) = 40 bits = 5 bytes = 4 bundles per group
    constexpr uint64_t group_bytes = 5;

    while(input < end)
    {
        uint64_t group = 0;
        for(size_t i = 0; i < group_length; i++)
            group |= (Compress_Bundle_N_10_K_3(&input) << (10 * i));
        // Unaligned 64-bit store: writes 8 bytes but advances only 5, so the
        // output buffer needs a few slack bytes past the last group.
        *(uint64_t *)output = group;
        output += group_bytes;
    }
}

Decompression is exactly the inverse process, decomposing a base-\(n\) number into its constituent digits.

uint64_t Decompress_Bundle_N_10_K_3(uint64_t group, uint32_t **output)
{
    constexpr size_t N = 10;
    constexpr size_t K = 3;
    constexpr size_t B = 10; // bits per bundle
    constexpr uint64_t mask = (1ull << B) - 1;
    uint64_t bundle = group & mask;
    // Peel digits off least-significant first, filling the output back to front.
    for(int i = static_cast<int>(K) - 1; i >= 0; i--)
    {
        (*output)[i] = bundle % N;
        bundle /= N;
    }
    (*output) += K;
    return (group >> B);
}

void Decompress_N_10_K_3(const uint8_t *input, size_t n, uint32_t *output)
{
    const uint32_t *end = output + n;
    constexpr uint64_t group_length = 4; // LCM(10, 8) = 40 bits = 5 bytes = 4 bundles per group
    constexpr uint64_t group_bytes = 5;

    while(output < end)
    {
        // Unaligned 64-bit load: reads 8 bytes but advances only 5, so the
        // input buffer needs a few slack bytes past the last group.
        uint64_t group = *(const uint64_t *)input;
        for(size_t i = 0; i < group_length; i++)
            group = Decompress_Bundle_N_10_K_3(group, &output);
        input += group_bytes;
    }
}
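
A round trip through these two functions might look like the following sketch. Two details here are my own reading of the code above: the input length should be a multiple of the \(12\) values per group (\(4\) bundles of \(3\)), and the packed buffer needs a few slack bytes because each iteration loads and stores \(8\) bytes while only advancing \(5\).

#include <cstddef>
#include <cstdint>
#include <vector>

int main()
{
    constexpr size_t kNumValues = 1200; // a multiple of the 12 values per group
    std::vector<uint32_t> values(kNumValues);
    for(size_t i = 0; i < kNumValues; i++)
        values[i] = i % 10; // every value must lie in [0, 10)

    // 5 bytes per group of 12 values, plus slack for the final 8-byte store.
    std::vector<uint8_t> packed(kNumValues / 12 * 5 + 8);
    Compress_N_10_K_3(values.data(), kNumValues, packed.data());

    std::vector<uint32_t> roundtrip(kNumValues);
    Decompress_N_10_K_3(packed.data(), kNumValues, roundtrip.data());
    // roundtrip now holds the original values
}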

2.1. Choosing \(k\)

As we saw above, with \(n = 10\) and \(k = 3\), we have \(\frac{10}{3} \approx 3.33\) bits per value. Can we do better?

The absolute minimum we'd need is \(\log_2{10} \approx 3.32\), which is quite close. How did I find this \(k\)? Brute force! For a given \(n\), we simply iterate through every \(k\) and see which one gives the minimum bits per value. As a slight detail, to allow us to bitpack using unaligned 64-bit stores, we ensure that each bundle consumes at most 56 bits. We also bias the search slightly to prefer smaller bundles.

#include <cmath>

int ComputeK(int n)
{
    double bits_per_value = 10000; // sentinel: worse than any real candidate
    int bits_per_bundle = 0;
    int elts_per_bundle = 0;
    for(int num_bits = 1; num_bits <= 56; num_bits++)
    {
        for(int num_elts = 1; num_elts < 100; num_elts++)
        {
            // Stop once n^num_elts no longer fits in num_bits bits,
            // i.e. 2^num_bits < n^num_elts.
            if(std::log(2) * num_bits < std::log(n) * num_elts)
                break;
            double bpv = static_cast<double>(num_bits) / static_cast<double>(num_elts);
            // Require a meaningful improvement, biasing toward smaller bundles.
            if(bpv < bits_per_value - 0.05)
            {
                bits_per_value = bpv;
                bits_per_bundle = num_bits;
                elts_per_bundle = num_elts;
            }
        }
    }
    return elts_per_bundle;
}
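
For \(n = 10\), this search settles on \(k = 3\) with \(b = 10\), matching the example above: \(10^3 = 1000 \le 2^{10}\), and no bundle of at most \(56\) bits beats \(\frac{10}{3}\) bits per value.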

3. Performance

To test this idea, I made a program that generates compress and decompress functions for every \(n\) up to \(255\), and compared them against my implementation of bitpacking. Please note that this is all scalar code; further work could go into vectorizing it. I benchmarked compressing and decompressing an array of 1 million \(32\)-bit integers. Overall, compression performance is not bad: about 2-3x slower than ordinary bitpacking at the low end. The slowdown was expected, as compression and decompression involve multiplications, divisions, and mods that are much slower than simple shifts. Nevertheless, \(3\) cycles/value still works out to roughly a billion integers per second!
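
A harness along these lines reproduces the setup (a sketch of my own, assuming the \(n = 10\) functions above; absolute results depend on the machine and compiler):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    constexpr size_t kNumValues = 1'000'000 / 12 * 12; // whole groups only
    std::mt19937 rng(42);
    std::uniform_int_distribution<uint32_t> dist(0, 9);
    std::vector<uint32_t> values(kNumValues);
    for(auto &v : values) v = dist(rng);

    std::vector<uint8_t> packed(kNumValues / 12 * 5 + 8);
    std::vector<uint32_t> out(kNumValues);

    auto start = std::chrono::steady_clock::now();
    Compress_N_10_K_3(values.data(), kNumValues, packed.data());
    Decompress_N_10_K_3(packed.data(), kNumValues, out.data());
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.2f ns/value round trip\n", ns / kNumValues);
}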

4. Conclusion

Ultrapacking provides a way to eke out a bit more space savings over traditional bitpacking. Future work could include vectorizing the implementation and investigating filter and projection pushdown. As it stands, the space savings don't seem worth the performance hit. Still, I hope this experiment was at least mildly entertaining. Thanks for reading!

All code is available here.

5. References

[1] Daniel Lemire and Leonid Boytsov. Decoding Billions of Integers Per Second Through Vectorization. arXiv, 2012.

[2] Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. BtrBlocks: Efficient Columnar Compression for Data Lakes. Proceedings of the ACM on Management of Data, 2023.

Author: Sasha Krassovsky

Created: 2024-06-12 Wed 13:22
