Not Speeding Up Delta Decoding With Value Speculation

I came across this very cool article [1] about "value speculation", where you guess the value of a variable that depends on the previous iteration of a loop, in hopes of hiding the data dependency in the common case. I decided to give it a try for accelerating delta decoding. Spoiler: it didn't work. But I still learned something, and maybe you will too.

Delta encoding is quite simple: rather than storing a sequence of values, store the differences between successive values. Notice that decoding requires each loop iteration to carry the running value over from the previous iteration, introducing a loop-carried data dependency.

void DeltaDecode(uint32_t *compressed, uint32_t *uncompressed, size_t n)
{
    uint32_t prev = 0;
    for(size_t i = 0; i < n; i++)
    {
        uint32_t delta = compressed[i];
        uint32_t val = prev + delta;
        uncompressed[i] = val;
        prev = val;
    }
}
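For reference, encoding is just the inverse operation. The post only shows the decoder, so this is my sketch (the function name is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Inverse of DeltaDecode: store the difference between each value and
   the one before it. prev starts at 0, so the first delta equals the
   first value and DeltaDecode reconstructs the sequence exactly. */
void DeltaEncode(const uint32_t *uncompressed, uint32_t *compressed, size_t n)
{
    uint32_t prev = 0;
    for(size_t i = 0; i < n; i++)
    {
        compressed[i] = uncompressed[i] - prev;
        prev = uncompressed[i];
    }
}
```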

What if we could somehow "guess" the value from the previous iteration? In my contrived example, suppose there's a 90% chance that each decoded value is equal to some predicted value. In theory, the CPU could then proceed into the next loop iteration with the predicted value rather than wait on the load of the `compressed` value.

void DeltaDecode_Speculative(
    uint32_t *compressed,
    uint32_t *uncompressed,
    size_t n,
    uint32_t probably_value)
{
    uint32_t prev = 0;
    for(size_t i = 0; i < n; i++)
    {
        uint32_t delta = compressed[i];
        uint32_t val = prev + delta;
        if(val == probably_value)
        {
            uncompressed[i] = probably_value;
            prev = probably_value;
        }
        else
        {
            uncompressed[i] = val;
            prev = val;
        }
    }
}

Unfortunately, this ends up being roughly two and a half times as slow: the original takes about \(1.03\) cycles/value while the speculative version takes \(2.6\) cycles/value. `perf` explains why: although we mispredict very few branches, we simply execute way more instructions (22,510,310 for the original vs. 34,610,360 for the speculative version). Plus, although we may potentially skip a read from the cache, we still must write the result back, making the optimization moot.
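At minimum, the two routines should agree on their output. Here is a sketch of a harness I'd use to check that; `MakePredictableInput` is a hypothetical helper that builds deltas so that roughly 90% of decoded values hit the predicted value, matching the setup above:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The two decoders from the post, reproduced so this is self-contained. */
void DeltaDecode(uint32_t *compressed, uint32_t *uncompressed, size_t n)
{
    uint32_t prev = 0;
    for(size_t i = 0; i < n; i++)
    {
        uint32_t val = prev + compressed[i];
        uncompressed[i] = val;
        prev = val;
    }
}

void DeltaDecode_Speculative(uint32_t *compressed, uint32_t *uncompressed,
                             size_t n, uint32_t probably_value)
{
    uint32_t prev = 0;
    for(size_t i = 0; i < n; i++)
    {
        uint32_t val = prev + compressed[i];
        if(val == probably_value)
        {
            uncompressed[i] = probably_value;
            prev = probably_value;
        }
        else
        {
            uncompressed[i] = val;
            prev = val;
        }
    }
}

/* Hypothetical helper: emit deltas so ~90% of decoded values equal target. */
void MakePredictableInput(uint32_t *compressed, size_t n, uint32_t target)
{
    uint32_t prev = 0;
    for(size_t i = 0; i < n; i++)
    {
        uint32_t val = (rand() % 10 < 9) ? target : (uint32_t)rand();
        compressed[i] = val - prev; /* wraps modulo 2^32; decoding undoes it */
        prev = val;
    }
}
```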

And this makes me appreciate the details of the setup in the original post: the value is easily predictable AND the loop is performing an aggregation. Since the result is kept in a register, it never touches the cache in the common case. And so I conclude that value speculation does not help when every result must be stored to an array.
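For contrast, here is a sketch of the kind of loop where the trick from [1] pays off: summing a linked list whose nodes are usually adjacent in memory. The guess `node + 1` is pure address arithmetic with no load behind it, so on a correctly predicted branch the next iteration doesn't wait on `node->next`. This is my reconstruction of the technique, not code from the post; note that an optimizing compiler may collapse the two branch arms back into one, and the original post resorts to inline assembly to prevent exactly that.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Node
{
    uint64_t value;
    struct Node *next; /* usually points at the node right after this one */
} Node;

uint64_t SumWithSpeculation(const Node *node)
{
    uint64_t sum = 0;
    while(node)
    {
        sum += node->value;
        const Node *guess = node + 1;  /* computed without any load */
        const Node *next = node->next; /* the dependency-carrying load */
        if(next == guess)
            node = guess; /* predicted path: independent of the load */
        else
            node = next;  /* mispredicted: fall back to the loaded pointer */
    }
    return sum; /* accumulator stays in a register; no store per element */
}
```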

Source code available here.

References

[1] Mazzoli, Francesco. "Beating the L1 cache with value speculation." 2021.

Author: Sasha Krassovsky

Created: 2024-07-17 Wed 00:35
