Not Speeding Up Delta Decoding With Value Speculation
I came across this very cool article (1) about "value speculation", where you speculate on the value of a variable that's dependent on the previous iteration of the loop in hopes of eliminating the effect of data dependencies in the common case. I decided to give it a try for accelerating Delta decoding. Spoiler: it didn't work. But I still learned something, and maybe you will too.
Delta encoding is quite simple: rather than storing a sequence of values, store a sequence of differences between successive values. Notice that this scheme requires each loop iteration to keep some data around from the previous iteration, introducing a data dependency.
void DeltaDecode(uint32_t *compressed, uint32_t *uncompressed, size_t n) { uint32_t prev = 0; for(size_t i = 0; i < n; i++) { uint32_t delta = compressed[i]; uint32_t val = prev + delta; uncompressed[i] = val; prev = val; } }
What if we could somehow "guess" the value from the previous iteration? In my contrived example, suppose there's a 90% chance that each value is equal to some predicted value. In theory, we could speculate to the next loop iteration rather than wait on the load of the `compressed` value.
void DeltaDecode_Speculative( uint32_t *compressed, uint32_t *uncompressed, size_t n, uint32_t probably_value) { uint32_t prev = 0; for(size_t i = 0; i < n; i++) { uint32_t delta = compressed[i]; uint32_t val = prev + delta; if(val == probably_value) { uncompressed[i] = probably_value; prev = probably_value; } else { uncompressed[i] = val; prev = val; } } }
Unfortunately, this ends up being roughly twice as slow. The original takes about \(1.03\) cycles/value while the speculative takes \(2.6\) cycles/value. Perf explains why: although we mispredict very few branches, we simply execute way more instructions (22,510,310 insns vs 34,610,360 insns). Plus, although we may potentially skip a read from the cache, we still must write the result back, making the optimization moot.
And this makes me appreciate the details of the setup in original post: the value is easily predictable AND the loop is performing an aggregation. Since the result is kept in a register, it never touches the cache in the common case. And so I conclude that value speculation does not apply when the result is an array.
Source code available here.
References
[1] Mazzoli, Francesco, Beating the L1 cache with value speculation, 2021.