Bit Packing With PackedArray
For a long time, I've been wanting to release my implementation of a random access container that stores unsigned integers packed at the bit level.
Meet PackedArray.
When you want to hold an unordered sequence of unsigned integers in memory, the C programming language lets you choose among 4 data types:

- `uint8_t`
- `uint16_t`
- `uint32_t`
- `uint64_t`
If your numbers are within the [0, 100000] range, only 17 bits per integer are needed, since 2^17 = 131072. However, you can't use an array of `uint16_t` because 16 bits are not enough to store numbers between 65536 and 100000. And when you use the next available type, `uint32_t`, you're wasting 15 bits per integer, which represents a 47% overhead in terms of storage requirements.
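For the curious, here is how such a bit count can be computed; a tiny helper of my own (the name and placement are hypothetical, it is not part of any library shown here):

```c
#include <stdint.h>

// Smallest number of bits able to represent every value in [0, maxValue].
// For maxValue = 100000 this returns 17, since 2^17 = 131072 > 100000.
static uint32_t bitsNeeded(uint32_t maxValue)
{
  uint32_t bits = 1;
  while (maxValue >>= 1)
    ++bits;
  return bits;
}
```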
`PackedArray` comes to the rescue when, say, you're in desperate need of a `uint9_t` or `uint17_t` array. `PackedArray` is a bit packer. Bit packing consists of saving memory by packing integers/items together at the bit level:
```
buffer: |     b0     |       b1       | b2 …
items:  |i0|i1|i2|i3/|/i3|i4|i5|i6|i7/|/i7|i8|i9| …
```

(9 bits per item shown; "/" marks an item straddling two 32-bit cells)
A `PackedArray` is backed by a `uint32_t` buffer. Several items end up being stored inside the same buffer cell, e.g. i0, i1, and i2. Some items span two buffer cells, e.g. i3 and i7. `PackedArray` is responsible for encoding/decoding items into/from the storage buffer.
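To make that concrete, here is a minimal sketch of what reading one item boils down to. This is my own illustration of the technique, not PackedArray's actual code (which, as described below, is unrolled and generated):

```c
#include <stdint.h>

// Conceptual read of item i from a buffer of 32-bit cells,
// with items packed back to back at the bit level.
static uint32_t get(const uint32_t* buffer, uint32_t bitsPerItem, uint32_t i)
{
  uint64_t bitPosition = (uint64_t)i * bitsPerItem;
  uint32_t cell = (uint32_t)(bitPosition / 32);
  uint32_t offset = (uint32_t)(bitPosition % 32);
  uint32_t mask = (uint32_t)((1ull << bitsPerItem) - 1);

  uint32_t value = buffer[cell] >> offset;
  if (offset + bitsPerItem > 32) // the item straddles two cells
    value |= buffer[cell + 1] << (32 - offset);

  return value & mask;
}
```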
Discussion
`PackedArray`'s main purpose is not data serialization or message encoding: for that, one would typically investigate fast compression algorithms like LZ4 or Snappy. Random access and dense data are the key features.

`PackedArray` is designed as a drop-in replacement for an unsigned integer array, yet imposes the following requirements:
- you must know in advance how many bits are needed to hold a single item
- you must know in advance how many items you want to store
- when packing, behavior is undefined if an item needs more bits than the specified bits per item (see the masking sketch below)
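If you cannot guarantee that last point, masking on the caller's side keeps packing well defined; a hypothetical guard:

```c
// Hypothetical caller-side guard: keep only the low bitsPerItem bits so
// the value fits within the declared item width before packing.
uint32_t mask = (uint32_t)((1ull << bitsPerItem) - 1);
PackedArray_set(a, j, value & mask);
```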
Instead of writing:
```c
uint32_t* a = (uint32_t*)malloc(sizeof(uint32_t) * count);
...
value = a[i];
...
a[j] = value;
```
You write:
```c
PackedArray* a = PackedArray_create(bitsPerItem, count);
...
value = PackedArray_get(a, i);
...
PackedArray_set(a, j, value);
```
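Putting it together, a complete round trip might look like this; I'm assuming the header is `PackedArray.h` and that a `PackedArray_destroy` mirrors `PackedArray_create`:

```c
#include <assert.h>
#include <stdint.h>

#include "PackedArray.h" // assumed header name

int main(void)
{
  // 17 bits per item are enough for values in [0, 100000]
  PackedArray* a = PackedArray_create(17, 1000);

  PackedArray_set(a, 42, 100000);
  uint32_t value = PackedArray_get(a, 42);
  assert(value == 100000);

  PackedArray_destroy(a); // assumed counterpart of PackedArray_create
  return 0;
}
```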
There are also `PackedArray_pack` and `PackedArray_unpack`, which operate on several items in a row. Those two could really have been named `PackedArray_write` and `PackedArray_read`, but I decided “pack” / “unpack” better conveys that something is happening under the hood.
```c
// bulk packing / unpacking
PackedArray_pack(a, j, in, count);
PackedArray_unpack(a, j, out, count);

// the following are semantically equivalent
PackedArray_set(a, j, value);
PackedArray_pack(a, j, &value, 1);

value = PackedArray_get(a, i);
PackedArray_unpack(a, i, &value, 1);
```
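For example, a bulk round trip through 9-bit items (every value must fit in 9 bits, i.e. in [0, 511]):

```c
uint32_t in[8]  = { 0, 1, 2, 3, 255, 256, 510, 511 }; // all fit in 9 bits
uint32_t out[8] = { 0 };

PackedArray* a = PackedArray_create(9, 8);
PackedArray_pack(a, 0, in, 8);
PackedArray_unpack(a, 0, out, 8);
// out now holds the same 8 values as in
PackedArray_destroy(a);
```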
Current implementations of `PackedArray_pack` and `PackedArray_unpack` are unrolled and generated with the help of the C preprocessor. Unrolling brings the following micro-benchmark speed gains (compiled with `-O2`):
- 76%, MacBook Pro Mid 2010 (2.67 GHz Core i7 M 620)
- 50%, Samsung Galaxy Note (1.4 GHz ARM Cortex-A9)
- 79%, iPhone 5 (1.3 GHz ARM Apple A6)
Reference implementations can be found inside the `PackedArray.c` file, yet they are compiled out unless building in self-test mode. Those reference implementations can conceptually be viewed as loops around `PackedArray_set` and `PackedArray_get`.
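In other words, the reference packing routine behaves like this conceptual loop (the name below is mine, for illustration):

```c
// Conceptual reference implementation: pack count items one by one.
// The unpacking counterpart is the same loop around PackedArray_get.
static void pack_reference(PackedArray* a, uint32_t offset,
                           const uint32_t* in, uint32_t count)
{
  for (uint32_t i = 0; i < count; ++i)
    PackedArray_set(a, offset + i, in[i]);
}
```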
To better understand the unrolling happening in `PackedArray_pack` and `PackedArray_unpack`, it is possible to generate a preprocessed version of `PackedArray.c`; see the instructions in `README.md`.
`PackedArray` is released under the WTFPL v2 and is available on my GitHub account.
Going SIMD
After having unrolled `PackedArray_pack` and `PackedArray_unpack`, I started wondering about speeding them up further with SIMD instructions. Looking for prior art, I found what seems to be the most comprehensive study on the subject: Decoding billions of integers per second through vectorization by Daniel Lemire and Leonid Boytsov.
I initially tried to implement a SIMD version compatible with the non-SIMD one. This would involve independent shifts on the vector components (available with NEON, doable by multiplying with SSE2) as well as horizontal adds.
However, as described in the paper, it is possible to replace scalar shifts and logical operations (or, and, not) with their SIMD equivalent instructions. Packing with SIMD instructions processes integers 4 by 4 but imposes an interleaved layout in the storage buffer.
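For intuition, the core packing step then becomes a plain shift + OR applied to four 32-bit cells at once. Here is a minimal SSE2 sketch of that idea, my own illustration rather than `PackedArraySIMD`'s actual code (it glosses over items straddling cell boundaries):

```c
#include <emmintrin.h> // SSE2
#include <stdint.h>

// OR the next 4 items into 4 accumulator lanes (one 32-bit cell per lane),
// all shifted left by the same bit offset: the SIMD equivalent of the
// scalar shift/OR performed when packing items one at a time.
static inline __m128i packStep(__m128i cells, const uint32_t* in, int bitOffset)
{
  __m128i items = _mm_loadu_si128((const __m128i*)in); // load 4 items
  items = _mm_slli_epi32(items, bitOffset);            // same shift on all 4 lanes
  return _mm_or_si128(cells, items);                   // merge into the 4 cells
}
```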
Interleaved layout, 13 bits per item:

```
buffer: |     b0     |     b1     |      b2     |      b3     | b4 …
items:  |i0 |i4 |i8a |i1 |i5 |i9a |i2 |i6 |i10a |i3 |i7 |i11a |i8b| …
```

Each 32-bit cell holds two full 13-bit items plus the first 6 bits of a third (e.g. i8a); the remaining 7 bits (i8b) land at the start of the next group of 4 cells.
As a consequence, the data layout of `PackedArraySIMD` isn't compatible with its non-SIMD counterpart. In other words, you cannot use `PackedArray` to unpack data packed with `PackedArraySIMD`, or the other way around.
`PackedArraySIMD.c` implements packing at the bit level using SSE2 or NEON instructions. `PackedArraySIMD` compiled as self-bench exhibits the following speed improvements over `PackedArray`:
- 60%, MacBook Pro Mid 2010 (2.67 GHz Core i7 M 620)
- 16%, Samsung Galaxy Note (1.4 GHz ARM Cortex-A9)
- 22%, iPhone 5 (1.3 GHz ARM Apple A6)
Since this is a micro-benchmark, you will experience different behaviors depending on your platform, the amount of data you process, and the access pattern.
It is also worth noting that the implementations of `PackedArraySIMD_pack` and `PackedArraySIMD_unpack` require more plumbing than their non-SIMD counterparts: additional computations are needed to find and adjust a data window that can be processed 4 by 4 with SIMD instructions.
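As a rough sketch of that plumbing (entirely hypothetical; the real bookkeeping also has to respect the interleaved layout), the range [j, j + count) gets split into a scalar head and tail around a middle chunk whose size is a multiple of 4:

```c
// Hypothetical window split: the middle chunk is processed 4 items at a
// time with SIMD instructions, head and tail fall back to scalar code.
uint32_t head = (4 - (j % 4)) % 4;      // items before a 4-aligned index
if (head > count)
  head = count;
uint32_t middle = (count - head) & ~3u; // largest multiple of 4
uint32_t tail = count - head - middle;  // leftover items
```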
What’s next?
`PackedArray` and `PackedArraySIMD` serve as building blocks for more elaborate data structures. If you decide to build on top of them, I would appreciate it if you dropped me a note :).
In any case, feedback on the implementation and/or suggestions are greatly appreciated. In particular, I'm no SIMD expert, so if you find ways to further improve the implementation, please fork the GitHub project and send me a pull request.