Bit Packing With PackedArray
For a long time, I’ve been wanting to release my implementation of a random-access container capable of randomly accessing unsigned integers packed at the bit level.
When you want to hold an unordered sequence of unsigned integers in memory, the C programming language lets you choose among 4 data types: uint8_t, uint16_t, uint32_t, and uint64_t.
If your numbers are within the [0, 100000] range, only 17 bits per integer are needed since 2^17 = 131072. However, you can’t use an array of
uint16_t because 16 bits are not enough to store numbers between 65536 and 100000. When you use the next available type,
uint32_t, you’re wasting 15 bits per integer which represents a 47% overhead in terms of storage requirements.
PackedArray comes to the rescue, e.g. when you’re in desperate need of a denser in-memory representation.
PackedArray is a bit packer. Bit packing consists in saving memory by packing integers/items together at the bit-level:
PackedArray is backed by a
uint32_t buffer. Several items end up being stored inside the same buffer cell, e.g. i0, i1, and i2. Some items span two buffer cells, e.g. i3 and i7.
PackedArray is responsible for encoding/decoding items into/from the storage buffer.
Random access and dense data are the key features.
PackedArray is designed as a drop-in replacement for an unsigned integer array yet imposes the following requirements:
- you must know in advance how many bits are needed to hold a single item
- you must know in advance how many items you want to store
- when packing, behavior is undefined if items have more than the specified maximum bits per item
Instead of writing:
uint32_t* a = (uint32_t*)malloc(sizeof(uint32_t) * count);
...
value = a[i];
...
a[j] = value;

you write:

PackedArray* a = PackedArray_create(bitsPerItem, count);
...
value = PackedArray_get(a, i);
...
PackedArray_set(a, j, value);
There are also
PackedArray_pack and PackedArray_unpack that operate on several items in a row. Those two could really have been named PackedArray_write and
PackedArray_read, but I decided “pack” / “unpack” better conveys that something is happening under the hood.
// bulk packing / unpacking
PackedArray_pack(a, j, in, count);
PackedArray_unpack(a, j, out, count);

// the following are semantically equivalent
PackedArray_set(a, j, value);
PackedArray_pack(a, j, &value, 1);

value = PackedArray_get(a, i);
PackedArray_unpack(a, i, &value, 1);
Current implementations of
PackedArray_pack and PackedArray_unpack are unrolled and generated with the help of the C preprocessor. Unrolling brings the following micro-benchmark speed gains:
- 76%, MacBook Pro Mid 2010 (2.67 GHz Core i7 M 620)
- 50%, Samsung Galaxy Note (1.4 GHz ARM Cortex-A9)
- 79%, iPhone 5 (1.3 GHz ARM Apple A6)
Reference implementations can be found inside the
PackedArray.c file yet are compiled out unless compiling in self-test mode. Those reference implementations can conceptually be viewed as loops around PackedArray_set and PackedArray_get.
To better understand the unrolling happening in
PackedArray_pack and PackedArray_unpack, it is possible to generate a preprocessed version of
PackedArray.c; see the instructions in the repository.
PackedArray is released under the WTFPL v2 and is available on my GitHub account.
After having unrolled
PackedArray_pack and PackedArray_unpack, I started wondering about speeding them up further with SIMD instructions. Looking for prior art, I found what seems to be the most comprehensive study on the subject: Decoding billions of integers per second through vectorization by Daniel Lemire and Leonid Boytsov.
I initially tried to implement a SIMD version compatible with the non-SIMD one. This would involve independent shifts on the vector components (available with NEON, doable by multiplying with SSE2) as well as horizontal adds.
However, as described in the paper, it is possible to replace scalar shifts and logical operations (or, and, not) with their SIMD equivalent instructions. Packing with SIMD instructions processes integers 4 by 4 but imposes an interleaved layout in the storage buffer.
Interleaved layout, 13 bits per item:
As a consequence, the data layout of
PackedArraySIMD isn’t compatible with its non-SIMD counterpart. In other words, you cannot use
PackedArray to unpack data packed with
PackedArraySIMD or the other way around.
PackedArraySIMD.c implements packing at the bit-level using SSE2 or NEON instructions.
PackedArraySIMD compiled as self-bench exhibits the following speed improvements over PackedArray:
- 60%, MacBook Pro Mid 2010 (2.67 GHz Core i7 M 620)
- 16%, Samsung Galaxy Note (1.4 GHz ARM Cortex-A9)
- 22%, iPhone 5 (1.3 GHz ARM Apple A6)
Since this is a micro-benchmark, you will experience different behaviors depending on your platform, the amount of data you process, and the access pattern.
It is also worth noting the implementations of
PackedArraySIMD_pack and PackedArraySIMD_unpack require more plumbing than their non-SIMD counterparts. Additional computations are needed to find and adjust a data window that can be processed 4 by 4 with SIMD instructions.
PackedArray and PackedArraySIMD can serve as building blocks for more elaborate data structures. If you decide to build on top of them, I would appreciate it if you dropped me a note :).
In any case, feedback on the implementation and/or suggestions are greatly appreciated. In particular, I’m no SIMD expert, so if you find ways to further improve the implementation, please fork the GitHub project and send me a pull request.