Back to main page

SIMD/SWAR


Back to main page

Contents

Comments or questions? E-mail: info@sirrida.de


(To top) Intro

This is a description of SIMD / SWAR operations as well as some routines performing them. You can download an implementation of the routines.
The library was partly inspired by the proposal of standardizing bit operations in C++ and could serve as such.
Some stuff found on other places of this site is repeated here. The routines described here usually have an additional parameter which describes the subword size.

During the history of processors the register sizes have been increased several times, usually by doubling the size. For example, the first microprocessor Intel 4004 (1971) had registers of a size of only 4 bit. Later cousins had 8, 16, 32, and 64 bit register sizes.
While these wide registers allow for huge numbers, very often only a small subrange of these numbers are really needed thus wasting memory, processing capacity and power usage.
On the other hand many operations act on several elements in the same way. In order to cope with this, some architectures sported some kind of mini-vectors such as 3d-now (AMD), MMX, SSE, AVX (Intel) or Altivec (PowerPC).

There are means to simulate some these functions with a fair amount of overhead (but without accessing each subelement one by one) as is done in my library. The emulation is most effective for small subwords within large words as the overhead per element is minimized.
The acronym "SWAR" was coined in fall 1996 by Henry G. (Hank) Dietz and Randall J. Fisher as an abbreviation to "SIMD within a register" where "SIMD" means "single instruction, multiple data".
Some operations such as the binary operations NOT, AND, OR and XOR are inherently working in parallel on arbitrary subwords since there is no interaction between bits on different places. However most others such as addition have to deal with such interactions.

As an example we work out how to get addition to work on subwords without letting carries from one subword disturb other subwords which will occur for an unprepared addition of whole words.

3 bit addition:

+ 000 001 010 011 100 101 110 111
000 000 001 010 011 100 101 110 111
001 001 010 011 100 101 110 111 1000
010 010 011 100 101 110 111 1000 1001
011 011 100 101 110 111 1000 1001 1010
100 100 101 110 111 1000 1001 1010 1011
101 101 110 111 1000 1001 1010 1011 1100
110 110 111 1000 1001 1010 1011 1100 1101
111 111 1000 1001 1010 1011 1100 1101 1110

As you may notice there is no carry to the next subword (overflow, extra leftmost digit) when both operands have their MSB clear (yellow quadrant).
One approach is to simply use these smaller values, i.e. set the MSB always to zero ("spacer bits") before and after the addition. This is very fast and easy but we loose these spacer bits for calculation. More interesting is utilizing these bits too at the expense of some more effort.
The 3 other quadrants can be constructed to be without overflow from the simple one by XORing the result's MSB with the MSB of both operands since XOR is a carry-less addition of single bits.
Amazingly, this even works when we additionally add 1, i.e. a+b+1. We also call this 1 "carry" (c). This extra 1 can be utilized e.g. for subtraction (see below).
Thus we may write the expected sum without carries between subwords as follows (M: MSB mask = 100, xx: mask of mantissa = 011):
simple_sum = (a & xx) + (b & xx) + c
sum = simple_sum ^ (a & M) ^ (b & M) = simple_sum ^ ((a ^ b) & M)

We can thus perform multiple additions at once on disjunct bit fields such as 8 4-bit-fields within a 32 bit word.
All we need to let this kind of calculation create the expected results is the relevant masks such as MSB. In this case even irregular subword partitions are possible which might be of interest, e.g. for holding dimension vectors or color values ("Hi-Color": 5+6+5 bits) of pixels.

By the way: a-b can be calculated as a+(-b) = a+(~b+1) = a+~b+1 using the algorithm just described.
Moreover, a-b-1 = a+(-b)-1 = a+(~b+1)-1 = a+~b.


(To top) Overview

I have started to provide a quite comprehensive library of such functions by using features of newer versions of C++. You will find these functions as well as some test code in swartmps.cpp.
Several routines are presumably published here for the first time.
On this site I assume that the underlying machine has a word size of a power of 2 using two's complement for integer numbers. For the very rare non-conforming systems I have not taken any precautions and thus you should not use the routines on them but with extreme care and exhaustive tests.
In most cases it is easily possible to provide means to support subword sizes which are not powers of 2 or even non-uniform ones but this implies more complex parameterizing, thus for the sake of simplicity this is left out.


(To top) Concrete classes

For convenience the template classes have standard implementations for the usual 8, 16, 32 and 64 bit number types. If supported, 128 bit is also implemented. There are these concrete types:

Base \ Bits 8 16 32 64 128
t_bits<...> t_bits_8 t_bits_16 t_bits_32 t_bits_64 t_bits_128
t_simd<...> t_simd_8 t_simd_16 t_simd_32 t_simd_64 t_simd_128
t_bfly<...> t_bfly_8 t_bfly_16 t_bfly_32 t_bfly_64 t_bfly_128
t_cef<...> t_cef_8 t_cef_16 t_cef_32 t_cef_64 t_cef_128
t_ce_right<...> t_ce_right_8 t_ce_right_16 t_ce_right_32 t_ce_right_64 t_ce_right_128
t_ce_left<...> t_ce_left_8 t_ce_left_16 t_ce_left_32 t_ce_left_64 t_ce_left_128
t_vrot<...> t_vrot_8 t_vrot_16 t_vrot_32 t_vrot_64 t_vrot_128

Also, there some test routines which check the integrity of the library and which can be seen as a sample application as well.


(To top) General functions which ought to be in std::


(To top) t_subword

This is the enumeration type t_subword of all subword sizes. It can be casted to int and then interpreted as the binary logarithm (log2) of the subword size in bits, i.e.
(int) (t_subword::swar_byte) == 3
A byte has 23 = 8 bits.

Name log2(bits) bits
bit 0 1
nyp 1 2
nibble 2 4
byte 3 8
word 4 16
dword 5 32
qword 6 64
dqword 7 128

(To top) t_tbm_mode

This is the enumeration type t_tbm_mode of all modes for tbm (trailing bit modification operations), see t_bits_base.

Name Value Description
lo_0 0x00 Set lower bits to 0
lo_1 0x01 Set lower bits to 1
lo_mask 0x01 Set lower bits (mask)
tgt_0 0x00 Set found least significant bit to 0
tgt_1 0x02 Set found least significant bit to 1
tgt_mask 0x02 Set found least significant bit (mask)
hi_0 0x00 Set upper bits to 0
hi_1 0x04 Set upper bits to 1
hi_keep 0x08 Keep upper bits
hi_invert 0x0c Invert upper bits
hi_mask 0x0c Treat upper bits (mask)
search_0 0x00 Search for least significant 0 bit
search_1 0x10 Search for least significant 1 bit
search_mask 0x10 Search for least significant (mask)

(To top) t_round_mode

This is the enumeration type t_round_mode of all floating point rounding modes, see also rounding modes of Java.
In some cases this mode can also be applied to integer arithmetic.
For unsigned functions u/d are preferred over f/c (same meaning).

Suffix Name Description -2 -1.6 -1.5 -1.4 -1 -0.6 -0.5 -0.4 0 0.4 0.5 0.6 1 1.4 1.5 1.6 2
e half_even round, *.5: to next even, banker's rounding, x87:0 -2 -2 -2 -1 -1 -1 0 0 0 0 0 1 1 1 2 2 2
f floor floor, toward -infinity, sar, x87:1 -2 -2 -2 -2 -1 -1 -1 -1 0 0 0 0 1 1 1 1 2
c ceil ceil, toward infinity, x87:2 -2 -1 -1 -1 -1 0 0 0 0 1 1 1 1 2 2 2 2
d down trunc, toward 0, chop, div, x87:3 -2 -1 -1 -1 -1 0 0 0 0 0 0 0 1 1 1 1 2
u up away from 0 -2 -2 -2 -2 -1 -1 -1 -1 0 1 1 1 1 2 2 2 2
o half_odd round, *.5: to next odd -2 -2 -1 -1 -1 -1 -1 0 0 0 1 1 1 1 1 2 2
hf half_floor round, *.5: floor -2 -2 -2 -1 -1 -1 -1 0 0 0 0 1 1 1 1 2 2
hc half_ceil round, *.5: ceil -2 -2 -1 -1 -1 -1 0 0 0 0 1 1 1 1 2 2 2
hd half_down round, *.5: trunk -2 -2 -1 -1 -1 -1 0 0 0 0 0 1 1 1 1 2 2
hu half_up round, *.5: away from 0 -2 -2 -2 -1 -1 -1 -1 0 0 0 1 1 1 1 2 2 2

(To top) t_bits_base

The template class t_bits_base is the base implementation for bit operations acting on a whole type.


(To top) Rotating


(To top) Shifting


(To top) Counting bits


(To top) Miscellaneous


(To top) Permuting


(To top) Multiplication


(To top) Carry-less multiplication

"Carry-less multiplication" (cl_mul). The properties are quite similar to the usual multiplication (and addition).
Properties (^ is the XOR operator; * is the cl_mul operator):

cl_mul and xor form a commutative ring.


(To top) Cyclic carry-less multiplication


Here is funny idea I've never seen elsewhere: "Cyclic carry-less multiplication" (ccl_mul). The properties are quite similar to the usual carry-less multiplication.
Properties (^ is the XOR operator; * is the ccl_mul operator):

ccl_mul and xor form a commutative ring.
I've implemented the inverse function by a bit matrix inversion by Gauß/Jordan elimination. Not nice, not fast, but feasible.


(To top) Operations on low bits with specified bit count



(To top) t_simd_base

Base implementation for SWAR operations using t_bits. The base type is splitted in subwords as described by the parameter sw.


(To top) Basic functions


(To top) Count bits/elements


(To top) Math

Wrap around, ignore overflows

Saturate overflows

Detect overflows

Averages

Miscellaneous


(To top) Comparisons

Comparisons returning MSB

Comparisons returning LSB

Comparisons returning masks

Other operations


(To top) Shift by common amount


(To top) Rotate by common amount


(To top) Shift by fieldwise specified amount


(To top) Rotate by fieldwise specified amount


(To top) Special functions


(To top) Horizontal math


(To top) Shuffle operations


(To top) CEF operations (via aux object)


(To top) CE operations (via aux object)


(To top) t_bfly_base

This structure is used to hold the configuration of butterfly-based operations as well as compress and expand.


(To top) t_cef_base

Compress/Expand flip left/right.


(To top) t_ce_right_base

Compress/Expand right, rest filled with 0.


(To top) t_ce_left_base

Compress/Expand left, rest filled with 0.


(To top) t_vrot_base

Faster variant of t_simd_base::vrol/vror when using the same rotate counts repeatedly.


(To top) Index

Here is a list of words, types and functions and where you find them.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z


You may bookmark this page as http://programming.sirrida.de?swar.html.
Last change: 2016-07-02