SIMD/SWAR

Intro
Overview
Concrete classes
Details
Index

Comments or questions? E-mail: info@sirrida.de

Intro

This is a description of SIMD / SWAR operations as well as some routines performing them. You can download an implementation of the routines.
The library was partly inspired by the proposal of standardizing bit operations in C++ and could serve as such.
Some stuff found on other places of this site is repeated here. The routines described here usually have an additional parameter which describes the subword size.

During the history of processors the register sizes have been increased several times, usually by doubling the size. For example, the first microprocessor Intel 4004 (1971) had registers of a size of only 4 bit. Later cousins had 8, 16, 32, and 64 bit register sizes.
While these wide registers allow for huge numbers, very often only a small subrange of these numbers are really needed thus wasting memory, processing capacity and power usage.
On the other hand many operations act on several elements in the same way. In order to cope with this, some architectures sported some kind of mini-vectors such as 3d-now (AMD), MMX, SSE, AVX (Intel) or Altivec (PowerPC).

There are means to simulate some these functions with a fair amount of overhead (but without accessing each subelement one by one) as is done in my library. The emulation is most effective for small subwords within large words as the overhead per element is minimized.
The acronym "SWAR" was coined in fall 1996 by Henry G. (Hank) Dietz and Randall J. Fisher as an abbreviation to "SIMD within a register" where "SIMD" means "single instruction, multiple data".
Some operations such as the binary operations NOT, AND, OR and XOR are inherently working in parallel on arbitrary subwords since there is no interaction between bits on different places. However most others such as addition have to deal with such interactions.

As an example we work out how to get addition to work on subwords without letting carries from one subword disturb other subwords which will occur for an unprepared addition of whole words.

3 bit addition:

+	000	001	010	011	100	101	110	111
000	000	001	010	011	100	101	110	111
001	001	010	011	100	101	110	111	1000
010	010	011	100	101	110	111	1000	1001
011	011	100	101	110	111	1000	1001	1010
100	100	101	110	111	1000	1001	1010	1011
101	101	110	111	1000	1001	1010	1011	1100
110	110	111	1000	1001	1010	1011	1100	1101
111	111	1000	1001	1010	1011	1100	1101	1110

As you may notice there is no carry to the next subword (overflow, extra leftmost digit) when both operands have their MSB clear (yellow quadrant).
One approach is to simply use these smaller values, i.e. set the MSB always to zero ("spacer bits") before and after the addition. This is very fast and easy but we loose these spacer bits for calculation. More interesting is utilizing these bits too at the expense of some more effort.
The 3 other quadrants can be constructed to be without overflow from the simple one by XORing the result's MSB with the MSB of both operands since XOR is a carry-less addition of single bits.
Amazingly, this even works when we additionally add 1, i.e. a+b+1. We also call this 1 "carry" (c). This extra 1 can be utilized e.g. for subtraction (see below).
Thus we may write the expected sum without carries between subwords as follows (M: MSB mask = 100, xx: mask of mantissa = 011):
simple_sum = (a & xx) + (b & xx) + c
sum = simple_sum ^ (a & M) ^ (b & M) = simple_sum ^ ((a ^ b) & M)

We can thus perform multiple additions at once on disjunct bit fields such as 8 4-bit-fields within a 32 bit word.
All we need to let this kind of calculation create the expected results is the relevant masks such as MSB. In this case even irregular subword partitions are possible which might be of interest, e.g. for holding dimension vectors or color values ("Hi-Color": 5+6+5 bits) of pixels.

By the way: a-b can be calculated as a+(-b) = a+(~b+1) = a+~b+1 using the algorithm just described.
Moreover, a-b-1 = a+(-b)-1 = a+(~b+1)-1 = a+~b.

Overview

I have started to provide a quite comprehensive library of such functions by using features of newer versions of C++. You will find these functions as well as some test code in swartmps.cpp.
Several routines are presumably published here for the first time.
On this site I assume that the underlying machine has a word size of a power of 2 using two's complement for integer numbers. For the very rare non-conforming systems I have not taken any precautions and thus you should not use the routines on them but with extreme care and exhaustive tests.
In most cases it is easily possible to provide means to support subword sizes which are not powers of 2 or even non-uniform ones but this implies more complex parameterizing, thus for the sake of simplicity this is left out.

Concrete classes

For convenience the template classes have standard implementations for the usual 8, 16, 32 and 64 bit number types. If supported, 128 bit is also implemented. There are these concrete types:

Base \ Bits	8	16	32	64	128
`t_bits<...>`	`t_bits_8`	`t_bits_16`	`t_bits_32`	`t_bits_64`	`t_bits_128`
`t_simd<...>`	`t_simd_8`	`t_simd_16`	`t_simd_32`	`t_simd_64`	`t_simd_128`
`t_bfly<...>`	`t_bfly_8`	`t_bfly_16`	`t_bfly_32`	`t_bfly_64`	`t_bfly_128`
`t_cef<...>`	`t_cef_8`	`t_cef_16`	`t_cef_32`	`t_cef_64`	`t_cef_128`
`t_ce_right<...>`	`t_ce_right_8`	`t_ce_right_16`	`t_ce_right_32`	`t_ce_right_64`	`t_ce_right_128`
`t_ce_left<...>`	`t_ce_left_8`	`t_ce_left_16`	`t_ce_left_32`	`t_ce_left_64`	`t_ce_left_128`
`t_vrot<...>`	`t_vrot_8`	`t_vrot_16`	`t_vrot_32`	`t_vrot_64`	`t_vrot_128`

Also, there some test routines which check the integrity of the library and which can be seen as a sample application as well.

General functions which ought to be in `std::`

M_abs(x,y): Absolute value.
M_gcd(x,y): Greatest common divisor.
M_odd(x): Return true for odd values, false for even.
M_sgn(x): Return the sign: <0: -1, =0: 0, >0: 1
M_min(x,y): Minimum value.
M_max(x,y): Maximum value.
M_abs_diff(x,y): Equivalent to abs(x-y) via a larger type; works for signed and unsigned. Result type ought to be unsigned.
M_sgn_diff(x,y): Equivalent to sgn(x-y) via a larger type; works for signed and unsigned.

`t_subword`

This is the enumeration type t_subword of all subword sizes. It can be casted to int and then interpreted as the binary logarithm (log₂) of the subword size in bits, i.e.
(int) (t_subword::swar_byte) == 3
A byte has 2³ = 8 bits.

Name	log₂(bits)	bits
`bit`	0	1
`nyp`	1	2
`nibble`	2	4
`byte`	3	8
`word`	4	16
`dword`	5	32
`qword`	6	64
`dqword`	7	128

`t_tbm_mode`

This is the enumeration type t_tbm_mode of all modes for tbm (trailing bit modification operations), see t_bits_base.

Name	Value	Description
`lo_0`	0x00	Set lower bits to 0
`lo_1`	0x01	Set lower bits to 1
`lo_mask`	0x01	Set lower bits (mask)
`tgt_0`	0x00	Set found least significant bit to 0
`tgt_1`	0x02	Set found least significant bit to 1
`tgt_mask`	0x02	Set found least significant bit (mask)
`hi_0`	0x00	Set upper bits to 0
`hi_1`	0x04	Set upper bits to 1
`hi_keep`	0x08	Keep upper bits
`hi_invert`	0x0c	Invert upper bits
`hi_mask`	0x0c	Treat upper bits (mask)
`search_0`	0x00	Search for least significant 0 bit
`search_1`	0x10	Search for least significant 1 bit
`search_mask`	0x10	Search for least significant (mask)

`t_round_mode`

This is the enumeration type t_round_mode of all floating point rounding modes, see also rounding modes of Java.
In some cases this mode can also be applied to integer arithmetic.
For unsigned functions u/d are preferred over f/c (same meaning).

Suffix	Name	Description	-2	-1.6	-1.5	-1.4	-1	-0.6	-0.5	-0.4	0.4	0.5	0.6	1	1.4	1.5	1.6	2
`e`	`half_even`	round, *.5: to next even, banker's rounding, x87:0	-2	-2	-2	-1	-1	-1	0	0	0	0	1	1	1	2	2	2
`f`	`floor`	floor, toward -infinity, sar, x87:1	-2	-2	-2	-2	-1	-1	-1	-1	0	0	0	1	1	1	1	2
`c`	`ceil`	ceil, toward infinity, x87:2	-2	-1	-1	-1	-1	0	0	0	1	1	1	1	2	2	2	2
`d`	`down`	trunc, toward 0, chop, div, x87:3	-2	-1	-1	-1	-1	0	0	0	0	0	0	1	1	1	1	2
`u`	`up`	away from 0	-2	-2	-2	-2	-1	-1	-1	-1	1	1	1	1	2	2	2	2
`o`	`half_odd`	round, *.5: to next odd	-2	-2	-1	-1	-1	-1	-1	0	0	1	1	1	1	1	2	2
`hf`	`half_floor`	round, *.5: floor	-2	-2	-2	-1	-1	-1	-1	0	0	0	1	1	1	1	2	2
`hc`	`half_ceil`	round, *.5: ceil	-2	-2	-1	-1	-1	-1	0	0	0	1	1	1	1	2	2	2
`hd`	`half_down`	round, *.5: trunk	-2	-2	-1	-1	-1	-1	0	0	0	0	1	1	1	1	2	2
`hu`	`half_up`	round, *.5: away from 0	-2	-2	-2	-1	-1	-1	-1	0	0	1	1	1	1	2	2	2

`t_bits_base`

The template class t_bits_base is the base implementation for bit operations acting on a whole type.

Rotating
Shifting
Counting bits
Miscellaneous
Permuting
Multiplication
Carry-less multiplication
Cyclic carry-less multiplication
Operations on low bits with specified bit count

Rotating

t_bits_base::rol(x,rot)
Rotate x left by rot. Gives correct results for all values of rot. Should usually be optimizable to one instruction (ROL).
t_bits_base::ror(x,rot)
Rotate x right by rot. Gives correct results for all values of rot. Should usually be optimizable to one instruction (ROR).

Shifting

t_bits_base::shl_fast(x,shift)
Logically/arithmetically shift x left by shift. Only as guaranteed by the C standard, i.e. 0 ≤ shift < bits.
Should usually be implementable by one instruction (SHL).
t_bits_base::shl_safe(x,shift)
Logically/arithmetically shift x left by shift. Gives correct results for all values of shift; please note that shift is unsigned.
t_bits_base::shr_fast(x,shift)
Logically shift x right by shift. Only as guaranteed by the C standard, i.e. 0 ≤ shift < bits.
Should usually be implementable by one instruction (SHR).
t_bits_base::shr_safe(x,shift)
Logically shift x right by shift. Gives correct results for all values of shift; please note that shift is unsigned.
t_bits_base::sar_fast(x,shift)
Arithmetically shift right, duplicating the most significant (sign) bit. Only as guaranteed by the C standard, i.e. 0 ≤ shift < bits.
Should usually be optimizable to one instruction (SAR).
t_bits_base::sar_safe(x,shift)
Arithmetically shift x right by shift. Gives correct results for all values of shift; please note that shift is unsigned.

Counting bits

t_bits_base::nr_1bits(x)
Count the set bits, aka population count. On newer x86 implementable as POPCNT.
See Hacker's Delight, 5.1 "Counting 1-Bits".
t_bits_base::nr_0bits(x)
Count the reset bits.
t_bits_base::nr_trailing_0bits(x)
On x86 implementable as TZCNT or using BSF. Also implementable by multiplying x&-x with a De Bruijn number, shifting, and a lookup table. Special case 0: word size in bits.
See Hacker's Delight, 5.4 "Counting Trailing 0's".
t_bits_base::nr_trailing_1bits(x)
Number of contiguous least significant set bits.
t_bits_base::nr_leading_0bits(x)
On x86 implementable using BSR or as LZCNT. Special case 0: word size in bits.
See Hacker's Delight, 5.4 "Counting Trailing 0's".
t_bits_base::nr_leading_1bits(x)
Number of contiguous most significant set bits
t_bits_base::is_parity_odd(x)
Count the set bits and return whether this is an odd number. Result as mask.
t_bits_base::is_parity_oddh(x)
Count the set bits and return whether this is an odd number. Result as MSB.
t_bits_base::is_parity_oddl(x)
Count the set bits and return whether this is an odd number. Result as LSB.

Miscellaneous

t_bits_base::sign_extend(x,b)
Sign extended the low b bits of x.
t_bits_base::auto blend(m,x,y)
The bit equivalent to m?x:y.
t_bits_base::gray_code(x)
Gray code.
See Hacker's Delight, 13.1 "Gray Code"
t_bits_base::inv_gray_code(x)
Inverse Gray code.
See Hacker's Delight, 13.1 "Gray Code"
t_bits_base::is_contiguous_1bits(x)
Is x a contiguous string of 1 bits?
t_bits_base::tbm(x,mode)
General trailing bit modification operations. Mode: The operating mode, see t_tbm_mode. All of these operations can be performed by ≤ 3 instructions. Some of these operations are realized as BMI1 or TBM instructions.
t_bits_base::isign(x,y)
The Fortran function ISIGN: copy sign for integer values
(y≥0) ? abs(x) : -abs(x)
Should be very fast.
See Hacker's Delight, 2.9 "Transfer of Sign"
t_bits_base::xsign(x,y)
Simple transfer sign
(y≥0) ? x : -x
Should be very fast.

Permuting

t_bits_base::bit_permute_step(x,m,shift)
Can be replaced by bit_permute_step_simple, if for the relevant bits n the following holds: nr_1bits(bit_permute_step_simple(n, m, shift)) = nr_1bits(n) shift should be in 0..bits-1. This is a candidate for a new instruction.
x86: ≥ 6/5 cycles
ARM: ≥ 4/4 cycles
t_bits_base::bit_permute_step_simple(x,m,shift)
Simplified replacement of bit_permute_step Can always be replaced by bit_permute_step (not vice-versa). shift should be in 0..bits-1.
x86: ≥ 5/4 (5/3) cycles
ARM: ≥ 3/2 cycles
t_bits_base::bit_permute_step2(x1,x2,m,shift)
Extended variant of bit_permute_step. Will be slow if not inlined. shift should be in 0..bits-1. This is a candidate for a new instruction.
t_bits_base::bit_permute_step_rot(x,m,shift)
This is similar to bit_permute_step but rotating instead of shifting. shift should be in 0..bits-1. This is a candidate for a new instruction.
x86: ≥ 6/5 cycles
ARM: ≥ 4/4 cycles
t_bits_base::bit_permute_step2_rot(x1,x2,m,shift)
This is similar to bit_permute_step2 but rotating instead of shifting. Extended variant of bit_permute_step_rot. Will be slow if not inlined. shift should be in 0..bits-1.
This is a candidate for a new instruction.

Multiplication

t_bits_base::mul_inv(x)
Calculate multiplicative inverse, i.e. mul_inv(x)*x==1. The result is defined for all odd values of x. For even x we simply return 0.
See Hacker's Delight, 10.16 "Exact division by constants" Multiplicative inverse modulo 2**bits by Newton's method.

Carry-less multiplication

"Carry-less multiplication" (cl_mul). The properties are quite similar to the usual multiplication (and addition).
Properties (^ is the XOR operator; * is the cl_mul operator):

Associative: a^(b^c) = (a^b)^c, a*(b*c) = (a*b)*c
Commutative: a^b = b^a, a*b = b*a
Distributive: a*(b^c) = (a*b)^(a*c), (a^b)*c = (a*c)^(b*c)
Neutral element: 0^a = a^0 = a, 1*a = a*1 = a
Zero element: 0*a = a*0 = 0
^ is always invertible: a^a = 0
* is invertible for odd values: x*inv(x) = inv(x)*x = 1
Shift function: shl(a,x)*shl(b,y) = shl(a*b,x+y)

cl_mul and xor form a commutative ring.

t_bits_base::cl_mat_inv(src,inv)
Bit matrix inversion by Gauß/Jordan elimination.
t_bits_base::cl_mul(x,y)
Carry-less multiplication, lower bits of result. Invertible by cl_mul_inv for odd numbers.
t_bits_base::cl_mul_inv(x)
Calculate carry-less multiplicative inverse, i.e. cl_mul(cl_mul_inv(x),x) == 1. The result is defined for all odd values of x. For even x we simply return 0.
t_bits_base::cl_power(x,y)
x to the y'th via cl_mul. For negative y see cl_mul_inv.

Cyclic carry-less multiplication

Here is funny idea I've never seen elsewhere: "Cyclic carry-less multiplication" (ccl_mul). The properties are quite similar to the usual carry-less multiplication.
Properties (^ is the XOR operator; * is the ccl_mul operator):

Associative: a^(b^c) = (a^b)^c, a*(b*c) = (a*b)*c
Commutative: a^b = b^a, a*b = b*a
Distributive: a*(b^c) = (a*b)^(a*c), (a^b)*c = (a*c)^(b*c)
Neutral element: 0^a = a^0 = a, 1*a = a*1 = a
Zero element: 0*a = a*0 = 0
^ is always invertible: a^a = 0
* is invertible for odd popcnt: x*inv(x) = inv(x)*x = 1
Rotate function: rot(a,x)*rot(b,y) = rot(a*b,x+y)

ccl_mul and xor form a commutative ring.
I've implemented the inverse function by a bit matrix inversion by Gauß/Jordan elimination. Not nice, not fast, but feasible.

t_bits_base::ccl_mul(x,y)
Cyclic carry-less multiplication. Invertible by ccl_mul_inv for numbers with odd parity.
t_bits_base::ccl_mul_inv(x)
Calculate cyclic carry-less multiplicative inverse, i.e. ccl_mul(x, ccl_mul_inv(x))==1. The result is defined for all values of x with odd popcount. For x with even popcount we simply return 0.
t_bits_base::ccl_power(x,y)
x to the y'th via ccl_mul. For negative y see ccl_mul_inv.

Operations on low bits with specified bit count

t_bits_base::mask_ex(b)
Calculate the mask for b bits to zero out the remaining bits.
t_bits_base::ror_ex(x,r,b)
Rotate right low b bits of x by r; remaining bits are zeroed out. = rol_ex(x,-r,b);
t_bits_base::rol_ex(x,r,b)
Rotate left low b bits of x by r; remaining bits are zeroed out. = ror_ex(x,-r,b);
t_bits_base::shr_ex(x,r,b)
Shift right low b bits of x by r; remaining bits are zeroed out.
t_bits_base::shl_ex(x,r,b)
Shift left low b bits of x by r; remaining bits are zeroed out.
t_bits_base::sar_ex(x,r,b)
Arithmetically (duplicating the MSB) shift right low b bits of x by r; remaining bits are sign extended.
t_bits_base::sal_ex(x,r,b)
Shift left low b bits of x by r duplicating the LSB; remaining bits are zeroed out.
t_bits_base::shr1_ex(x,r,b)
Shift right low b bits of x by r shifting in 1 bits; remaining bits are zeroed out.
t_bits_base::shl1_ex(x,r,b)
Shift left low b bits of x by r shifting in 1 bits; remaining bits are zeroed out.
t_bits_base::ccl_mul_ex(x,y,b)
Cyclic carry-less multiplication. Invertible by ccl_mul_inv_ex for numbers with odd parity.
t_bits_base::ccl_mul_inv_ex(x,b)
Calculate cyclic carry-less multiplicative inverse, i.e. ccl_mul(x, ccl_mul_inv(x))==1. The result is defined for all values of x with odd popcount. For x with even popcount we simply return 0.
t_bits_base::ccl_power_ex(x,y,b)
x to the y'th via ccl_mul. For negative y see ccl_mul_inv_ex.

`t_simd_base`

Base implementation for SWAR operations using t_bits. The base type is splitted in subwords as described by the parameter sw.

Basic functions
Count bits/elements
Math
Comparisons
Shift by common amount
Rotate by common amount
Shift by fieldwise specified amount
Rotate by fieldwise specified amount
Special functions
Horizontal math
Shuffle operations
CEF operations (via aux object)
CE operations (via aux object)

Basic functions

t_simd_base::constant(sw,c)
Fill every subword with c, i.e. broadcast. Also implementable by inductive doubling.
t_simd_base::element_mask(sw,i)
Create a mask for the subword #i.
t_simd_base::extract(sw,x,i)
Extract the subword #i of x.
t_simd_base::implant(sw,x,e,i)
Implant e into the subword #i of x.

Count bits/elements

t_simd_base::nr_1bits(sw,x)
Count the number of set bits. Equivalent to loop i in 0..sw-1: hadd_u(sw, x)
t_simd_base::nr_0bits(sw,x)
Count the reset bits
t_simd_base::nr_trailing_0bits(sw,x)
Count the number of trailing zero bits. ~x&(x-1): smear least significant 1 bit to the left. This is simpler than the parallel prefix method.
t_simd_base::nr_trailing_1bits(sw,x)
Number of contiguous least significant set bits
t_simd_base::nr_leading_0bits(sw,x)
Count the number of leading zero bits.
t_simd_base::nr_leading_1bits(sw,x)
Number of contiguous most significant set bits
t_simd_base::contains_0(sw,x)
True iff there is at least one subword with the value 0.
See http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
See Paul Curtis, 2004-12-15 in http://hackersdelight.org/corres.txt
See also http://9.douban.com/subject/9065343/
See discussion on http://stackoverflow.com/questions/20021066/ how-the-glibc-strlen-implementation-works Thereby we can also find the lowest zero subword. Not suitable to find all zeros, i.e. this can not replace eq0 (subwords left to the lowest zero might be flagged as zero).
t_simd_base::nr_trailing_ne0(sw,x)
See http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
See Paul Curtis, 2004-12-15 in http://hackersdelight.org/corres.txt
See contains_0.
t_simd_base::is_parity_odd(x)
Count the set bits and return whether this is an odd number by setting mask.
t_simd_base::is_parity_oddl(x)
Count the set bits and return whether this is an odd number by setting low bit.
t_simd_base::is_parity_oddh(x)
Count the set bits and return whether this is an odd number by setting high bit.

Math

Wrap around, ignore overflows

t_simd_base::add(sw,x,y)
Add with wrap-around x+y.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::sub(sw,x,y)
Subtract with wrap-around x-y.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::add1(sw,x,y)
Add with wrap-around x+y+1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::sub1(sw,x,y)
Subtract with wrap-around x-y-1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::addc(sw,x,y,carry)
Add with wrap-around x+y+carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::subc(sw,x,y,carry)
Subtract with wrap-around x-y-carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::inc(sw,x)
Increment with wrap-around. =add(sw, x, constant(sw, 1))
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::dec(sw,x)
Decrement with wrap-around. =sub(sw, x, constant(sw, 1))
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::neg(sw,x)
Negate with wrap-around. =sub(sw, 0, x)

Saturate overflows

t_simd_base::add_sat_u(sw,x,y)
Unsigned saturated add x+y.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::sub_sat_u(sw,x,y)
Unsigned saturated subtract x-y.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::add1_sat_u(sw,x,y)
Unsigned saturated add x+y+1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::sub1_sat_u(sw,x,y)
Unsigned saturated subtract x-y-1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::addc_sat_u(sw,x,y,carry)
Unsigned saturated add x+y+carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::subc_sat_u(sw,x,y,carry)
Unsigned saturated subtract x-y-carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::add_sat_s(sw,x,y)
Signed saturated add x+y.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::sub_sat_s(sw,x,y)
Signed saturated sub x-y. Derived from add_sat_s.
t_simd_base::add1_sat_s(sw,x,y)
Signed saturated add x+y+1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::sub1_sat_s(sw,x,y)
Signed saturated sub x-y-1. Derived from add_sat_s.
t_simd_base::addc_sat_s(sw,x,y,carry)
Signed saturated add x+y+carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::subc_sat_s(sw,x,y,carry)
Signed saturated sub x-y-carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1. Derived from add_sat_s.
t_simd_base::neg_sat_s(sw,x)
Signed saturated negate; the only value which will be saturated is the smallest possible signed number; see abs for a similar problem. =sub_sat_s(sw,0,x).

Detect overflows

t_simd_base::overflow_add_u(sw,x,y)
Detect overflow in unsigned addition x+y.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Unsigned Add/Subtract)
t_simd_base::overflow_sub_u(sw,x,y)
Detect overflow in unsigned subtraction x-y.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Unsigned Add/Subtract)
t_simd_base::overflow_add_s(sw,x,y)
Detect overflow in signed addition x+y.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Signed Add/Subtract)
t_simd_base::overflow_sub_s(sw,x,y)
Detect overflow in signed subtraction x-y.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Signed Add/Subtract)
t_simd_base::overflow_add1_u(sw,x,y)
Detect overflow in unsigned addition x+y+1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Unsigned Add/Subtract)
t_simd_base::overflow_sub1_u(sw,x,y)
Detect overflow in unsigned subtraction x-y-1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Unsigned Add/Subtract)
t_simd_base::overflow_add1_s(sw,x,y)
Detect overflow in signed addition x+y+1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Signed Add/Subtract)
t_simd_base::overflow_sub1_s(sw,x,y)
Detect overflow in signed subtraction x-y-1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Signed Add/Subtract)
t_simd_base::overflow_addc_u(sw,x,y,carry)
Detect overflow in unsigned addition x+y+carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Unsigned Add/Subtract)
t_simd_base::overflow_subc_u(sw,x,y,carry)
Detect overflow in unsigned subtraction x-y-carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Unsigned Add/Subtract)
t_simd_base::overflow_addc_s(sw,x,y,carry)
Detect overflow in signed addition x+y+carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Signed Add/Subtract)
t_simd_base::overflow_subc_s(sw,x,y,carry)
Detect overflow in signed subtraction x-y-carry. Only least significant bits of carry are used, i.e. carry may be 0 or 1.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
See HD 2.13 "Overflow Detection" (Signed Add/Subtract)

Averages

t_simd_base::avgd_u(sw,x,y)
Unsigned average rounding down, i.e. (x+y)>>1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::avgu_u(sw,x,y)
Unsigned average rounding up, i.e. (x+y+1)>>1.
See Falk Hueffner 2004-01-23 "Multibyte (or SIMD) arithmetic" in http://hackersdelight.org/corres.txt
t_simd_base::avgf_s(sw,x,y)
Signed average rounding toward -∞.
t_simd_base::avgc_s(sw,x,y)
Signed average rounding toward ∞.

Miscellaneous

t_simd_base::abs_diff_u(sw,x,y)
abs(x-y) for unsigned values. =sub_sat_u(sw, x, y)+sub_sat_u(sw, y, x); Alternative: ((x-y) xor (x<y)) - (x<y).
t_simd_base::abs_diff_s(sw,x,y)
abs(x-y) for signed values resulting in unsigned values.
t_simd_base::sgn_diff_u(sw,x,y)
sgn(x-y) for unsigned values resulting in unsigned values.
t_simd_base::sgn_diff_s(sw,x,y)
sgn(x-y) for signed values.
t_simd_base::smul_lo(sw,c,x)
Multiply c with subwords of x yielding the lower bits (scalar multiplication). Suitable for signed and unsigned values.
t_simd_base::smul_hi_u(sw,c,x)
Multiply c with subwords of x yielding the upper bits (scalar multiplication). Unsigned values.
t_simd_base::smul_hi_s(sw,c,x)
Multiply c with subwords of x yielding the upper bits (scalar multiplication). Signed values.
t_simd_base::smule_u(sw,c,x)
Multiply c with even subwords (#0, #2, #4, ...) of x (scalar multiplication). Suitable for unsigned values. sw must not be the maximum value, i.e. there must be at least 2 subwords.
t_simd_base::smule_s(sw,c,x)
Multiply c with even subwords (#0, #2, #4, ...) of x (scalar multiplication). Suitable for signed values. sw must not be the maximum value, i.e. there must be at least 2 subwords.
t_simd_base::mul_lo(sw,x,y)
Multiply corresponding subwords of x and y yielding the lower bits. Suitable for signed and unsigned values. Stupid emulation due to absence of a matching instruction/operation. Emulation of the multiplication usually is far too slow (O(subword size)).

Comparisons

Comparisons returning MSB

t_simd_base::eq0h(sw,x)
Set MSB if x==0.
See HD 6.1 "Find First 0-Byte"
t_simd_base::ne0h(sw,x)
Set MSB if x!=0.
t_simd_base::eqh(sw,x,y)
Set MSB if x==y.
t_simd_base::neh(sw,x,y)
Set MSB if x!=y.
t_simd_base::leh_u(sw,x,y)
Set MSB if x≤y (unsigned). HD 6.1 "Searching for a Value in a Given Range"
t_simd_base::lth_u(sw,x,y)
Set MSB if x<y (unsigned).
t_simd_base::geh_u(sw,x,y)
Set MSB if x≥y (unsigned).
t_simd_base::gth_u(sw,x,y)
Set MSB if x>y (unsigned).
t_simd_base::lt0h_s(sw,x)
Set MSB if x<0 (signed), i.e. propagate high (sign) bit.
t_simd_base::leh_s(sw,x,y)
Set MSB if x≤y (signed).
t_simd_base::lth_s(sw,x,y)
Set MSB if x<y (signed).
t_simd_base::geh_s(sw,x,y)
Set MSB if x≥y (signed).
t_simd_base::gth_s(sw,x,y)
Set MSB if x>y (signed).
t_simd_base::oddh(sw,x)
Set MSB for odd values, i.e. propagate low bit to the left.

Comparisons returning LSB

t_simd_base::eq0l(sw,x)
Set LSB if x==0.
See HD 6.1 "Find First 0-Byte"
t_simd_base::ne0l(sw,x)
Set LSB if x!=0.
t_simd_base::eql(sw,x,y)
Set LSB if x==y.
t_simd_base::nel(sw,x,y)
Set LSB if x!=y.
t_simd_base::lel_u(sw,x,y)
Set LSB if x≤y (unsigned). HD 6.1 "Searching for a Value in a Given Range"
t_simd_base::ltl_u(sw,x,y)
Set LSB if x<y (unsigned).
t_simd_base::gel_u(sw,x,y)
Set LSB if x≥y (unsigned).
t_simd_base::gtl_u(sw,x,y)
Set LSB if x>y (unsigned).
t_simd_base::lt0l_s(sw,x)
Set LSB if x<0 (signed), i.e. propagate high (sign) bit.
t_simd_base::lel_s(sw,x,y)
Set LSB if x≤y (signed).
t_simd_base::ltl_s(sw,x,y)
Set LSB if x<y (signed).
t_simd_base::gel_s(sw,x,y)
Set LSB if x≥y (signed).
t_simd_base::gtl_s(sw,x,y)
Set LSB if x>y (signed).
t_simd_base::oddl(sw,x)
Set LSB for odd values, i.e. propagate low bit to the left.

Comparisons returning masks

t_simd_base::eq0(sw,x)
Set mask denoting x==0.
See HD 6.1 "Find First 0-Byte"
t_simd_base::ne0(sw,x)
Set mask denoting x!=0.
t_simd_base::eq(sw,x,y)
Set mask denoting x==y.
t_simd_base::ne(sw,x,y)
Set mask denoting x!=y.
t_simd_base::le_u(sw,x,y)
Set mask denoting x≤y (unsigned). HD 6.1 "Searching for a Value in a Given Range"
t_simd_base::lt_u(sw,x,y)
Set mask denoting x<y (unsigned).
t_simd_base::ge_u(sw,x,y)
Set mask denoting x≥y (unsigned).
t_simd_base::gt_u(sw,x,y)
Set mask denoting x>y (unsigned).
t_simd_base::lt0_s(sw,x)
Set mask denoting x<0 (signed), i.e. propagate high (sign) bit.
t_simd_base::le_s(sw,x,y)
Set mask denoting x≤y (signed).
t_simd_base::lt_s(sw,x,y)
Set mask denoting x<y (signed).
t_simd_base::ge_s(sw,x,y)
Set mask denoting x≥y (signed).
t_simd_base::gt_s(sw,x,y)
Set mask denoting x>y (signed).
t_simd_base::odd(sw,x)
Set mask denoting odd values, i.e. propagate low bit to the left.

Other operations

t_simd_base::max0_s(sw,x)
Clamp negative signed numbers to 0. =max_s(sw, x, 0)
t_simd_base::max_u(sw,x,y)
Unsigned maximum. =add(sw, sub_sat_u(sw, x, y), y); HD 2.19 "Doz, Max, Min"
t_simd_base::min_u(sw,x,y)
Unsigned minimum. HD 2.19 "Doz, Max, Min"
t_simd_base::max_s(sw,x,y)
Signed maximum.
t_simd_base::min_s(sw,x,y)
Signed minimum.
t_simd_base::abs(sw,x)
Absolute values. The result contains unsigned values, however smallest possible signed number remains as is; the similar function nabs does not have this problem.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::nabs(sw,x)
Negated absolute values, i.e. -abs(x). The result contains signed values.
See HD 2.18 "Multibyte Add, Subtract, Absolute Value"
t_simd_base::sgn(sw,x)
Sign of x, i.e. <0: -1, =0: 0, >0: 1.
See eq0.
See HD 6.1 "Find First 0-Byte"
t_simd_base::sgn_sat(sw,x)
Sign of x enlarged to saturate. <0: minimum value, =0: 0, >0: maximum value

Shift by common amount

t_simd_base::shl(sw,x,shift)
Shift left, shifting in 0 bits; also known as logical shift. Gives correct results for all values of shift; please note that shift is unsigned.
t_simd_base::shr(sw,x,shift)
Shift right, shifting in 0 bits; also known as logical shift. Gives correct results for all values of shift; please note that shift is unsigned.
t_simd_base::sal(sw,x,shift)
Shift left, duplicating the least significant bit. Gives correct results for all values of shift; please note that shift is unsigned.
t_simd_base::sar(sw,x,shift)
Arithmetically shift right, duplicating the most significant (sign) bit. Gives correct results for all values of shift; please note that shift is unsigned. This is also known as arithmetic shift.
t_simd_base::shl1(sw,x,shift)
Shift left, shifting in 1 bits. Gives correct results for all values of shift; please note that shift is unsigned.
t_simd_base::shr1(sw,x,shift)
Shift right, shifting in 1 bits. Gives correct results for all values of shift; please note that shift is unsigned.
t_simd_base::shl_sat_s(sw,x,shift)
t_simd_base::shl_sat_u(sw,x,shift)
t_simd_base::sal_sat_s(sw,x,shift)
t_simd_base::sal_sat_u(sw,x,shift)
t_simd_base::shl1_sat_s(sw,x,shift)
t_simd_base::shl1_sat_u(sw,x,shift)

Rotate by common amount

t_simd_base::rol(sw,x,rot)
Every (1 << (t_int)(sw)) bits are rotated left. Gives correct results for all values of rot. x: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count
t_simd_base::ror(sw,x,rot)
Every (1 << (t_int)(sw)) bits are rotated right. Gives correct results for all values of rot. x: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count
t_simd_base::rolc(sw,x,rot)
Every (1 << (t_int)(sw)) bits are rotated left complementing the shifted in bits. Gives correct results for all values of rot. x: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count
t_simd_base::rorc(sw,x,rot)
Every (1 << (t_int)(sw)) bits are rotated right complementing the shifted in bits. Gives correct results for all values of rot. x: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count
t_simd_base::rold(sw,l,h,rot)
Rotate every (1 << (t_int)(sw)) bits of l and h where the bit groups of l and h are considered to be one group. This can be used to implement many kinds of shift and rotate. Gives correct results for all values of rot. l, h: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count Example: l=hgfedcba, h=HGFEDCBA, sw=2, rot=1: gfeHcbaD
t_simd_base::rord(sw,l,h,rot)
Rotate every (1 << (t_int)(sw)) bits of l and h where the bit groups of l and h are considered to be one group. This can be used to implement many kinds of shift and rotate. Gives correct results for all values of rot. l, h: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count
t_simd_base::do_rold(sw,l,h,rot)
Variant of rold modifying *l and *h. Rotate every (1 << (t_int)(sw)) bits of *l and *h where the bit groups of l and h are considered to be one group. This can be used to implement many kinds of shift and rotate. Gives correct results for all values of rot. *l, *h: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count Example: l=hgfedcba, h=HGFEDCBA, sw=2, rot=1: gfeHcbaD
t_simd_base::do_rord(sw,l,h,rot)
Variant of rord modifying *l and *h. Rotate every (1 << (t_int)(sw)) bits of *l and *h where the bit groups of l and h are considered to be one group. This can be used to implement many kinds of shift and rotate. Gives correct results for all values of rot. *l, *h: value sw: log_2(#bits), must be ≤ld_bits rot: rotate count

Shift by fieldwise specified amount

t_simd_base::vshl(sw,x,shift)
Field variable variant of shl. Gives correct results for all values of shift; please note that shift values are treated as unsigned.
t_simd_base::vshr(sw,x,shift)
Field variable variant of shr. Gives correct results for all values of shift; please note that shift values are treated as unsigned.
t_simd_base::vsar(sw,x,shift)
Field variable variant of sar. Gives correct results for all values of shift; please note that shift values are treated as unsigned.
t_simd_base::vsal(sw,x,shift)
Field variable variant of sal. Gives correct results for all values of shift; please note that shift values are treated as unsigned.
t_simd_base::vshl1(sw,x,shift)
Field variable variant of shl1. Gives correct results for all values of shift; please note that shift values are treated as unsigned.
t_simd_base::vshr1(sw,x,shift)
Field variable variant of shr1. Gives correct results for all values of shift; please note that shift values are treated as unsigned.
t_simd_base::vshl_sat_s(sw,x,shift)
t_simd_base::vshl_sat_u(sw,x,shift)
t_simd_base::vsal_sat_s(sw,x,shift)
t_simd_base::vsal_sat_u(sw,x,shift)
t_simd_base::vshl1_sat_s(sw,x,shift)
t_simd_base::vshl1_sat_u(sw,x,shift)

Rotate by fieldwise specified amount

t_simd_base::vrol(sw,x,rot)
Field variable variant of rol. Gives correct results for all values of rot.
t_simd_base::vror(sw,x,rot)
Field variable variant of ror. Gives correct results for all values of rot.
t_simd_base::vrolc(sw,x,rot)
Field variable variant of rolc. Gives correct results for all values of rot.
t_simd_base::vrorc(sw,x,rot)
Field variable variant of rorc. Gives correct results for all values of rot.
t_simd_base::vrold(sw,l,h,rot)
Field variable variant of rold. Gives correct results for all values of rot.
t_simd_base::vrord(sw,l,h,rot)
Field variable variant of rord. Gives correct results for all values of rot.
t_simd_base::do_vrold(sw,l,h,rot)
Variant of vrold modifying *l and *h. Gives correct results for all values of rot.
t_simd_base::do_vrord(sw,l,h,rot)
Variant of vrord modifying *l and *h. Gives correct results for all values of rot.

Special functions

t_simd_base::gray_code(sw,x)
See Hacker's Delight, 13.1 "Gray Code"
x ^ (x >> 1)
t_simd_base::inv_gray_code(sw,x)
See Hacker's Delight, 13.1 "Gray Code"
t_simd_base::tbm(sw,x,mode)
General trailing bit modification operations. Mode: The operating mode, see t_tbm_mode. For whole words: All of these operations can be performed by ≤ 3 instructions. Some of these operations are realized as BMI1 or TBM instructions.

Horizontal math

t_simd_base::hand(sw,x)
AND of neighboring subwords. When processing a loop i in 0..(t_int)(sw): subword == -1.
t_simd_base::hor(sw,x)
OR of neighboring subwords. When processing a loop i in 0..(t_int)(sw): subword != 0.
t_simd_base::hxor(sw,x)
XOR of neighboring subwords. When processing a loop i in 0..(t_int)(sw): Gray code / Parity.
t_simd_base::hadd_u(sw,x)
Sum of neighboring unsigned subwords.
See nr_1bits.
t_simd_base::hadd_s(sw,x)
Sum of neighboring signed subwords.
t_simd_base::hsub_u(sw,x)
Difference between neighboring unsigned subwords giving signed results. Calculate lower - higher subword for all pairs.
t_simd_base::hsub_s(sw,x)
Difference between neighboring signed subwords giving signed results. Calculate lower - higher subword for all pairs.
t_simd_base::havgd_u(sw,x)
Unsigned average rounding down, i.e. (x+y)>>1, of neighboring unsigned subwords.
t_simd_base::havgu_u(sw,x)
Unsigned average rounding up, i.e. (x+y+1)>>1, of neighboring unsigned subwords.
t_simd_base::hmax_u(sw,x)
Maximum between neighboring unsigned subwords. HD 2.19 "Doz, Max, Min" max(a, b) = max(0, a-b)+b
t_simd_base::hmin_u(sw,x)
Minimum between neighboring unsigned subwords. HD 2.19 "Doz, Max, Min" min(a, b) = a-max(0, a-b)
t_simd_base::hmax_s(sw,x)
Maximum between neighboring signed subwords. HD 2.19 "Doz, Max, Min" max(a, b) = max(0, a-b)+b
t_simd_base::hmin_s(sw,x)
Minimum between neighboring signed subwords. HD 2.19 "Doz, Max, Min" min(a, b) = a-max(0, a-b)
t_simd_base::hmovzx(sw,x)
Zero extend even (lower) elements.
t_simd_base::hmovsx(sw,x)
Sign extend even (lower) elements.
t_simd_base::hmul_u(sw,x)
Product of neighboring unsigned subwords. Stupid emulation due to absence of a matching instruction/operation.
t_simd_base::hmul_s(sw,x)
Product of neighboring signed subwords. Stupid emulation due to absence of a matching instruction/operation.

Shuffle operations

t_simd_base::butterfly(sw,x,m)
One butterfly step/stage. sw: 0..ld_bits-1 m & a_bfly_mask[sw] should be == m
t_simd_base::bit_index_complement(x,k)
See ARM System Developer's Guide, 7.6.2 "Bit Permutations" as used in the loop of general_reverse_bits.
t_simd_base::bit_index_swap(x,j,k)
See ARM System Developer's Guide, 7.6.2 "Bit Permutations"
See shuffle, unshuffle
t_simd_base::bit_index_swap_complement(x,j,k)
See ARM System Developer's Guide, 7.6.2 "Bit Permutations"
t_simd_base::bit_index_ror(x,rot,ofs,field)
Rotate an bit index field to the right by rot by executing the necessary calls to bit_index_swap. q+field+ofs=ld_bits q: upper bit indexes (unchanged) field: size of the affected bit string ofs: lower bit indexes (unchanged)
t_simd_base::transpose(x,ld_fields,ld_col,ld_row)
Transpose bit matrixes. ld_fields: width of the bit fields ld_col: ld(bit columns) ld_row: ld(bit rows)
t_simd_base::shuffle_power(x,sw1,sw2,pwr)
pwr times shuffle(x, sw1, sw2)
See Hacker's Delight, 7.6/7.8 "Rearrangements and Index Transformations"
t_simd_base::unshuffle_power(x,sw1,sw2,pwr)
pwr times unshuffle(x, sw1, sw2)
See Hacker's Delight, 7.6/7.8 "Rearrangements and Index Transformations"
t_simd_base::general_reverse_bits(x,k)
Swap all subwords of given levels.
See Hacker's Delight, 7.1 "Generalized Bit Reversal"
k: set of t_subword, i.e. one bit per subword size. This is a candidate for a new instruction.
t_simd_base::bswap(x)
Exchange byte order.
This can be expressed in assembler:
bits = 8: n/a
bits = 16: "xchg al, ah" or "rol ax, 16"
bits = 32: "bswap eax"
bits = 64: "bswap rax"
bits = 128: "xchg rax, rdx; bswap rax; bswap rdx"
t_simd_base::prim_swap(x,m)
Swap by primitives. Generalization of general_reverse_bits.
See "Matters Computational" by Joerg Arndt, "A restricted method"
See Hacker's Delight, 7.1 "Generalized Bit Reversal"
t_simd_base::shuffle(x,sw1,sw2)
Shuffle/zip/interlace entities.
See Hacker's Delight, 7.2 "Shuffling Bits"
sw1: log_2(subword_length): entities to move
sw2: log_2(word_length): moving area
0 ≤ sw1 < sw2 ≤ ld_bits
Example: sw1=2, sw2=5: Shuffle nibbles in dword = shuffle_power(x, sw1, sw2, 1)
t_simd_base::unshuffle(x,sw1,sw2)
Unshuffle/unzip/uninterlace entities.
See Hacker's Delight, 7.2 "Shuffling Bits"
sw1: log_2(subword_length): entities to move
sw2: log_2(word_length): moving area
0 ≤ sw1 < sw2 ≤ ld_bits
Example: sw1=0, sw2=3: Unshuffle bits in bytes = unshuffle_power(x, sw1, sw2, 1)

CEF operations (via aux object)

t_simd_base::compress_flip_right(sw,x,m)
t_simd_base::expand_flip_right(sw,x,m)
t_simd_base::compress_flip_left(sw,x,m)
t_simd_base::expand_flip_left(sw,x,m)

CE operations (via aux object)

t_simd_base::compress_right(sw,x,m)
t_simd_base::expand_right(sw,x,m)
t_simd_base::compress_left(sw,x,m)
t_simd_base::expand_left(sw,x,m)

`t_bfly_base`

This structure is used to hold the configuration of butterfly-based operations as well as compress and expand.

t_bfly_base::bfly(x)
Apply butterfly configured network on x
t_bfly_base::ibfly(x)
Apply inverse butterfly configured network on x
t_bfly_base::bfly_parity()
Return the parity of a permutation given by a butterfly network. This is false for even parity and true for odd parity.

`t_cef_base`

Compress/Expand flip left/right.

t_cef_base::gen_cef_right(sw,m)
Scatter/gather-flip, compress/expand+flip. Generate configuration for [inverse] butterfly network. To compress use ibfly. To expand use bfly.
t_cef_base::gen_cef_left(sw,m)
Scatter/gather-flip, compress/expand+flip. Generate configuration for [inverse] butterfly network. To compress use ibfly. To expand use bfly.
t_cef_base::gen_right(sw,m)
t_cef_base::gen_left(sw,m)
t_cef_base::compress(x)
t_cef_base::expand(x)

`t_ce_right_base`

Compress/Expand right, rest filled with 0.

t_ce_right_base::gen_ce_right(sw,m)
See Hacker's Delight, 7.4 "Compress, or Generalized Extract"
See Hacker's Delight, 7.5 "Expand, or Generalized Insert"
To compress use apply_compress_right. To expand use apply_expand_right.
t_ce_right_base::apply_compress_right(x)
See Hacker's Delight, 7.4 "Compress, or Generalized Extract"
This should be configured by gen_ce_right.
t_ce_right_base::apply_expand_right(x)
See Hacker's Delight, 7.4 "Compress, or Generalized Extract"
See Hacker's Delight, 7.5 "Expand, or Generalized Insert"
This should be configured by gen_ce_right.
(a & b) | (~a & c) See ((b ^ c) & a) ^ c
t_ce_right_base::gen(sw,m)
t_ce_right_base::compress(x)
t_ce_right_base::expand(x)

`t_ce_left_base`

Compress/Expand left, rest filled with 0.

t_ce_left_base::gen_ce_left(sw,m)
See Hacker's Delight, 7.4 "Compress, or Generalized Extract"
See Hacker's Delight, 7.5 "Expand, or Generalized Insert"
To compress use apply_compress_left. To expand use apply_expand_left.
t_ce_left_base::apply_compress_left(x)
See Hacker's Delight, 7.4 "Compress, or Generalized Extract"
This should be configured by gen_ce_left.
t_ce_left_base::apply_expand_left(x)
See Hacker's Delight, 7.4 "Compress, or Generalized Extract"
See Hacker's Delight, 7.5 "Expand, or Generalized Insert"
This should be configured by gen_ce_left.
(a & b) | (~a & c) See ((b ^ c) & a) ^ c
t_ce_left_base::gen(sw,m)
t_ce_left_base::compress(x)
t_ce_left_base::expand(x)

`t_vrot_base`

Faster variant of t_simd_base::vrol/vror when using the same rotate counts repeatedly.

t_vrot_base::gen_vrol(sw,rot)
Field variable rotate. Generate configuration for [inverse] butterfly network. Simulate rolc for every subword of the masks.
t_vrot_base::vrol(x)
t_vrot_base::vror(x)

Index

Here is a list of words, types and functions and where you find them.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

abs, t_simd_base::
abs_diff_s, t_simd_base::
abs_diff_u, t_simd_base::
add1, t_simd_base::
add1_sat_s, t_simd_base::
add1_sat_u, t_simd_base::
add, t_simd_base::
add_sat_s, t_simd_base::
add_sat_u, t_simd_base::
addc, t_simd_base::
addc_sat_s, t_simd_base::
addc_sat_u, t_simd_base::
apply_compress_left, t_ce_left_base::
apply_compress_right, t_ce_right_base::
apply_expand_left, t_ce_left_base::
apply_expand_right, t_ce_right_base::
avgc_s, t_simd_base::
avgd_u, t_simd_base::
avgf_s, t_simd_base::
avgu_u, t_simd_base::
bfly, t_bfly_base::
bfly_parity, t_bfly_base::
bit_index_complement, t_simd_base::
bit_index_ror, t_simd_base::
bit_index_swap, t_simd_base::
bit_index_swap_complement, t_simd_base::
bit_permute_step2, t_bits_base::
bit_permute_step2_rot, t_bits_base::
bit_permute_step, t_bits_base::
bit_permute_step_rot, t_bits_base::
bit_permute_step_simple, t_bits_base::
bswap, t_simd_base::
butterfly, t_simd_base::
ccl_mul, t_bits_base::
ccl_mul_ex, t_bits_base::
ccl_mul_inv, t_bits_base::
ccl_mul_inv_ex, t_bits_base::
ccl_power, t_bits_base::
ccl_power_ex, t_bits_base::
cl_mat_inv, t_bits_base::
cl_mul, t_bits_base::
cl_mul_inv, t_bits_base::
cl_power, t_bits_base::
compress, t_cef_base::
compress, t_ce_left_base::
compress, t_ce_right_base::
compress_flip_left, t_simd_base::
compress_flip_right, t_simd_base::
compress_left, t_simd_base::
compress_right, t_simd_base::
constant, t_simd_base::
contains_0, t_simd_base::
dec, t_simd_base::
do_rold, t_simd_base::
do_rord, t_simd_base::
do_vrold, t_simd_base::
do_vrord, t_simd_base::
element_mask, t_simd_base::
eq0, t_simd_base::
eq0h, t_simd_base::
eq0l, t_simd_base::
eq, t_simd_base::
eqh, t_simd_base::
eql, t_simd_base::
expand, t_cef_base::
expand, t_ce_left_base::
expand, t_ce_right_base::
expand_flip_left, t_simd_base::
expand_flip_right, t_simd_base::
expand_left, t_simd_base::
expand_right, t_simd_base::
extract, t_simd_base::
gen, t_ce_left_base::
gen, t_ce_right_base::
general_reverse_bits, t_simd_base::
gen_cef_left, t_cef_base::
gen_cef_right, t_cef_base::
gen_ce_left, t_ce_left_base::
gen_ce_right, t_ce_right_base::
gen_left, t_cef_base::
gen_right, t_cef_base::
gen_vrol, t_vrot_base::
ge_s, t_simd_base::
ge_s, t_simd_base::
ge_s, t_simd_base::
ge_u, t_simd_base::
ge_u, t_simd_base::
ge_u, t_simd_base::
gray_code, t_bits_base::
gray_code, t_simd_base::
gt_s, t_simd_base::
gth_s, t_simd_base::
gtl_s, t_simd_base::
gt_u, t_simd_base::
gth_u, t_simd_base::
gtl_u, t_simd_base::
hadd_s, t_simd_base::
hadd_u, t_simd_base::
hand, t_simd_base::
havgd_u, t_simd_base::
havgu_u, t_simd_base::
hmax_s, t_simd_base::
hmax_u, t_simd_base::
hmin_s, t_simd_base::
hmin_u, t_simd_base::
hmovsx, t_simd_base::
hmovzx, t_simd_base::
hmul_s, t_simd_base::
hmul_u, t_simd_base::
hor, t_simd_base::
hsub_s, t_simd_base::
hsub_u, t_simd_base::
hxor, t_simd_base::
ibfly, t_bfly_base::
implant, t_simd_base::
inc, t_simd_base::
inv_gray_code, t_bits_base::
inv_gray_code, t_simd_base::
isign, t_bits_base::
is_contiguous_1bits, t_bits_base::
is_parity_odd, t_bits_base::
is_parity_oddh, t_simd_base::
is_parity_oddl, t_simd_base::
le_s, t_simd_base::
leh_s, t_simd_base::
lel_s, t_simd_base::
le_u, t_simd_base::
leh_u, t_simd_base::
lel_u, t_simd_base::
lt0_s, t_simd_base::
lt0h_s, t_simd_base::
lt0l_s, t_simd_base::
lt_s, t_simd_base::
lth_s, t_simd_base::
ltl_s, t_simd_base::
lt_u, t_simd_base::
lth_u, t_simd_base::
ltl_u, t_simd_base::
mask_ex, t_bits_base::
max0_s, t_simd_base::
max_s, t_simd_base::
max_u, t_simd_base::
min_s, t_simd_base::
min_u, t_simd_base::
mul_inv, t_bits_base::
mul_lo, t_simd_base::
M_abs
M_abs_diff
M_gcd
M_max
M_min
M_odd
M_sgn
M_sgn_diff
nabs, t_simd_base::
ne0, t_simd_base::
ne0h, t_simd_base::
ne0l, t_simd_base::
ne, t_simd_base::
neh, t_simd_base::
nel, t_simd_base::
neg, t_simd_base::
neg_sat_s, t_simd_base::
nr_0bits, t_bits_base::
nr_0bits, t_simd_base::
nr_1bits, t_bits_base::
nr_1bits, t_simd_base::
nr_leading_0bits, t_bits_base::
nr_leading_0bits, t_simd_base::
nr_leading_1bits, t_bits_base::
nr_leading_1bits, t_simd_base::
nr_trailing_0bits, t_bits_base::
nr_trailing_0bits, t_simd_base::
nr_trailing_1bits, t_bits_base::
nr_trailing_1bits, t_simd_base::
nr_trailing_ne0, t_simd_base::
odd, t_simd_base::
overflow_add1_s, t_simd_base::
overflow_add1_u, t_simd_base::
overflow_add_s, t_simd_base::
overflow_add_u, t_simd_base::
overflow_addc_s, t_simd_base::
overflow_addc_u, t_simd_base::
overflow_sub1_s, t_simd_base::
overflow_sub1_u, t_simd_base::
overflow_sub_s, t_simd_base::
overflow_sub_u, t_simd_base::
overflow_subc_s, t_simd_base::
overflow_subc_u, t_simd_base::
prim_swap, t_simd_base::
rol, t_bits_base::
rol, t_simd_base::
rolc, t_simd_base::
rold, t_simd_base::
rol_ex, t_bits_base::
ror, t_bits_base::
ror, t_simd_base::
rorc, t_simd_base::
rord, t_simd_base::
ror_ex, t_bits_base::
sal, t_simd_base::
sal_ex, t_bits_base::
sal_sat_s, t_simd_base::
sal_sat_u, t_simd_base::
sar, t_simd_base::
sar_ex, t_bits_base::
sar_fast, t_bits_base::
sar_safe, t_bits_base::
sgn, t_simd_base::
sgn_diff_s, t_simd_base::
sgn_diff_u, t_simd_base::
sgn_sat, t_simd_base::
shl1, t_simd_base::
shl1_ex, t_bits_base::
shl1_sat_s, t_simd_base::
shl1_sat_u, t_simd_base::
shl, t_simd_base::
shl_ex, t_bits_base::
shl_fast, t_bits_base::
shl_safe, t_bits_base::
shl_safe, t_bits_base::
shl_sat_s, t_simd_base::
shl_sat_u, t_simd_base::
shr1, t_simd_base::
shr1_ex, t_bits_base::
shr, t_simd_base::
shr_ex, t_bits_base::
shr_fast, t_bits_base::
shr_safe, t_bits_base::
shuffle, t_simd_base::
shuffle_power, t_simd_base::
sign_extend, t_bits_base::
smul_hi_s, t_simd_base::
smul_hi_u, t_simd_base::
smul_lo, t_simd_base::
sub1, t_simd_base::
sub1_sat_s, t_simd_base::
sub1_sat_u, t_simd_base::
sub, t_simd_base::
sub_sat_s, t_simd_base::
sub_sat_u, t_simd_base::
subc, t_simd_base::
subc_sat_s, t_simd_base::
subc_sat_u, t_simd_base::
tbm, t_bits_base::
tbm, t_simd_base::
transpose, t_simd_base::
t_bfly_base
t_bits_base
t_cef_base
t_ce_left_base
t_ce_right_base
t_round_mode
t_simd_128
t_simd_16
t_simd_32
t_simd_64
t_simd_8
t_simd_base
t_subword
t_tbm_mode
t_vrot_base
unshuffle, t_simd_base::
unshuffle_power, t_simd_base::
vrol, t_simd_base::
vrol, t_vrot_base::
vrolc, t_simd_base::
vrold, t_simd_base::
vror, t_simd_base::
vror, t_vrot_base::
vrorc, t_simd_base::
vrord, t_simd_base::
vsal, t_simd_base::
vsal_sat_s, t_simd_base::
vsal_sat_u, t_simd_base::
vsar, t_simd_base::
vshl1, t_simd_base::
vshl1_sat_s, t_simd_base::
vshl1_sat_u, t_simd_base::
vshl, t_simd_base::
vshl_sat_s, t_simd_base::
vshl_sat_u, t_simd_base::
vshr1, t_simd_base::
vshr, t_simd_base::
xsign, t_bits_base::

You may bookmark this page as http://programming.sirrida.de?swar.html.
Last change: 2016-07-02

SIMD/SWAR

Contents

Intro

Overview

Concrete classes

General functions which ought to be in std::

t_subword

t_tbm_mode

t_round_mode

t_bits_base

Rotating

Shifting

Counting bits

Miscellaneous

Permuting

Multiplication

Carry-less multiplication

Cyclic carry-less multiplication

Operations on low bits with specified bit count

t_simd_base

Basic functions

Count bits/elements

Math

Comparisons

Shift by common amount

Rotate by common amount

Shift by fieldwise specified amount

Rotate by fieldwise specified amount

Special functions

Horizontal math

Shuffle operations

CEF operations (via aux object)

CE operations (via aux object)

t_bfly_base

t_cef_base

t_ce_right_base

t_ce_left_base

t_vrot_base

Index

General functions which ought to be in `std::`

`t_subword`

`t_tbm_mode`

`t_round_mode`

`t_bits_base`

`t_simd_base`

`t_bfly_base`

`t_cef_base`

`t_ce_right_base`

`t_ce_left_base`

`t_vrot_base`