Best method found: Reordered Beneš network (about 40 cycles on superscalar processors):
x = bit_permute_step(x, 0x44104014, 1);  // Butterfly, stage 0
x = bit_permute_step(x, 0x07040804, 4);  // Butterfly, stage 2
x = bit_permute_step(x, 0x32012200, 2);  // Butterfly, stage 1
x = bit_permute_step(x, 0x00001bdb, 16);  // Butterfly, stage 4
x = bit_permute_step(x, 0x00660026, 8);  // Butterfly, stage 3
x = bit_permute_step(x, 0x01303231, 2);  // Butterfly, stage 1
x = bit_permute_step(x, 0x03030204, 4);  // Butterfly, stage 2
x = bit_permute_step(x, 0x51041445, 1);  // Butterfly, stage 0
See the documentation for bit_permute_step, bit_permute_step_simple, and rol.
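For readers without the linked documentation at hand, the following is a minimal sketch of how the generated code could be compiled and run. It assumes bit_permute_step is the usual "delta swap" on 32-bit words (exchange the bits selected by the mask m with the bits shift positions to their left); please check this against the linked documentation. The wrapper name permute_example is hypothetical and only serves to bundle the eight steps.

```c
#include <stdint.h>

/* Assumed definition (delta swap): swap each bit selected by m
   with the bit `shift` positions above it. */
static uint32_t bit_permute_step(uint32_t x, uint32_t m, unsigned shift) {
    uint32_t t = ((x >> shift) ^ x) & m;  /* bits that differ within each pair */
    return (x ^ t) ^ (t << shift);        /* flip both members of those pairs */
}

/* Hypothetical wrapper around the eight generated steps. */
static uint32_t permute_example(uint32_t x) {
    x = bit_permute_step(x, 0x44104014, 1);   /* Butterfly, stage 0 */
    x = bit_permute_step(x, 0x07040804, 4);   /* Butterfly, stage 2 */
    x = bit_permute_step(x, 0x32012200, 2);   /* Butterfly, stage 1 */
    x = bit_permute_step(x, 0x00001bdb, 16);  /* Butterfly, stage 4 */
    x = bit_permute_step(x, 0x00660026, 8);   /* Butterfly, stage 3 */
    x = bit_permute_step(x, 0x01303231, 2);   /* Butterfly, stage 1 */
    x = bit_permute_step(x, 0x03030204, 4);   /* Butterfly, stage 2 */
    x = bit_permute_step(x, 0x51041445, 1);   /* Butterfly, stage 0 */
    return x;
}
```

Under this assumed definition, every step only exchanges bit positions, so the whole sequence maps each single-bit input to a single-bit output. Feeding in the 32 one-bit words is therefore a quick way to read off the resulting permutation.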
pext and pdep can be emulated with compress_right and expand_right.
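To illustrate what such an emulation has to compute, here is a naive bit-by-bit reference sketch of the pext and pdep semantics for 32-bit words. It is not the implementation behind the linked compress_right and expand_right routines, which may take different parameters and are presumably much faster; the names compress_right_ref and expand_right_ref are hypothetical.

```c
#include <stdint.h>

/* pext semantics: gather the bits of x selected by m and pack them
   into the low end of the result, preserving their order. */
static uint32_t compress_right_ref(uint32_t x, uint32_t m) {
    uint32_t result = 0;
    uint32_t out_bit = 1;                 /* next free position in the result */
    while (m != 0) {
        uint32_t lowest = m & (0u - m);   /* isolate lowest set mask bit */
        if (x & lowest) {
            result |= out_bit;
        }
        out_bit <<= 1;
        m &= m - 1;                       /* clear that mask bit */
    }
    return result;
}

/* pdep semantics: scatter the low bits of x to the positions
   selected by m, preserving their order. */
static uint32_t expand_right_ref(uint32_t x, uint32_t m) {
    uint32_t result = 0;
    while (m != 0) {
        uint32_t lowest = m & (0u - m);   /* isolate lowest set mask bit */
        if (x & 1u) {
            result |= lowest;
        }
        x >>= 1;
        m &= m - 1;                       /* clear that mask bit */
    }
    return result;
}
```

For example, compress_right_ref(0x12345678, 0x0000FF00) extracts the second-lowest byte and yields 0x56, and expand_right_ref(0x56, 0x0000FF00) places it back at 0x5600.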
This result is not necessarily the best possible,
but several competing methods were tried and compared.
The cycle counts given are only estimates and may vary significantly
depending on the processor used.
Thus, the selected method might not be the best one for your application.
You can, however, influence the choice by using the options above.
See also some notes on the inner workings.
There is an even better permutation code calculator, calcperm.*,
which can be used for various word sizes
(Pascal and C++ sources).
Error reports, comments or questions? E-mail: info@sirrida.de