Best method found: Bit group moving (about 41 cycles on superscalar processors):
x = (x & 0x04000000) | ((x & 0x10008000) << 1) | ((x & 0x00081000) << 3) | rol(x & 0x40800001, 7) | ((x & 0x00000008) << 8) | ((x & 0x00040010) << 9) | rol(x & 0x2a004000, 11) | ((x & 0x00000080) << 12) | ((x & 0x00000020) << 13) | rol(x & 0x00100200, 14) | ((x & 0x00000040) << 15) | ((x & 0x00000100) << 16) | ((x & 0x00002000) << 18) | rol(x & 0x00400002, 19) | ((x & 0x00000400) >> 9) | ((x & 0x00000800) >> 8) | ((x & 0x01200000) >> 7) | ((x & 0x00010000) >> 6) | ((x & 0x00020000) >> 5) | ((x & 0x80000000) >> 3) | ((x & 0x00000004) >> 2);
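The expression above only compiles once `rol` is defined and `x` is a 32-bit unsigned value. A minimal sketch of one way to wrap it follows; the function names `permute_bits` and `is_bit_permutation` are hypothetical, and `rol` is assumed here to mean a 32-bit rotate left. The check confirms that the 21 masks cover every source bit once and that no two bits land on the same target.

```c
#include <stdint.h>

/* Assumed meaning of rol: rotate a 32-bit value left by n bits. */
static uint32_t rol(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

/* Hypothetical wrapper around the generated expression. */
uint32_t permute_bits(uint32_t x)
{
    return (x & 0x04000000)
         | ((x & 0x10008000) << 1)
         | ((x & 0x00081000) << 3)
         | rol(x & 0x40800001, 7)
         | ((x & 0x00000008) << 8)
         | ((x & 0x00040010) << 9)
         | rol(x & 0x2a004000, 11)
         | ((x & 0x00000080) << 12)
         | ((x & 0x00000020) << 13)
         | rol(x & 0x00100200, 14)
         | ((x & 0x00000040) << 15)
         | ((x & 0x00000100) << 16)
         | ((x & 0x00002000) << 18)
         | rol(x & 0x00400002, 19)
         | ((x & 0x00000400) >> 9)
         | ((x & 0x00000800) >> 8)
         | ((x & 0x01200000) >> 7)
         | ((x & 0x00010000) >> 6)
         | ((x & 0x00020000) >> 5)
         | ((x & 0x80000000) >> 3)
         | ((x & 0x00000004) >> 2);
}

/* Returns 1 if every single-bit input maps to a distinct single-bit output. */
int is_bit_permutation(void)
{
    uint32_t seen = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t y = permute_bits((uint32_t)1 << i);
        if (y == 0 || (y & (y - 1)) != 0)  /* not exactly one bit set */
            return 0;
        if (seen & y)                      /* target bit already taken */
            return 0;
        seen |= y;
    }
    return seen == 0xFFFFFFFFu;
}
```

Masks that wrap around the top of the word use `rol`; all other groups use a plain shift, because none of their bits cross the word boundary.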
See documentation to
pext and pdep can be emulated with compress_right and expand_right.
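For illustration, a naive bit-by-bit sketch of the two emulation routines is given here. The names `compress_right` and `expand_right` come from the note above, but the bodies are an assumption written for clarity; they are not the author's implementation and are far slower than a hardware pext/pdep.

```c
#include <stdint.h>

/* Like pext: gather the bits of x selected by mask into the low-order bits. */
uint32_t compress_right(uint32_t x, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t out_bit = 1;                      /* next low-order target bit */
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask);  /* lowest set bit of mask */
        if (x & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                      /* clear that mask bit */
    }
    return result;
}

/* Like pdep: scatter the low-order bits of x to the positions set in mask. */
uint32_t expand_right(uint32_t x, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t in_bit = 1;                       /* next low-order source bit */
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask);  /* lowest set bit of mask */
        if (x & in_bit)
            result |= lowest;
        in_bit <<= 1;
        mask &= mask - 1;                      /* clear that mask bit */
    }
    return result;
}
```

The two routines are inverse on the masked bits: `expand_right(compress_right(x, m), m)` equals `x & m` for any `x` and `m`.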
This result is not necessarily the best possible, but several methods were tried and compared against each other. The cycle counts are only estimates and may vary significantly depending on the processor used, so the selected method might not be the best one for your application. You can, however, influence the choice with the options above.
See also some notes on the inner workings.
There is an even better calculator, calcperm.*, which can be used for various word sizes (Pascal and C++ sources).
Error reports, comments or questions? E-mail: email@example.com