My proposals for new x86 instructions


I have posted most of these topics in Intel's AVX and CPU Instructions forum as well; this is my view of the discussion.

Comments or questions? E-mail: info@sirrida.de


Push effective address

Discussed in the forum.

Let me propose a new command: Push effective address: pushea.

For example in 32 bit mode "pushea [ea]" is meant as a replacement for the combination "lea eax,[ea]" / "push eax" but without trashing eax.

A possible encoding could be f2 ff /6.
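
For illustration, this is the code pattern pushea serves (my example): passing the address of a stack object to a function costs a scratch register today.

void f(int *p) { *p += 1; }  /* some callee taking a pointer */

void caller(void) {
  int x = 42;
  f(&x);  /* pushing &x: today "lea eax,[esp+...]" / "push eax",
             with the proposal a single "pushea [esp+...]" */
  }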


Mov small constant

Discussed in the forum.

Let me propose a replacement encoding for loading tiny values via mov: "mov reg,const".

By utilizing the unused and reserved form "lea reg,reg" (8d 11rrrxxx) I propose the following meaning:

rrr is the target register and xxx shall be encoded as

xxx  Meaning
000  A signed byte follows, i.e. -128…127
001  1
010  2
011  Maximum signed value (for 32 bit: 0x7fffffff)
100  A signed word follows, i.e. -32768…32767
101  Minimum signed value (for 32 bit: 0x80000000)
110  -2
111  -1

Examples (derived from the table above):

  "mov eax,1":    8d c1        (2 bytes)
  "mov edx,-1":   8d d7        (2 bytes)
  "mov eax,100":  8d c0 64     (3 bytes)
  "mov eax,1000": 8d c4 e8 03  (4 bytes)

All these encodings are better than b8 xx xx xx xx (5 bytes).

Needless to say, the usual prefixes for word and qword should work as well. Example:

"mov rax,-2": 48 8d fe (3 bytes)

instead of

48 c7 c0 fe ff ff ff (7 bytes)
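
To make the selection of the forms concrete, here is a hypothetical encoder in C for the 32 bit case without prefixes (my illustration only, not part of the proposal):

#include <stdint.h>

/* Emit the shortest proposed "mov reg,const" encoding (8d 11rrrxxx) into
   out and return its length in bytes; 0 means: fall back to b8+reg imm32. */
int encode_mov_const(uint8_t out[4], unsigned reg /* rrr, 0..7 */, int32_t value) {
  static const int32_t fixed[8] = {
    0, 1, 2, 0x7fffffff, 0, INT32_MIN, -2, -1};  /* only xxx = 1,2,3,5,6,7 used */
  for (unsigned xxx = 1; xxx < 8; xxx++)
    if (xxx != 4 && value == fixed[xxx]) {  /* 2 byte constant forms */
      out[0] = 0x8d; out[1] = (uint8_t)(0xc0 | (reg << 3) | xxx);
      return 2;
      }
  if (-128 <= value && value <= 127) {  /* xxx = 000: a signed byte follows */
    out[0] = 0x8d; out[1] = (uint8_t)(0xc0 | (reg << 3));
    out[2] = (uint8_t)value;
    return 3;
    }
  if (-32768 <= value && value <= 32767) {  /* xxx = 100: a signed word follows */
    out[0] = 0x8d; out[1] = (uint8_t)(0xc4 | (reg << 3));
    out[2] = (uint8_t)value; out[3] = (uint8_t)((uint32_t)value >> 8);
    return 4;
    }
  return 0;
  }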

The need for a denser encoding seems real, since I have sometimes seen replacements for assigning small constants, such as "xor eax,eax" for 0 or "or eax,-1" for -1.

The special case of 0 is very common and (almost?) always treated specially. In my experience assignments of small constants to a register occur quite often.

Nearly all arithmetic commands have a special encoding for 8 bit immediate values: and, or, xor, cmp, add, adc, sub, sbb - and there is inc/dec…

My proposal does not break compatibility since it uses a coding which currently traps as an illegal opcode.


Extended setcc

Discussed in the forum, slightly modified.

Let me propose an extension to the command setcc.

Often we need booleans not as bytes with the values 0 and 1…

By utilizing the 3 currently unused bits I propose the following meaning:

Bit    Meaning
#2     Size: 0=byte, 1=dword
#1…#0  Result form, see the following table:

Bits  false        true  Keyword  Comments
00    0            1     -        Classical case as in C or Pascal
01    0            -1    mask     For e.g. bit masks
10    -1           1     sign     Sign
11    (no change)  0     zero     Keep or zero out
11    NC           C     -        Reset or set carry flag

As a special case, for the combination 111 together with the coding for target register esp, only the carry flag shall be set accordingly.
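
As a plain C model of the four result forms (dword size; cc stands for the evaluated condition; the function names are mine, not proposed syntax):

#include <stdbool.h>
#include <stdint.h>

uint32_t setcc_classic(bool cc)            { return cc ? 1u : 0u; }   // 00: 0 / 1
uint32_t setcc_mask(bool cc)               { return cc ? ~0u : 0u; }  // 01: 0 / -1
uint32_t setcc_sign(bool cc)               { return cc ? 1u : ~0u; }  // 10: -1 / 1
uint32_t setcc_zero(bool cc, uint32_t old) { return cc ? 0u : old; }  // 11: keep / 0
// The esp-coded 111 form instead sets the carry flag: CF = cc.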

Needless to say, the usual prefixes for word and qword should work as well.

An encoding as 0f 9x /y would be OK if the superfluous bits (the reg field) could be assumed to be 0, but they have not (yet?) been documented as reserved…

An encoding without this problem could therefore be this: f2 0f 9x /y.

I give some more or less silly and untested examples; however, I am not yet sure about the concrete syntax of the new forms. See an idea below.

There are a lot of cases where a programmer defines a boolean type e.g. as int:

typedef int bool;

In this case the statement "bool b = (x==17)" can be expressed by

cmp [x],17
sete dword ptr [b]  // the form false=>0, true=>1; dword

as a replacement for

cmp [x],17
sete al
movzx eax,al
mov dword ptr [b],eax

The Windows API typically uses these booleans named BOOL.


The sgn function:

int sgn(int x) {
  if (x<0)
    return -1;
  else if (x>0)
    return 1;
  else
    return 0;
  }

can be written as

test eax,eax
setg eax,sign  // the form false=>-1, true=>1; dword
setz eax,zero  // the form false=>unchanged, true=>0; dword

as a replacement for

cdq
neg eax
shr eax,31
or eax,edx

The Visual Basic boolean data type as well as Delphi's longbool/wordbool/bytebool defines false as 0 and true as -1. These boolean values can simply be generated with e.g.

cmp …
setcc eax,mask  // the form false=>0, true=>-1; dword

as a replacement for

cmp …
setcc al
movzx eax,al
neg eax

If the value must be written to memory:

cmp …
setcc dword ptr [x],mask  // the form false=>0, true=>-1; dword

as a replacement for

cmp …
setcc al
movzx eax,al
neg eax
mov dword ptr [x],eax

The fragment

q = (q << 1) + (x > 17);

can be written as

cmp [x],17
setg  // the form where only the carry flag is set
adc eax,eax  // assuming that q is in register eax

or

cmp [x],17
setg edx  // the form false=>0, true=>1; dword
lea eax,[eax*2+edx]  // assuming that q is in register eax

as a replacement for

cmp [x],17
setg dl
movzx edx,dl
lea eax,[eax*2+edx]  // assuming that q is in register eax

The signed average, i.e. (a+b)/2, can be written as

add eax,edx
setl edx,mask  // the form false=>0, true=>-1; dword
sub eax,edx
rcr eax,1

as a replacement for

xor ecx,ecx
add eax,edx
setnl cl
dec ecx
sub eax,ecx
rcr eax,1

Gather: rep xlat

Posted as a comment.

It would be fine if at least the gather command were there; in my experience the scatter command is not used as often.

Programming such endless chains of insertxx commands is ugly, stupid, slow, space-inefficient and really should be unnecessary. All these commands must be decoded and executed, and they needlessly fill the memory and μop caches. Gather commands could at least generate the flood of μops by themselves, and the out-of-order mechanism will take care of the proper interleaving with other commands which can be executed in parallel (superscalar). I know that such a command will have a long latency.

…and things get even worse if one wants to gather small entities such as bytes. I have often missed commands such as "rep xlat" ("jecxz end; redo: lodsb; xlat; stosb; loop redo; end:" with al preserved). If gather were there, it could be simulated with some code around "gather.byte xmm1,[ea]" where the bytes of xmm1 hold unsigned offsets to ea (the effective memory address).
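
In C the wished-for "rep xlat" behaves like this (a sketch of the semantics just described; the function name is mine):

#include <stddef.h>
#include <stdint.h>

/* Translate count bytes from src to dst through a 256-entry table
   (src ~ esi, table ~ ebx, dst ~ edi, count ~ ecx); this is the loop
   of the quoted instruction sequence, but with al preserved. */
void rep_xlat(uint8_t *dst, const uint8_t *src,
              const uint8_t table[256], size_t count) {
  for (size_t i = 0; i < count; i++)
    dst[i] = table[src[i]];
  }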

By the way: I have never understood why e.g. the handy commands "rep stosb", "rep movsb", "jecxz", "loop", "enter" (level 0) and "leave" can be outperformed by replacement code at all. Modern CPUs ought to be smart enough to execute such dedicated commands much faster. It is a shame that they do not!

Let us get concrete. Here is a follow-up in the forum:
I heartily wish the string commands "rep stosb", "rep movsb", "rep cmpsb" and "rep scasb" (yes, especially the byte operations) were so fast and smart that all the many replacement routines would be outperformed and thus obsoleted. See e.g. Agner Fog's subroutine library for examples of such routines and the necessary effort to choose the right one.

You will find my final proposal of the gather command in Gather and scatter with configurable factor.


Gather: Lab translate

Posted as a comment.

Here is a real-world usage example for byte/word scatter/gather commands:

Assuming an image source of L*a*b pixels, each 3*8 bit, we want to perform a gamma correction on L and a color correction (e.g. gray balance) on a and b.

For the gamma correction there is a byte array: unsigned char L_tab[256].
For the color correction there is a word array: unsigned short ab_tab[65536].

Now we face the first problem: AoS to SoA (array of structures to structure of arrays).
We might load 3 xmm registers and perform some pshufb and palignr commands. Here a variant of pshufb with at least 2 sources is sorely missing, such as AMD's new XOP command vpperm.
A byte gather command comes in handy here, and in this case AoS to SoA can be done in 3 commands if a 3 operand form is available. Here are the needed offsets for xmm usage:

L_idx:  db  0,  3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45
ab_idx: db  1,  2,  4,  5,  7,  8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23
        ;  25, 26, 28, 29, 31, 32, 34, 35, 37, 38, 40, 41, 43, 44, 46, 47

The next problem is the actual lookup for L and ab. With byte gather for L and word gather for ab this is easy.

Finally we want to transform SoA back to AoS. Here we obviously need scatter commands doing the reverse of AoS to SoA.
In this case we could emulate them by storing to temporary memory, using gather operations to swizzle, and finally storing to the target.

Here is a coding example using xmm registers. The extension to ymm or zmm is simple.

  movdqu xmm3,[L_idx]
  movdqu xmm4,[ab_idx]
  lea eax,[src]
  lea edx,[tgt]
  lea esi,[L_tab]
  lea edi,[ab_tab]
  ; ecx is assumed to hold the loop count (number of 16-pixel blocks)
@@loop:
  vgather.byte xmm0,xmm3,[eax]     ; AoS to SoA: load L values
  vgather.byte xmm1,xmm4,[eax]     ; AoS to SoA, load ab values, part 1
  vgather.byte xmm2,xmm4,[eax+24]  ; AoS to SoA, load ab values, part 2
  add eax,3*16
  gather.byte xmm0,[esi]  ; translate L
  gather.word xmm1,[edi]  ; translate ab, part 1
  gather.word xmm2,[edi]  ; translate ab, part 2
  vscatter.byte [edx],   xmm0,xmm3  ; SoA to AoS, store L values
  vscatter.byte [edx],   xmm1,xmm4  ; SoA to AoS, store ab values, part 1
  vscatter.byte [edx+24],xmm2,xmm4  ; SoA to AoS, store ab values, part 2
  add edx,3*16
  dec ecx
  jnz @@loop

The corresponding scalar code looks like this:

  lea eax,[src]
  lea edx,[tgt]
  lea esi,[L_tab]
  lea edi,[ab_tab]
@@loop:
  movzx ebx,byte ptr [eax]    ; load L value
  movzx ebp,word ptr [eax+1]  ; load a and b value
  add eax,3
  mov bl,[ebx+esi]    ; translate L
  mov bp,[edi+ebp*2]  ; translate ab (ab_tab entries are words)
  mov [edx],bl        ; store L value
  mov [edx+1],bp      ; store a and b value
  add edx,3
  dec ecx
  jnz @@loop
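
For comparison, the same loop in plain C (my sketch; taking a as the low and b as the high byte of the 16-bit table index matches the little-endian word load above):

#include <stddef.h>
#include <stdint.h>

/* Scalar reference of the L*a*b translation: 3 bytes per pixel,
   L translated through L_tab, the a/b pair through ab_tab. */
void lab_translate(uint8_t *dst, const uint8_t *src, size_t pixels,
                   const uint8_t L_tab[256], const uint16_t ab_tab[65536]) {
  for (size_t i = 0; i < pixels; i++) {
    uint16_t ab = (uint16_t)(src[3*i+1] | (src[3*i+2] << 8));  /* a = low, b = high byte */
    uint16_t t  = ab_tab[ab];        /* translate ab */
    dst[3*i]    = L_tab[src[3*i]];   /* translate and store L */
    dst[3*i+1]  = (uint8_t)t;        /* store a */
    dst[3*i+2]  = (uint8_t)(t >> 8); /* store b */
    }
  }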

You will find my final proposal of the gather command in Gather and scatter with configurable factor.


Gather and scatter with configurable factor

Here is my summary, as I see it, of the thread Request for extension to scatter/gather.

As an extension to Gather: rep xlat and Gather: Lab translate I would like to propose yet another variant of a gather instruction capable of fetching from a column of a table with an arbitrary width. The stride (table width in bytes) need not be a multiple of the element size.
The need arises because at least for bytes and words the possible indexes are too limited to perform the given task.
The indexes (line numbers) of the elements need not be ascending from 0 but shall be stored in the corresponding elements of an SSE or AVX register.
What we need is a scatter/gather command like this:

gather.word result, indexes, base_addr, stride

These are the parameters:

- result: an SSE/AVX register receiving the gathered elements
- indexes: an SSE/AVX register holding one index per result element
- base_addr: a memory operand whose effective address (EA) serves as the table base
- stride: the table width in bytes, taken from a general-purpose register

Example: gather.word ymm0, ymm1, [rsi+rbx*4+0x12345], ecx

The functionality (e.g. for words):

for all i do
  result.word[i] := word ptr [EA + indexes.word[i]*stride]
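
In plain C the same semantics read as follows (a sketch for 16 word elements, as in a ymm register; the function name is mine):

#include <stddef.h>
#include <stdint.h>

/* Behavioral model of the proposed gather.word: element i is loaded from
   EA + indexes[i]*stride; since stride is a free byte count, the loads
   need not be word-aligned. */
void gather_word(uint16_t result[16], const uint16_t indexes[16],
                 const uint8_t *base, size_t stride) {
  for (int i = 0; i < 16; i++) {
    const uint8_t *p = base + (size_t)indexes[i] * stride;
    result[i] = (uint16_t)(p[0] | (p[1] << 8));  /* little-endian word load */
    }
  }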

The stride parameter would be an extension to my original proposal of byte/word scatter/gather.

Other options may include

Similarly a scatter command could be defined as well.

As for the coding of the command: The AVX coding scheme allows for an extra byte where the stride register as well as 4 flags can be specified.


Bit permutations

I would like to see an instruction for bit permutations. Proposals of this kind are not new; you can find e.g. numerous articles on the web by Yedidya Hilewitz and Ruby B. Lee about butterfly operations and an implementation in hardware, e.g. here.

The syntax could be as follows:

advanced_bit_op target, source, mask, code

The arguments target and source shall be registers, mask a register or a memory address, and code an 8 bit constant.
This is perfect for AVX, i.e. VEX encoded instructions. If needed target and source can be merged into one register, e.g. for an SSE variant. Alternatively or additionally there can be an implementation for general-purpose registers.

The actual function shall be encoded in code. The heart of these functions shall be a butterfly operation. For future extensions this might not be true.

code is split into sw and op, each 4 bits.

sw is the binary logarithm (log2) of the subword size, e.g. 0=bit, 3=byte, 7=xmm register; values greater than usable shall be reserved and must not be used.

op is the function type to be performed; the highest bit of op selects between an operation and its inverse.
In my example encoding 0…4 are inverse butterfly operations and 8…c are the corresponding butterfly operations:

op   Meaning
0    compress_right
1    compress_left
2    compress_right_flip
3    compress_left_flip
4    Explicit inverse butterfly steps at levels sw and sw+1
5…7  Reserved
8    expand_right
9    expand_left
a    expand_right_flip
b    expand_left_flip
c    Explicit butterfly steps at levels sw+1 and sw
d…f  Reserved

For the explicit butterfly step operations (cases 4 and c) even values of sw suffice.
Since a butterfly step only needs half of the bits of mask as the configuration, two such configurations can be packed into mask.
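
To make the semantics of the compress/expand family concrete, here is a plain C model of compress_right for a 32-bit word (sw = 5); this is only a behavioral model, not the butterfly implementation, and it matches what Intel later shipped as PEXT (see below):

#include <stdint.h>

/* All bits of x selected by mask are moved to the right,
   keeping their relative order. */
uint32_t compress_right(uint32_t x, uint32_t mask) {
  uint32_t result = 0;
  int pos = 0;
  while (mask != 0) {
    uint32_t lowest = mask & (0u - mask);  /* isolate lowest set mask bit */
    if (x & lowest)
      result |= (uint32_t)1 << pos;
    pos++;
    mask &= mask - 1;  /* clear that mask bit */
    }
  return result;
  }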

The operations manageable by the [inverse] butterfly network and needing an expensive calculation for the configuration should hold the configuration in a cache where an entry can be identified by the mask, the direction (butterfly or inverse butterfly) and the subword size sw.

The configuration itself needs log2(register_size)*register_size/2 bits.
For SSE (XMM register) we therefore need 128 + 1 + 3 + 7*128/2 = 580 bits for one cache entry; for AVX (YMM register) it will be 256 + 1 + 4 + 8*256/2 = 1285 bits.

The configuration cache and both the butterfly network and the inverse one shall be present on each core. The machinery for calculating the configuration from the mask might exist once per die, possibly microprogrammed; it need not be very fast due to the cache. The cache need not be saved across thread switches and can even be shared by any thread due to its universal character. There can be more than one cache entry. Butterfly operations should run in 1 cycle.

This proposal seems to be partially obsolete now. In the document 319433-011 (June 2011) the instructions compress_right and expand_right are introduced (proposed for 2013, released about 2013-06) by Intel as "Haswell New Instructions" and are named PEXT and PDEP (BMI2 instruction set). They act on 32 and 64 bit "general purpose" registers, albeit without the possibility to specify sw (i.e. sw=5 for 32 bit registers and 6 for 64 bit ones).
The Intel Software Development Emulator ≥ 4.29 (dated 2011-07-01) is capable of emulating these instructions.


Bit permutation primitive

Even a stripped-down variant of the bit permutations instruction, only capable of performing a single permutation step, could greatly speed up most bit permutations and has the potential to overtake the lookup-table variants. This operation is also called a delta swap.
This would be the equivalent of the following routine (word is unsigned and has the size of a register):

word bit_permute_step(word source, word mask, int shift) {
  word t;
  t = ((source >> shift) ^ source) & mask;
  return (source ^ t) ^ (t << shift);
  }
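
For illustration (my example): with mask 0x55555555 and shift 1 this primitive swaps each even bit with its odd neighbor:

/* Swap adjacent bits: 0<->1, 2<->3, …, 30<->31 (word taken as 32 bits),
   using the bit_permute_step routine above. */
word swap_adjacent_bits(word x) {
  return bit_permute_step(x, 0x55555555, 1);
  }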

The parameter shift should not be restricted to powers of 2.
This routine effectively resembles the following routines:

For a word size of 32 bit at most 5 such instructions are necessary for a BPC permutation and at most 9 for an arbitrary bit permutation.
It would be nice to have one variant where the shift is a constant and another variant where it is drawn from a register.
Evidently this instruction would be useful for GPR, SSE, AVX2 and even wider future registers.

An enhanced variant of this instruction creates two results:

void bit_permute_step(word* arg1, word* arg2, word mask, int shift) {
  word t;

  t = ((*arg2 >> shift) ^ *arg1) & mask;
  *arg1 = *arg1 ^ t;
  *arg2 = *arg2 ^ (t << shift);
  }

If *arg1 and *arg2 are the same, this routine obviously reduces to the one above. As above, *arg1 and *arg2 should be registers, mask should be reg/mem, and shift should be a register or a constant.
I found this routine as macros in Linux' c2p_core.h and Ghostscript's gsutil.c as well as in Hacker's Delight (figures 7.4/7.8, "Straight-line code for transposing a 32x32-bit matrix"). Credits presumably go to Michael Kalms aka Scout, but it could well be that the routine is much older.

Finally, an even slightly more useful variant of these functions uses rotates instead of shifts and typically acts as a direct replacement. This modification lifts the restriction that no bits may be shifted out.

All those routines should be implementable to run in at most 2 cycles latency without using much die space.


Other ideas

Modified excerpt from a comment.

Here are other ideas for new instructions in no particular order:


You may bookmark this page as http://programming.sirrida.de?x86_proposals.html.
Last change: 2013-01-12