My proposals for new x86 instructions


I have posted most of these topics in Intel's AVX and CPU Instructions forum as well; this is my view of the discussion.

Comments or questions? E-mail: info@sirrida.de


Push effective address

Discussed in the forum.

Let me propose a new command: Push effective address: pushea.

For example in 32 bit mode "pushea [ea]" is meant as a replacement for the combination "lea eax,[ea]" / "push eax" but without trashing eax.

A possible encoding could be f2 ff /6.
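
For illustration, this is the code pattern pushea serves (my example): passing the address of a stack object to a function costs a scratch register today.

void f(int *p) { *p += 1; }  /* some callee taking a pointer */

void caller(void) {
  int x = 42;
  f(&x);  /* pushing &x: today "lea eax,[esp+...]" / "push eax",
             with the proposal a single "pushea [esp+...]" */
  }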


Mov small constant

Discussed in the forum.

Let me propose a replacement encoding for loading tiny values via mov: "mov reg,const".

By utilizing the unused and reserved form "lea reg,reg" (8d 11rrrxxx) I propose the following meaning:

rrr is the target register and xxx shall be encoded as

xxx  Meaning
000  A signed byte follows, i.e. -128…127
001  1
010  2
011  Maximum signed value (for 32 bit: 0x7fffffff)
100  A signed word follows, i.e. -32768…32767
101  Minimum signed value (for 32 bit: 0x80000000)
110  -2
111  -1

Examples (derived from the table above):

  "mov eax,1":    8d c1        (2 bytes)
  "mov edx,-1":   8d d7        (2 bytes)
  "mov eax,100":  8d c0 64     (3 bytes)
  "mov eax,1000": 8d c4 e8 03  (4 bytes)

All these encodings are better than b8 xx xx xx xx (5 bytes).

Needless to say, the usual prefixes for word and qword should work as well. Example:

"mov rax,-2": 48 8d fe (3 bytes)

instead of

48 c7 c0 fe ff ff ff (7 bytes)
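
To make the selection of the forms concrete, here is a hypothetical encoder in C for the 32 bit case without prefixes (my illustration only, not part of the proposal):

#include <stdint.h>

/* Emit the shortest proposed "mov reg,const" encoding (8d 11rrrxxx) into
   out and return its length in bytes; 0 means: fall back to b8+reg imm32. */
int encode_mov_const(uint8_t out[4], unsigned reg /* rrr, 0..7 */, int32_t value) {
  static const int32_t fixed[8] = {
    0, 1, 2, 0x7fffffff, 0, INT32_MIN, -2, -1};  /* only xxx = 1,2,3,5,6,7 used */
  for (unsigned xxx = 1; xxx < 8; xxx++)
    if (xxx != 4 && value == fixed[xxx]) {  /* 2 byte constant forms */
      out[0] = 0x8d; out[1] = (uint8_t)(0xc0 | (reg << 3) | xxx);
      return 2;
      }
  if (-128 <= value && value <= 127) {  /* xxx = 000: a signed byte follows */
    out[0] = 0x8d; out[1] = (uint8_t)(0xc0 | (reg << 3));
    out[2] = (uint8_t)value;
    return 3;
    }
  if (-32768 <= value && value <= 32767) {  /* xxx = 100: a signed word follows */
    out[0] = 0x8d; out[1] = (uint8_t)(0xc4 | (reg << 3));
    out[2] = (uint8_t)value; out[3] = (uint8_t)((uint32_t)value >> 8);
    return 4;
    }
  return 0;
  }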

The need for a denser encoding seems real, since I have sometimes seen replacements for assigning small constants, such as "xor eax,eax" for 0 or "or eax,-1" for -1.

The special case of 0 is very common and (almost?) always treated specially. In my experience assignments of small constants to a register occur quite often.

Nearly all arithmetic commands have a special encoding for 8 bit immediate values: and, or, xor, cmp, add, adc, sub, sbb - and there is inc/dec…

My proposal does not break compatibility since it uses a coding which currently traps as an illegal opcode.


Extended setcc

Discussed in the forum, slightly modified.

Let me propose an extension to the command setcc.

Often we need booleans not as bytes with the values 0 and 1…

By utilizing the 3 currently unused bits I propose the following meaning:

Bit    Meaning
#2     Size: 0=byte, 1=dword
#1…#0  Result form, see the following table:

Bits  false        true  Keyword  Comments
00    0            1     -        Classical case as in C or Pascal
01    0            -1    mask     For e.g. bit masks
10    -1           1     sign     Sign
11    (no change)  0     zero     Keep or zero out
11    NC           C     -        Reset or set carry flag

As a special case, for the combination 111 together with the coding for target register esp, only the carry flag shall be set accordingly.
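
As a plain C model of the four result forms (dword size; cc stands for the evaluated condition; the function names are mine, not proposed syntax):

#include <stdbool.h>
#include <stdint.h>

uint32_t setcc_classic(bool cc)            { return cc ? 1u : 0u; }   // 00: 0 / 1
uint32_t setcc_mask(bool cc)               { return cc ? ~0u : 0u; }  // 01: 0 / -1
uint32_t setcc_sign(bool cc)               { return cc ? 1u : ~0u; }  // 10: -1 / 1
uint32_t setcc_zero(bool cc, uint32_t old) { return cc ? 0u : old; }  // 11: keep / 0
// The esp-coded 111 form instead sets the carry flag: CF = cc.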

Needless to say, the usual prefixes for word and qword should work as well.

An encoding as 0f 9x /y would be OK if the superfluous bits (the reg field) could be assumed to be 0, but they have not (yet?) been documented as reserved…

An encoding without this problem could therefore be this: f2 0f 9x /y.

I give some more or less silly and untested examples; however, I am not yet sure about the concrete syntax of the new forms. See an idea below.

There are a lot of cases where a programmer defines a boolean type e.g. as int:

typedef int bool;

In this case the statement "bool b = (x==17)" can be expressed by

cmp [x],17
sete dword ptr [b]  // the form false=>0, true=>1; dword

as a replacement for

cmp [x],17
sete al
movzx eax,al
mov dword ptr [b],eax

The Windows API typically uses these booleans named BOOL.


The sgn function:

int sgn(int x) {
  if (x<0)
    return -1;
  else if (x>0)
    return 1;
  else
    return 0;
  }

can be written as

test eax,eax
setg eax,sign  // the form false=>-1, true=>1; dword
setz eax,zero  // the form false=>unchanged, true=>0; dword

as a replacement for

cdq
neg eax
shr eax,31
or eax,edx

The Visual Basic boolean data type as well as Delphi's longbool/wordbool/bytebool defines false as 0 and true as -1. These boolean values can simply be generated with e.g.

cmp …
setcc eax,mask  // the form false=>0, true=>-1; dword

as a replacement for

cmp …
setcc al
movzx eax,al
neg eax

If the value must be written to memory:

cmp …
setcc dword ptr [x],mask  // the form false=>0, true=>-1; dword

as a replacement for

cmp …
setcc al
movzx eax,al
neg eax
mov dword ptr [x],eax

The fragment

q = (q << 1) + (x > 17);

can be written as

cmp [x],17
setg  // the form where only the carry flag is set
adc eax,eax  // assuming that q is in register eax

or

cmp [x],17
setg edx  // the form false=>0, true=>1; dword
lea eax,[eax*2+edx]  // assuming that q is in register eax

as a replacement for

cmp [x],17
setg dl
movzx edx,dl
lea eax,[eax*2+edx]  // assuming that q is in register eax

The signed average, i.e. (a+b)/2, can be written as

add eax,edx
setl edx,mask  // the form false=>0, true=>-1; dword
sub eax,edx
rcr eax,1

as a replacement for

xor ecx,ecx
add eax,edx
setnl cl
dec ecx
sub eax,ecx
rcr eax,1

Gather: rep xlat

Posted as a comment.

It would be fine if at least the gather command were there; in my experience the scatter command is not used as often.

Programming such endless chains of insertxx commands is ugly, stupid, slow, space-inefficient and really should be unnecessary. All these commands must be decoded and executed, and they needlessly fill the memory and μop caches. Gather commands could at least generate the flood of μops by themselves, and the out-of-order mechanism will take care of the proper interleaving with other commands which can be executed in parallel (superscalar). I know that such a command will have a long latency.

…and things get even worse if one wants to gather small entities such as bytes. I have often missed commands such as "rep xlat" ("jecxz end; redo: lodsb; xlat; stosb; loop redo; end:" with al preserved). If gather were there, it could be simulated with some code around "gather.byte xmm1,[ea]" where the bytes of xmm1 hold unsigned offsets to ea (the effective memory address).
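
In C the wished-for "rep xlat" behaves like this (a sketch of the semantics just described; the function name is mine):

#include <stddef.h>
#include <stdint.h>

/* Translate count bytes from src to dst through a 256-entry table
   (src ~ esi, table ~ ebx, dst ~ edi, count ~ ecx); this is the loop
   of the quoted instruction sequence, but with al preserved. */
void rep_xlat(uint8_t *dst, const uint8_t *src,
              const uint8_t table[256], size_t count) {
  for (size_t i = 0; i < count; i++)
    dst[i] = table[src[i]];
  }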

By the way: I have never understood why e.g. the handy commands "rep stosb", "rep movsb", "jecxz", "loop", "enter" (level 0) and "leave" can be outperformed by replacement code at all. Modern CPUs ought to be smart enough to execute such dedicated commands much faster. It is a shame that they do not!

Let us get concrete. Here is a follow-up in the forum:
I heartily wish the string commands "rep stosb", "rep movsb", "rep cmpsb" and "rep scasb" (yes, especially the byte operations) were so fast and smart that all the many replacement routines would be outperformed and thus obsoleted. See e.g. Agner Fog's subroutine library for examples of such routines and the necessary effort to choose the right one.

You will find my final proposal of the gather command in Gather and scatter with configurable factor.


Gather: Lab translate

Posted as a comment.

Here is a real-world usage example for byte/word scatter/gather commands:

Assuming an image source of L*a*b pixels, each 3*8 bit, we want to perform a gamma correction on L and a color correction (e.g. gray balance) on a and b.

For the gamma correction there is a byte array: unsigned char L_tab[256].
For the color correction there is a word array: unsigned short ab_tab[65536].

Now we face the first problem: AoS to SoA (array of structures to structure of arrays).
We might load 3 xmm registers and perform some pshufb and palignr commands. Here a variant of pshufb with at least 2 sources is sorely missing, such as AMD's new XOP command vpperm.
A byte gather command comes in handy here, and in this case AoS to SoA can be done in 3 commands if a 3 operand form is available. Here are the needed offsets for xmm usage:

L_idx:  db  0,  3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45
ab_idx: db  1,  2,  4,  5,  7,  8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23
        ;  25, 26, 28, 29, 31, 32, 34, 35, 37, 38, 40, 41, 43, 44, 46, 47

The next problem is the actual lookup for L and ab. With byte gather for L and word gather for ab this is easy.

Finally we want to transform SoA back to AoS. Here we obviously need scatter commands doing the reverse of AoS to SoA.
In this case we could emulate them by storing to temporary memory, using gather operations to swizzle, and finally storing to the target.

Here is a coding example using xmm registers. The extension to ymm or zmm is simple.

  movdqu xmm3,[L_idx]
  movdqu xmm4,[ab_idx]
  lea eax,[src]
  lea edx,[tgt]
  lea esi,[L_tab]
  lea edi,[ab_tab]
  ; ecx is assumed to hold the loop count (number of 16-pixel blocks)
@@loop:
  vgather.byte xmm0,xmm3,[eax]     ; AoS to SoA: load L values
  vgather.byte xmm1,xmm4,[eax]     ; AoS to SoA, load ab values, part 1
  vgather.byte xmm2,xmm4,[eax+24]  ; AoS to SoA, load ab values, part 2
  add eax,3*16
  gather.byte xmm0,[esi]  ; translate L
  gather.word xmm1,[edi]  ; translate ab, part 1
  gather.word xmm2,[edi]  ; translate ab, part 2
  vscatter.byte [edx],   xmm0,xmm3  ; SoA to AoS, store L values
  vscatter.byte [edx],   xmm1,xmm4  ; SoA to AoS, store ab values, part 1
  vscatter.byte [edx+24],xmm2,xmm4  ; SoA to AoS, store ab values, part 2
  add edx,3*16
  dec ecx
  jnz @@loop

The corresponding scalar code looks like this:

  lea eax,[src]
  lea edx,[tgt]
  lea esi,[L_tab]
  lea edi,[ab_tab]
@@loop:
  movzx ebx,byte ptr [eax]    ; load L value
  movzx ebp,word ptr [eax+1]  ; load a and b value
  add eax,3
  mov bl,[ebx+esi]    ; translate L
  mov bp,[edi+ebp*2]  ; translate ab (ab_tab entries are words)
  mov [edx],bl        ; store L value
  mov [edx+1],bp      ; store a and b value
  add edx,3
  dec ecx
  jnz @@loop
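
For comparison, the same loop in plain C (my sketch; taking a as the low and b as the high byte of the 16-bit table index matches the little-endian word load above):

#include <stddef.h>
#include <stdint.h>

/* Scalar reference of the L*a*b translation: 3 bytes per pixel,
   L translated through L_tab, the a/b pair through ab_tab. */
void lab_translate(uint8_t *dst, const uint8_t *src, size_t pixels,
                   const uint8_t L_tab[256], const uint16_t ab_tab[65536]) {
  for (size_t i = 0; i < pixels; i++) {
    uint16_t ab = (uint16_t)(src[3*i+1] | (src[3*i+2] << 8));  /* a = low, b = high byte */
    uint16_t t  = ab_tab[ab];        /* translate ab */
    dst[3*i]    = L_tab[src[3*i]];   /* translate and store L */
    dst[3*i+1]  = (uint8_t)t;        /* store a */
    dst[3*i+2]  = (uint8_t)(t >> 8); /* store b */
    }
  }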

You will find my final proposal of the gather command in Gather and scatter with configurable factor.


Gather and scatter with configurable factor

Here is my summary, as I see it, of the thread Request for extension to scatter/gather.

As an extension to Gather: rep xlat and Gather: Lab translate I would like to propose yet another variant of a gather instruction capable of fetching from a column of a table with an arbitrary width. The stride (table width in bytes) need not be a multiple of the element size.
The need arises because at least for bytes and words the possible indexes are too limited to perform the given task.
The indexes (line numbers) of the elements need not be ascending from 0 but shall be stored in the corresponding elements of an SSE or AVX register.
What we need is a scatter/gather command like this:

gather.word result, indexes, base_addr, stride

These are the parameters:

- result: an SSE/AVX register receiving the gathered elements
- indexes: an SSE/AVX register holding one index per result element
- base_addr: a memory operand whose effective address (EA) serves as the table base
- stride: the table width in bytes, taken from a general-purpose register

Example: gather.word ymm0, ymm1, [rsi+rbx*4+0x12345], ecx

The functionality (e.g. for words):

for all i do
  result.word[i] := word ptr [EA + indexes.word[i]*stride]
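
In plain C the same semantics read as follows (a sketch for 16 word elements, as in a ymm register; the function name is mine):

#include <stddef.h>
#include <stdint.h>

/* Behavioral model of the proposed gather.word: element i is loaded from
   EA + indexes[i]*stride; since stride is a free byte count, the loads
   need not be word-aligned. */
void gather_word(uint16_t result[16], const uint16_t indexes[16],
                 const uint8_t *base, size_t stride) {
  for (int i = 0; i < 16; i++) {
    const uint8_t *p = base + (size_t)indexes[i] * stride;
    result[i] = (uint16_t)(p[0] | (p[1] << 8));  /* little-endian word load */
    }
  }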

The stride parameter would be an extension to my original proposal of byte/word scatter/gather.

Other options may include

Similarly a scatter command could be defined as well.

As for the coding of the command: The AVX coding scheme allows for an extra byte where the stride register as well as 4 flags can be specified.


Bit permutations

I would like to see an instruction for bit permutations. Proposals of this kind are not new; you can find e.g. numerous articles on the web by Yedidya Hilewitz and Ruby B. Lee about butterfly operations and an implementation in hardware, e.g. here.

The syntax could be as follows:

advanced_bit_op target, source, mask, code

The arguments target and source shall be registers, mask a register or a memory address, and code an 8 bit constant.
This is perfect for AVX, i.e. VEX encoded instructions. If needed target and source can be merged into one register, e.g. for an SSE variant. Alternatively or additionally there can be an implementation for general-purpose registers.

The actual function shall be encoded in code. The heart of these functions shall be a butterfly operation. For future extensions this might not be true.

code is split into sw and op, each 4 bits.

sw is the binary logarithm (log2) of the subword size, e.g. 0=bit, 3=byte, 7=xmm register; values greater than usable shall be reserved and must not be used.

op is the function type to be performed; the highest bit of op selects between an operation and its inverse.
In my example encoding 0…4 are inverse butterfly operations and 8…c are the corresponding butterfly operations:

op   Meaning
0    compress_right
1    compress_left
2    compress_right_flip
3    compress_left_flip
4    Explicit inverse butterfly steps at levels sw and sw+1
5…7  Reserved
8    expand_right
9    expand_left
a    expand_right_flip
b    expand_left_flip
c    Explicit butterfly steps at levels sw+1 and sw
d…f  Reserved

For the explicit butterfly step operations (cases 4 and c) even values of sw suffice.
Since a butterfly step only needs half of the bits of mask as the configuration, two such configurations can be packed into mask.
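
To make the semantics of the compress/expand family concrete, here is a plain C model of compress_right for a 32-bit word (sw = 5); this is only a behavioral model, not the butterfly implementation, and it matches what Intel later shipped as PEXT (see below):

#include <stdint.h>

/* All bits of x selected by mask are moved to the right,
   keeping their relative order. */
uint32_t compress_right(uint32_t x, uint32_t mask) {
  uint32_t result = 0;
  int pos = 0;
  while (mask != 0) {
    uint32_t lowest = mask & (0u - mask);  /* isolate lowest set mask bit */
    if (x & lowest)
      result |= (uint32_t)1 << pos;
    pos++;
    mask &= mask - 1;  /* clear that mask bit */
    }
  return result;
  }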

The operations manageable by the [inverse] butterfly network and needing an expensive calculation for the configuration should hold the configuration in a cache where an entry can be identified by the mask, the direction (butterfly or inverse butterfly) and the subword size sw.

The configuration itself needs log2(register_size)*register_size/2 bits.
For SSE (XMM register) we therefore need 128 + 1 + 3 + 7*128/2 = 580 bits for one cache entry; for AVX (YMM register) it will be 256 + 1 + 4 + 8*256/2 = 1285 bits.

The configuration cache and both the butterfly network and the inverse one shall be present on each core. The machinery for calculating the configuration from the mask might exist once per die, possibly microprogrammed; it need not be very fast due to the cache. The cache need not be saved across thread switches and can even be shared by any thread due to its universal character. There can be more than one cache entry. Butterfly operations should run in 1 cycle.

This proposal seems to be partially obsolete now. In the document 319433-011 (June 2011) the instructions compress_right and expand_right are introduced (proposed for 2013, released about 2013-06) by Intel as "Haswell New Instructions" and are named PEXT and PDEP (BMI2 instruction set). They act on 32 and 64 bit "general purpose" registers, albeit without the possibility to specify sw (i.e. sw=5 for 32 bit registers and 6 for 64 bit ones).
The Intel Software Development Emulator ≥ 4.29 (dated 2011-07-01) is capable of emulating these instructions.


Bit permutation primitive

Even a stripped-down variant of the bit permutations instruction, only capable of performing a single permutation step, could greatly speed up most bit permutations and has the potential to overtake the lookup-table variants. This operation is also called a delta swap.
This would be the equivalent of the following routine (word is unsigned and has the size of a register):

word bit_permute_step(word source, word mask, int shift) {
  word t;
  t = ((source >> shift) ^ source) & mask;
  return (source ^ t) ^ (t << shift);
  }
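
For illustration (my example): with mask 0x55555555 and shift 1 this primitive swaps each even bit with its odd neighbor:

/* Swap adjacent bits: 0<->1, 2<->3, …, 30<->31 (word taken as 32 bits),
   using the bit_permute_step routine above. */
word swap_adjacent_bits(word x) {
  return bit_permute_step(x, 0x55555555, 1);
  }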

The parameter shift should not be restricted to powers of 2.
This routine effectively resembles the following routines:

For a word size of 32 bit at most 5 such instructions are necessary for a BPC permutation and at most 9 for an arbitrary bit permutation.
It would be nice to have one variant where the shift is a constant and another variant where it is drawn from a register.
Evidently this instruction would be useful for GPR, SSE, AVX2 and even wider future registers.

An enhanced variant of this instruction creates two results:

void bit_permute_step(word* arg1, word* arg2, word mask, int shift) {
  word t;

  t = ((*arg2 >> shift) ^ *arg1) & mask;
  *arg1 = *arg1 ^ t;
  *arg2 = *arg2 ^ (t << shift);
  }

If *arg1 and *arg2 are the same, this routine obviously reduces to the one above. As above, *arg1 and *arg2 should be registers, mask should be reg/mem, and shift should be a register or a constant.
I found this routine as macros in Linux' c2p_core.h and Ghostscript's gsutil.c as well as in Hacker's Delight (figures 7.4/7.8, "Straight-line code for transposing a 32x32-bit matrix"). Credits presumably go to Michael Kalms aka Scout, but it could well be that the routine is much older.

Finally, an even slightly more useful variant of these functions uses rotates instead of shifts and typically acts as a direct replacement. This modification lifts the restriction that no bits may be shifted out.

All those routines should be implementable to run in at most 2 cycles latency without using much die space.


Other ideas

Modified excerpt from a comment.

Here are other ideas for new instructions in no particular order:


You may bookmark this page as http://programming.sirrida.de?x86_proposals.html.
Last change: 2013-01-12