I have posted most of these topics also in Intel's AVX and CPU Instructions forum; this is my view of the discussion.
Comments or questions? E-mail: info@sirrida.de
Discussed in the forum.
Let me propose a new command: push effective address (pushea).
For example, in 32 bit mode "pushea [ea]" is meant as a replacement for the combination "lea eax,[ea]" / "push eax", but without trashing eax.
A possible encoding could be f2 ff /6.
Discussed in the forum.
Let me propose a replacement encoding for loading tiny values via mov: "mov reg,const".
By utilizing the unused and reserved form "lea reg,reg" (8d 11rrrxxx) I propose the following meaning:
rrr is the target register and xxx shall be encoded as
xxx | Meaning |
---|---|
000 | A signed byte follows, i.e. -128…127 |
001 | 1 |
010 | 2 |
011 | Maximum signed value (for 32 bit: 0x7fffffff) |
100 | A signed word follows, i.e. -32768…32767 |
101 | Minimum signed value (for 32 bit: 0x80000000) |
110 | -2 |
111 | -1 |
Examples:
"mov eax,1": 8d c1 (2 bytes)
"mov eax,100": 8d c0 64 (3 bytes)
"mov eax,1000": 8d c4 e8 03 (4 bytes)
All these encodings are better than b8 xx xx xx xx (5 bytes).
Needless to say, the usual prefixes for word and qword should work as well. Example:
"mov rax,-2": 48 8d fe (3 bytes)
instead of
48 c7 c0 fe ff ff ff (7 bytes)
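To make the scheme concrete, here is a minimal C sketch of how an assembler could emit the proposed 32 bit form; the helper name is made up and the byte values follow directly from the table above:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical assembler helper: emit the proposed short "mov reg32,const"
   encoding (8d 11rrrxxx [imm]) into buf, falling back to the classic
   b8+r imm32 form when no short form applies; reg is 0=eax ... 7=edi. */
size_t emit_mov_const(uint8_t *buf, unsigned reg, int32_t value)
{
    int xxx = -1, imm_bytes = 0;
    switch (value) {
        case  1:          xxx = 1; break;
        case  2:          xxx = 2; break;
        case  0x7fffffff: xxx = 3; break;
        case -2:          xxx = 6; break;
        case -1:          xxx = 7; break;
        default:
            if (value == (int32_t)0x80000000)           xxx = 5;
            else if (value >= -128   && value <= 127)   { xxx = 0; imm_bytes = 1; }
            else if (value >= -32768 && value <= 32767) { xxx = 4; imm_bytes = 2; }
            break;
    }
    if (xxx < 0) {                        /* classic 5 byte encoding b8+r imm32 */
        buf[0] = (uint8_t)(0xb8 | reg);
        for (int i = 0; i < 4; ++i) buf[1 + i] = (uint8_t)((uint32_t)value >> (8 * i));
        return 5;
    }
    buf[0] = 0x8d;                                 /* the reserved "lea reg,reg" opcode */
    buf[1] = (uint8_t)(0xc0 | (reg << 3) | xxx);   /* 11 rrr xxx */
    for (int i = 0; i < imm_bytes; ++i) buf[2 + i] = (uint8_t)((uint32_t)value >> (8 * i));
    return 2 + (size_t)imm_bytes;
}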
The need for a denser encoding seems to be there, since I have sometimes seen replacements for assigning small constants: the special case of 0 is very common and (almost?) always treated specially. In my experience assignments of small constants to registers occur quite often.
Nearly all the arithmetic commands have a special notation for 8 bit immediate values: and, or, xor, cmp, add, adc, sub, sbb - and there is inc/dec…
My proposal does not break compatibility since a coding is used which traps as an illegal opcode.
Discussed in the forum, slightly modified.
Let me propose an extension to the command setcc.
Often we need booleans not as bytes with the values 0 and 1…
By utilizing the 3 currently unused bits I propose the following meaning:
Bit | Meaning |
---|---|
#2 | Size: 0=byte, 1=dword |
#1…#0 | Form of the result: false=>0, true=>1 (as today); false=>0, true=>-1 (mask); false=>-1, true=>1 (sign); false=>unchanged, true=>0 (zero) |
As a special case, the bit combination 111 together with the coding for target register esp shall only set the carry flag accordingly.
Needless to say, the usual prefixes for word and qword should work as well.
An encoding as 0f 9x /y would be OK if the superfluous bits (the reg field) could be assumed to be 0, but they have not (yet?) been documented as reserved…
An encoding without this problem could therefore be this: f2 0f 9x /y.
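As a plain C reference for the proposed result forms, here is what the dword size would compute; the names default/mask/sign/zero are taken from the examples below, and which 2-bit codes select which form is deliberately left open:

#include <stdint.h>

/* Reference model of the proposed setcc result forms (dword size).
   "cond" is the evaluated condition, "old" the previous register content. */
uint32_t setcc_default(int cond)               { return cond ?  1u : 0u;  }   /* false=>0, true=>1  */
uint32_t setcc_mask   (int cond)               { return cond ? ~0u : 0u;  }   /* false=>0, true=>-1 */
uint32_t setcc_sign   (int cond)               { return cond ?  1u : ~0u; }   /* false=>-1, true=>1 */
uint32_t setcc_zero   (int cond, uint32_t old) { return cond ?  0u : old; }   /* false=>unchanged, true=>0 */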
I give some more or less silly and untested examples; however, I am not yet sure about the concrete syntax for the new forms. See an idea below.
There are a lot of cases where a programmer defines a boolean type e.g. as int:
typedef int bool;
In this case the statement "bool b = (x==17)" can be expressed by
cmp [x],17
sete dword ptr [b] // the form false=>0, true=>1; dword
as a replacement for
cmp [x],17
sete al
movzx eax,al
mov dword ptr [b],eax
The Windows API typically uses these booleans named BOOL.
The sgn function:
int sgn(int x) { if (x<0) return -1; else if (x>0) return 1; else return 0; }
can be written as
test eax,eax
setg eax,sign // the form false=>-1, true=>1; dword
setz eax,zero // the form false=>unchanged, true=>0; dword
as a replacement for
cdq
neg eax
shr eax,31
or eax,edx
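For comparison, the same sgn function can also be written branch-free in plain C; this computes the identical result:

/* Branch-free sgn in C: yields -1, 0 or 1. */
int sgn_branchfree(int x)
{
    return (x > 0) - (x < 0);
}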
The Visual Basic boolean data type as well as Delphi's longbool/wordbool/bytebool defines false as 0 and true as -1. These boolean values can simply be generated with e.g.
cmp …
setcc eax,mask // the form false=>0, true=>-1; dword
as a replacement for
cmp …
setcc al
movzx eax,al
neg eax
If the value must be written to memory:
cmp …
setcc dword ptr [x],mask // the form false=>0, true=>-1; dword
as a replacement for
cmp …
setcc al
movzx eax,al
neg eax
mov dword ptr [x],eax
The fragment
q = (q << 1) + (x > 17);
can be written as
cmp [x],17
setg // the form where only the carry flag is set
adc eax,eax // assuming that q is in register eax
or
cmp [x],17
setg edx // the form false=>0, true=>1; dword
lea eax,[eax*2+edx] // assuming that q is in register eax
as a replacement for
cmp [x],17
setg dl
movzx edx,dl
lea eax,[eax*2+edx] // assuming that q is in register eax
The signed average, i.e. (a+b)/2, can be written as
add eax,edx
setl edx,mask // the form false=>0, true=>-1; dword
sub eax,edx
rcr eax,1
as a replacement for
xor ecx,ecx
add eax,edx
setnl cl
dec ecx
sub eax,ecx
rcr eax,1
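Written out in C with a widened intermediate (a sketch of the intended result, not of the instruction sequences above), the overflow-safe signed average reads:

#include <stdint.h>

/* Signed average (a+b)/2, truncated toward zero; the 64 bit sum avoids
   the overflow that the assembly versions handle via the carry flag. */
int32_t signed_average(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    return (int32_t)(sum / 2);
}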
Posted as a comment.
It would be fine if at least the gather command were there; in my experience the scatter command is not used as often.
Programming such endless chains of insertxx commands is ugly, stupid, slow, space-inefficient and really should be unnecessary. All these commands must be decoded and executed, and needlessly fill the memory and μop caches. Gather commands could at least generate the flood of μops by themselves, and the out-of-order mechanism would take care of the proper interleaving with other commands which can be executed in parallel (superscalar). I know that such a command will have a long latency.
…and things get even worse if one wants to gather small entities such as bytes. I have often missed commands such as "rep xlat" ("jecxz end; redo: lodsb; xlat; stosb; loop redo; end:" with al preserved). If gather were there, it could be simulated with some code around "gather.byte xmm1,[ea]" where the bytes of xmm1 hold unsigned offsets to ea (the effective memory address).
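A plain C sketch of what such a byte gather ("gather.byte xmm1,[ea]") would compute for one 16 byte register; the function name is made up:

#include <stdint.h>
#include <stddef.h>

/* Vectorized xlat: every byte of vec is replaced by the table byte it
   points to; the bytes are unsigned offsets relative to table (= ea). */
void gather_byte_16(uint8_t vec[16], const uint8_t *table)
{
    for (size_t i = 0; i < 16; ++i)
        vec[i] = table[vec[i]];
}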
By the way: I have never understood why e.g. the handy commands "rep stosb", "rep movsb", "jecxz", "loop", "enter" (level 0) and "leave" can be outperformed by any replacement stuff. Modern CPUs ought to be smart enough to execute such dedicated commands much faster. It is a shame that they do not!
Let us get concrete. Here is a follow-up in the forum:
I heartily wish the string commands "rep stosb", "rep movsb", "rep cmpsb" and "rep scasb" (yes, especially the byte operations) were so fast and smart that all the many replacement routines would be outperformed and thus obsoleted.
See e.g. Agner Fog's subroutine library for examples of such routines and the necessary effort to choose the right one.
You will find my final proposal of the gather command in Gather and scatter with configurable factor.
Posted as a comment.
Here is a real world example usage for byte/word scatter/gather commands:
Assuming an image source of L*a*b pixels, each 3*8 bit, we want to perform a gamma correction on L and a color correction (e.g. gray balance) on a and b.
For the gamma correction there is a byte array: unsigned char L_tab[256].
For the color correction there is a word array: unsigned short ab_tab[65536].
Now we face the first problem: AoS to SoA.
We might load 3 xmm registers and perform some pshufb and palignr commands.
Doing so, a variant of pshufb with at least 2 sources is sorely missing, such as AMD's new XOP command vpperm.
A byte gather command comes in handy here, and in this case AoS to SoA can be done in 3 commands if a 3 operand form is available.
Here are the needed offsets for xmm usage:
L_idx:  db 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45
ab_idx: db 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23 ; 25, 26, 28, 29, 31, 32, 34, 35, 37, 38, 40, 41, 43, 44, 46, 47
The next problem is the actual lookup for L and ab. With byte gather for L and word gather for ab this is easy.
Finally we want to transform SoA back to AoS.
Here we obviously need scatter commands doing the reverse of AoS to SoA.
In this case we could emulate it by storing to temporary memory, using gather operations to swizzle, and finally storing to the target.
Here is a coding example using xmm registers. The extension to ymm or zmm is simple.
movdqu xmm3,[L_idx]
movdqu xmm4,[ab_idx]
lea eax,[src]
lea edx,[tgt]
lea esi,[L_tab]
lea edi,[ab_tab]
@@loop:
vgather.byte xmm0,xmm3,[eax] ; AoS to SoA: load L values
vgather.byte xmm1,xmm4,[eax] ; AoS to SoA, load ab values, part 1
vgather.byte xmm2,xmm4,[eax+24] ; AoS to SoA, load ab values, part 2
add eax,3*16
gather.byte xmm0,[esi] ; translate L
gather.word xmm1,[edi] ; translate ab, part 1
gather.word xmm2,[edi] ; translate ab, part 2
vscatter.byte [edx],xmm0,xmm3 ; SoA to AoS, store L values
vscatter.byte [edx],xmm1,xmm4 ; SoA to AoS, store ab values, part 1
vscatter.byte [edx+24],xmm2,xmm4 ; SoA to AoS, store ab values, part 2
add edx,3*16
dec ecx ; ecx holds the number of 16 pixel blocks
jnz @@loop
The corresponding scalar code looks like this:
lea eax,[src]
lea edx,[tgt]
lea esi,[L_tab]
lea edi,[ab_tab]
@@loop:
movzx ebx,byte ptr [eax] ; load L value
movzx ebp,word ptr [eax+1] ; load a and b value
add eax,3
mov bl,[ebx+esi] ; translate L
mov bp,[edi+ebp*2] ; translate ab (ab_tab has word entries, hence the scaling by 2)
mov [edx],bl ; store L value
mov [edx+1],bp ; store a and b value
add edx,3
dec ecx ; ecx holds the pixel count
jnz @@loop
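For reference, here is the same per-pixel work in plain C, assuming the packed L,a,b layout and the two tables described above; the ab table is indexed with a in the low byte and b in the high byte, matching the little-endian word load in the scalar code:

#include <stdint.h>
#include <stddef.h>

/* Per pixel: gamma correction on L via L_tab, color correction on the
   packed (a,b) pair via ab_tab. src and tgt hold count pixels of 3 bytes. */
void lab_translate(const uint8_t *src, uint8_t *tgt, size_t count,
                   const uint8_t L_tab[256], const uint16_t ab_tab[65536])
{
    for (size_t i = 0; i < count; ++i) {
        uint16_t ab = (uint16_t)(src[3*i + 1] | src[3*i + 2] << 8);
        uint16_t t  = ab_tab[ab];
        tgt[3*i]     = L_tab[src[3*i]];
        tgt[3*i + 1] = (uint8_t)t;
        tgt[3*i + 2] = (uint8_t)(t >> 8);
    }
}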
You will find my final proposal of the gather command in Gather and scatter with configurable factor.
Here is my view of the summary of the thread Request for extension to scatter/gather.
As an extension to Gather: rep xlat and Gather: Lab translate I would like to propose yet another variant of a gather instruction capable of fetching from a column of a table with an arbitrary width.
The stride (table width in bytes) need not be a multiple of the element size.
The need arises because at least for bytes and words the possible indexes are too limited to perform the given task. The indexes (line numbers) of the elements need not be ascending from 0 but shall be stored in the corresponding elements of an SSE or AVX register.
What we need is a scatter/gather command like this:
gather.word result, indexes, base_addr, stride
These are the parameters: result is an SSE/AVX register receiving the fetched elements, indexes is an SSE/AVX register holding one index per element, base_addr is an effective memory address with the usual addressing modes, and stride is a general purpose register giving the table width in bytes.
Example: gather.word ymm0, ymm1, [rsi+rbx*4+0x12345], ecx
The functionality (e.g. for words):
for all i do result.word[i] := word ptr [EA + indexes.word[i]*stride]
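A C model of this functionality for the word flavor and one ymm register (16 elements); the names and the fixed element count are only for illustration:

#include <stdint.h>
#include <stddef.h>

/* gather.word with a free stride: element i is read from the table column
   at base + indexes[i]*stride; stride need not be a multiple of 2. */
void gather_word_stride(uint16_t result[16], const uint16_t indexes[16],
                        const uint8_t *base, size_t stride)
{
    for (size_t i = 0; i < 16; ++i) {
        const uint8_t *p = base + (size_t)indexes[i] * stride;
        result[i] = (uint16_t)(p[0] | p[1] << 8);   /* unaligned little-endian load */
    }
}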
The stride parameter would be an extension to my original proposal of byte/word scatter/gather.
Other options may include
Similarly a scatter command could be defined as well.
As for the coding of the command: The AVX coding scheme allows for an extra byte where the stride register as well as 4 flags can be specified.
I would like to see an instruction for bit permutations. Proposals for similar things are not new; you can find numerous articles on the web by Yedidya Hilewitz and Ruby B. Lee about butterfly operations and an implementation in hardware, e.g. here.
The syntax could be as follows:
advanced_bit_op target, source, mask, code
The arguments target and source shall be a register,
mask a register or a memory address,
and code an 8 bit constant.
This is perfect for AVX, i.e. VEX encoded instructions.
If needed target and source can be merged into one register,
e.g. for an SSE variant.
Alternatively or additionally there can be an implementation for
general-purpose registers.
The actual function shall be encoded in code. The heart of these functions shall be a butterfly operation. For future extensions this might not be true.
code is split into sw and op, each 4 bits.
sw is the binary logarithm (log2) of the subword size, e.g. 0=bit, 3=byte, 7=xmm register; values greater than usable shall be reserved and must not be used.
op is the function type to be performed.
The highest bit in op shall denote the inverse operation.
In my example encoding 0…4 are inverse butterfly operations
and 8…c are the corresponding butterfly operations:
op | Meaning |
---|---|
0 | compress_right |
1 | compress_left |
2 | compress_right_flip |
3 | compress_left_flip |
4 | Explicit inverse butterfly steps at levels sw and sw+1 |
5…7 | Reserved |
8 | expand_right |
9 | expand_left |
a | expand_right_flip |
b | expand_left_flip |
c | explicit butterfly steps at levels sw+1 and sw |
d…f | Reserved |
For the explicit butterfly step operations (cases 4 and c) even values of sw suffice.
Since a butterfly step only needs half of the bits of mask
as the configuration,
two such configurations can be packed into mask.
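To make the table concrete, here is a C model of compress_right for the 32 bit GPR flavor (sw = 5); this is only a behavioral sketch, not the butterfly implementation:

#include <stdint.h>

/* compress_right: the source bits selected by mask are packed together
   at the right (least significant) end; this is what later became PEXT. */
uint32_t compress_right32(uint32_t source, uint32_t mask)
{
    uint32_t result = 0;
    int out = 0;
    for (int i = 0; i < 32; ++i)
        if ((mask >> i) & 1u)
            result |= ((source >> i) & 1u) << out++;
    return result;
}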
The operations manageable by the [inverse] butterfly network and needing an expensive calculation for the configuration should hold the configuration in a cache where an entry can be identified by the mask together with a few bits for the kind of operation and the subword size sw.
The configuration itself needs
log2(register_size)*register_size/2 bits.
For SSE (XMM register) we therefore need
128 + 1 + 3 + 7*128/2 = 580 bits for one cache entry;
for AVX (YMM register) it will be 256 + 1 + 4 + 8*256/2 = 1285 bits.
The configuration cache and both the butterfly network and the inverse one shall be present on each core. The machinery for calculating the mask might be once per die, possibly micro programmed; it need not be very fast due to the cache. The cache need not be saved across thread switches and can even be used by any thread due to its universal character. There can be more than one cache entry. Butterfly operations should run in 1 cycle.
This proposal seems to be partially obsolete now.
In the document 319433-011 (June 2011) the instructions compress_right and expand_right are introduced (proposed for 2013, released about 2013-06) by Intel as "Haswell New Instructions" and are named PEXT and PDEP (BMI2 instruction set). They act on 32 and 64 bit "general purpose" registers, albeit without the possibility to specify sw (i.e. sw=5 for 32 bit registers and 6 for 64 bit ones).
The Intel Software Development Emulator ≥ 4.29 (dated 2011-07-01) is capable of emulating these instructions.
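A small usage example with the BMI2 intrinsics (requires a BMI2 capable compiler and CPU, e.g. gcc -mbmi2, or the emulator mentioned above):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned x = _pext_u32(0xcafe, 0x00f0f0f0);   /* compress_right: pack the masked bits to the right */
    unsigned y = _pdep_u32(x,      0x00f0f0f0);   /* expand_right: scatter them back into the mask positions */
    printf("%08x %08x\n", x, y);                  /* y == (0xcafe & 0x00f0f0f0) */
    return 0;
}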
Even a stripped down variant of the bit permutations instruction only capable of doing a single permutation step could greatly speed up most bit permutations and has the potential to overtake the lookup variants. It is also called delta swap.
This would be the equivalent of the following routine
(word is unsigned and has the size of a register):
word bit_permute_step(word source, word mask, int shift) {
  word t;
  t = ((source >> shift) ^ source) & mask;
  return (source ^ t) ^ (t << shift);
}
The parameter shift should not be restricted to powers of 2.
This routine effectively resembles the following routines:
For a word size of 32 bit at most 5 such instructions are necessary for a BPC permutation and at most 9 for an arbitrary bit permutation.
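As an illustration of the 5-step bound (assuming the 32 bit flavor of the routine above), the classic bit reversal of a word is such a BPC permutation and needs exactly 5 steps:

#include <stdint.h>

typedef uint32_t word;

word bit_permute_step(word source, word mask, int shift)
{
    word t = ((source >> shift) ^ source) & mask;
    return (source ^ t) ^ (t << shift);
}

/* Full bit reversal of a 32 bit word in 5 delta swaps. */
word bit_reverse32(word x)
{
    x = bit_permute_step(x, 0x55555555, 1);    /* swap adjacent bits */
    x = bit_permute_step(x, 0x33333333, 2);    /* swap bit pairs     */
    x = bit_permute_step(x, 0x0f0f0f0f, 4);    /* swap nibbles       */
    x = bit_permute_step(x, 0x00ff00ff, 8);    /* swap bytes         */
    x = bit_permute_step(x, 0x0000ffff, 16);   /* swap half words    */
    return x;
}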
It would be nice to have
one variant where the shift is a constant and
another variant where it is drawn from a register.
Evidently this instruction would be useful for GPR, SSE, AVX2
and even wider future registers.
An enhanced variant of this instruction creates two results:
void bit_permute_step(word* arg1, word* arg2, word mask, int shift) {
  word t;
  t = ((*arg2 >> shift) ^ *arg1) & mask;
  *arg1 = *arg1 ^ t;
  *arg2 = *arg2 ^ (t << shift);
}
If arg1 and arg2 refer to the same word we obviously get back the routine above.
Similar to above *arg1 and *arg2 should be registers,
mask should be reg/mem,
and shift should be a register or a constant.
I found this routine as macros in Linux' c2p_core.h and Ghostscript's gsutil.c as well as in Hacker's Delight (figure 7.4/7.8, "Straight-line code for transposing a 32x32-bit matrix").
Credits presumably go to Michael Kalms aka Scout, but it could well be that the routine is much older.
Finally, an even slightly more useful variant of these functions uses rotates instead of shifts and typically acts as a direct replacement. This modification lifts the restriction that no bits may be shifted away.
All those routines should be implementable to run with at most 2 cycles of latency without using much die space.
Modified excerpt from a comment.
Here are other ideas for new instructions in no particular order: