Programming topics

Speeding up case of string (Pascal)
Counting loops in C/C++
Dealing with endian
Singletons / Do once only
Manipulating rightmost bits
Iterators

This is a collection of programming hints and techniques. Although primarily dedicated to Pascal and/or C/C++ many ideas are applicable for other programming languages as well. In some cases also a compiler developer can profit.

Comments or questions? E-mail: info@sirrida.de

Speeding up case of string (Pascal)

The implementation of Free Pascal's case for strings is not more than an else-if chain where the strings are compared one by one; this effectively is only "syntactic sugar". This applies at least to versions up to 2.7.1 (2013-01). It is a shame that Free Pascal creates such poor code.
It is even more a shame that Delphi has no string-case at all, at least up to the version XE3 (2013-01-24) although this language feature was demanded many times.
As you can see in Dispatching on strings in theory, the compiler could do much better.

On work we often have quite large string-case constructs. As written above Delphi does not know about a string-case. Therefore as a kludge to speed up our Delphi programs I often pre-dispatch on the first character and then use an else-if chain. This is very simple and reasonably fast (at least considering the simplicity), but still quite readable and maintainable. The goto statements are somewhat ugly but there is no real alternative. This transformation is essentially an example of what the compiler should have done.
I have special cased the empty string because the empty string is represented by a nil pointer; access to any character of it typically results in an access violation. This is true for all dynamic string types.

label _ELSE;
…
if s='' then  // very fast, preventing AV
  …
else begin
  case s[1] of  // s is non-empty, medium fast
    'a': begin
      if false then
      else if s='abc' then  …
      else if s='axx' then  …
      else goto _ELSE;
      end;
    'z': begin
      if false then
      else if s='zzz' then  …
      else goto _ELSE;
      end;
    else begin
_ELSE:
      …
      end;
    end;
  end;

Using a precompiler could also be a solution to problems like this but being forced to always apply an extra compilation step is probably overkill.
Unfortunately Pascal typically has no such thing as a preprocessor.
See also the chapter Switch on strings (C/C++).

Counting loops in C/C++

Have you ever tried to program a counting loop in C which works correctly in all cases, especially including empty ranges and the complete range of a type such as 0…255 for unsigned char aka byte without using a bigger type?
Your first attempt probably resulted in an endless loop such as this:

unsigned char i;
for (i=0; i<=255; ++i) {…}

Here obviously i<=255 is always true.
The problem is that C's for loop is essentially only a while loop with an additional initialization and an incrementing part, i.e. only syntactic sugar.
While being quite flexible allowing for almost any kind of loop this construction is much more clumsy and error-prone than Pascal's and Modula's for statement: The looping variable must be written at least 3 times and you must pay attention to use the correct relation operator which must match the incrementing or decrementing step. Also, you should pay attention that the comparison is as simple as possible, i.e. that the ending value is only a variable or constant. It's just too easy to make it wrong.
As far as I can see there is no general solution using only one simple conditional expression; also a simple while loop is unable to iterate over the whole range.

In the following I assume that the ending value is constant for the whole loop. The looping variable should be considered read-only in the loop body and undefined elsewhere. We do not want to duplicate the loop body.
One valid general solution for a loop on v using type t over values from s to e is the following:

t e = …
t s = …
if (s <= e) {
  t v=s;
  while (true) {
    …
    if (v == e)
      break;
    ++v;
    }
  }

It would be nice if we could find a macro which does the stuff. Unfortunately a single macro forces us to pack the statement list into one macro parameter, i.e.

FOR(t,v,s,e,
  statement_list
  )

This not only looks strange, i.e. ending a statement list with ")", but also trashes line numbering for these statements if multiple lines are used. In case of a syntax error the compiler would not give us a helpful line number. Also this effectively disables the some of the macros mentioned in Switch on strings (C/C++) where line numbers are part of the inner magic.

A pair of matching macros

FOR1(t,v,s,e)
  statement_list
  FOR2(v)

obviously does the job:

#define FOR1(t,v,s,e) \
  { \
    t v,_e_; \
    v = (s); \
    _e_ = (e); \
    if (v <= _e_) { \
      while (true) { \
        {

#define FOR2(v) \
          } \
        if (v == _e_) \
          break; \
        ++v; \
        } \
      } \
    }

This however is ugly because we have to provide the looping variable twice. We also must pay attention that these macros match, especially when we want to provide macros for incrementing and decrementing loops.

Much better is a looping macro and a generalized END macro which does not need any parameters, e.g.:

FOR(unsigned char, v, 0, 255)
  statement_list
  END

This can be realized as follows:

#define FOR(t,v,s,e) \
  { \
    t _v_, _e_; \
    _v_ = (s); \
    _e_ = (e); \
    if (_v_ <= _e_) { \
      switch (0) { \
        while (_v_ != _e_) { \
          ++_v_; \
          default: \
          { \
            const t v = _v_;

#define END }}}}}

Here I give up some simplicity and possibly some speed to achieve the goal.

You can find this macro and other related macros for Modula-style control flow statements in modula.h and a small test program for the loop macros in testloop.c.
By the way: The many seemingly superfluous braces are needed in order to be able to use the same END macro not only for this FOR macro, but also for the string switch macros and all other related macros. You will find their formal grammar in modula.txt.

Dealing with endian

For data exchange purposes it is necessary to be explicit about the endian (i.e. byte order) of stored values. I have often seen the usage of converting routines which do a byte swap if needed; as far as I can see this is error-prone and not very elegant.
To cope with the problem in an easy way we often use special objects/records/structs/classes which make the endian explicit; the access thereof is done by methods which perform the byte swaps; doing so makes it nearly impossible to use them in a wrong way. Additionally the source code directly shows the endian in the definition of the transfer record.
Several such records (such as a 3 byte Motorola type for scanner usage, a reminiscent of SCSI) can easily be combined into byte packed records which can be directly exchanged with the file I/O system. Obviously it is often necessary to force byte-wise structure packing with a compiler directive.
You should always use the same interface for all endian types. It might make sense to force the exclusive usage of such types in exchange records as a company policy.

You can find a simple example in endian.*.

Free Pascal defines one of these symbols:

ENDIAN_LITTLE Defined when the Free Pascal target is a little-endian processor (80x86, Alpha, ARM).
ENDIAN_BIG Defined when the Free Pascal target is a big-endian processor (680x0, PowerPC, SPARC, MIPS).

Obviously this method to divide the storage from its access can easily be extended to other uses as well such as dealing with uncommon floating point formats.

Singletons / Do once only

If you want to acquire a singleton or need to build a structure only once as the hash maps in Switch on strings (C/C++), the simplest solution is this:

bool ready = false;
…
if (!ready) {
  // Acquire singleton / build structure
  ready = true;
  }

However this is not at all thread-safe. Instead you may use the following locking strategy (the first if should be as fast as possible):

volatile bool ready = false;
…
if (!ready) {
  // Set a load barrier
  // Enter critical section
  if (!ready) {
    // Acquire singleton / build structure
    // Set a store barrier
    ready = true;
    }
  // Leave critical section
  }

This is the fastest method I could find. Unfortunately this pattern is not without problems, especially on multi-processor machines, which are related to memory access ordering and caching.
volatile is necessary to force ready to be always be fetched from memory, because otherwise other threads have no chance to see a change in the second if, when ready is cached in a register; moreover the second if could be optimized away.
The critical section and the second if is necessary to be thread-safe.
The memory barriers are necessary when there are multiple processors which use out-of-order memory accesses. For many systems these barriers are not necessary.
See also an article about Double-checked locking.

Manipulating rightmost bits

Assuming that numbers are processed in two's complement representation (i.e. -1 = ~0) several simple operations are possible which look for the rightmost clear / set bit.
Unfortunately there is no similar set of operations for the leftmost bits. The (yet?) only x86 instruction which can mirror bits is AMD's new XOP instruction VPPERM.
As an example the source bit pattern …x01… stands for an arbitrary number of bits (possibly none) followed by a 0 bit followed a contiguous row of 1 bits (possibly none); this means that we have to look for the rightmost 0 bit.
There is e.g. an operation …001… which zeros out all the x bits [x & ~(x+1)].
Another operation …~11… inverts all x bits and toggles the rightmost 0 bit [~x | (~x-1)].
For some of the non-trivial operations there are special instructions for newer x86 processors which are from the sets BMI1 ("bit modification instructions") and TBM ("trailing bit modification", currently AMD only).
The given x86 instruction sequences modify the eax register.
Please note that the table sports hints when you move your mouse over the fields.

To illustrate a concrete example we take the binary number 1011 (decimal 11) and apply the INC instruction (…x01… => …x10…).
We locate the rightmost 0 bit (1011) and then invert this 0 and all 1 bits to the right of it; we get 1100 (decimal 12).
Let us repeat this action. We locate the rightmost 0 bit 1100 and then invert this 0 and all 1 bits to the right of it (there are none); we get 1101 (decimal 13).
Indeed, this function increments the given number.

Source	Target	Mode	Instruction	Set	Formula(s)	Implementation in x86 assembler	0	-1
…x01… Search for lowest 0 bit	…000…	0x00	MOV		0	xor eax,eax / mov eax,0	0	0
	…001…	0x01			x & ~(x+1); x & (~x-1)	lea edx,[eax+1]; not edx; and eax,edx	0	-1
	…010…	0x02	BLCIC	TBM	~x & (x+1)	lea edx,[eax+1]; not eax; and eax,edx	1	0
	…011…	0x03	BLCMSK	TBM	x ^ (x+1)	lea edx,[eax+1]; xor eax,edx	1	-1
	…100…	0x04			~(x ^ (x+1)); x ^ ~(x+1)	lea edx,[eax+1]; xor eax,edx; not eax	-2	0
	…101…	0x05	BLCI	TBM	x \| ~(x+1); x \| (~x-1)	lea edx,[eax+1]; not edx; or eax,edx	-2	-1
	…110…	0x06	T1MSKC	TBM	~x \| (x+1)	lea edx,[eax+1]; not eax; or eax,edx	-1	0
	…111…	0x07	MOV		-1	or eax,-1 / mov eax,-1	-1	-1
	…x00…	0x08	BLCFILL	TBM	x & (x+1)	lea edx,[eax+1]; and eax,edx	0	0
	…x01…	0x09	-		x	-	0	-1
	…x10…	0x0a	INC		x+1	inc eax	1	0
	…x11…	0x0b	BLCS	TBM	x \| (x+1)	lea edx,[eax+1]; or eax,edx	1	-1
	…~00…	0x0c			~(x \| (x+1)); ~x & ~(x+1)	lea edx,[eax+1]; or eax,edx; not eax	-2	0
	…~01…	0x0d			~x-1; ~(x+1); -2-x	not eax; dec eax	-2	-1
	…~10…	0x0e	NOT		~x	not eax	-1	0
	…~11…	0x0f			~(x & (x+1)); ~x \| ~(x+1)	lea edx,[eax+1]; and eax,edx; not eax	-1	-1
…x10… Search for lowest 1 bit	…000…	0x10	MOV		0	xor eax,eax / mov eax,0	0	0
	…001…	0x11	TZMSK	TBM	~x & (x-1); ~(x \| -x); (x & -x)-1	lea edx,[eax-1]; not eax; and eax,edx	-1	0
	…010…	0x12	BLSI	BMI1	x & -x; x & (~x+1)	mov edx,eax; neg eax; and eax,edx	0	1
	…011…	0x13	BLSMSK	BMI1	x ^ (x-1); -x ^ ~x	lea edx,[eax-1]; xor eax,edx	-1	1
	…100…	0x14			x ^ -x	mov edx,eax; neg eax; xor eax,edx	0	-2
	…101…	0x15	BLSIC	TBM	~x \| (x-1)	lea edx,[eax-1]; not eax; or eax,edx	-1	-2
	…110…	0x16			x \| -x	mov edx,eax; neg eax; or eax,edx	0	-1
	…111…	0x17	MOV		-1	or eax,-1 / mov eax,-1	-1	-1
	…x00…	0x18	BLSR	BMI1	x & (x-1)	lea edx,[eax-1]; and eax,edx	0	-2
	…x01…	0x19	DEC		x-1	dec eax	-1	-2
	…x10…	0x1a	-		x	-	0	-1
	…x11…	0x1b	BLSFILL	TBM	x \| (x-1)	lea edx,[eax-1]; or eax,edx	-1	-1
	…~00…	0x1c			~(x \| (x-1)); -x & ~x	lea edx,[eax-1]; or eax,edx; not eax	0	0
	…~01…	0x1d	NOT		~x	not eax	-1	0
	…~10…	0x1e	NEG		-x	neg eax	0	1
	…~11…	0x1f			~(x & (x-1)); -x \| ~x	lea edx,[eax-1]; and eax,edx; not eax	-1	1

I assigned the operation modes in order to demonstrate additional properties. Let us define the function tbm(x, mode) where mode is the operation mode. You will find an implementation in perm_bas.*.

Generally the following holds:

tbm(x, mode) = tbm(x, mode & 0x11) | tbm(x, mode & 0x12) | tbm(x, mode & 0x1c)

For (mode & 0x08) == 0 (upper bits are filled with 0 or 1) the following holds:

tbm(x, mode) = ~tbm(x, mode ^ 0x07)
tbm(x, mode) = tbm(~x, mode ^ 0x10)

For (mode & 0x08) != 0 (upper bits are kept or inverted) the following holds:

tbm(x, mode) = ~tbm(~x, mode ^ 0x13)
tbm(x, mode) = tbm(~x, mode ^ 0x14)
tbm(x, mode) = tbm(x+q, mode ^ 0x10)

Here q = 1 for (mode & 0x10) == 0 and q = -1 for (mode & 0x10) != 0.

Iterators

Would not it be nice to be able to externalize the logic of looping or other "framing routines"?
There is a solution: iterators! It is a shame that this simple mechanism has not caught on but instead is usually mimicked by inferior and/or complicated constructions; try to implement an iterator which steps a tree.
Life could be so simple. Caution: Sarcasm present in this section.

Once upon a time there were some clever folks who designed ALGOL 58 aka IAL (1958). This language proposed a special parameter passing method which was called pass-by-name and allowed to evaluate parameters in the context of the caller whenever the called procedure wanted to do so. At least in ALGOL-60 (1960) this feature really worked. Some gurus such as Donald E. Knuth liked it but it seems that most programmers have not understood its striking value. As it is very simple to abuse this feature, especially when using it to also write to parameters, it became discredited and consequently was more or less banned from subsequent languages…

A different approach specialized for loops was implemented in Alphard and a bit later in CLU (1975). The idea was to provide a special loop statement where most of the loop control was performed by an "iterator".

The C++ STL features iterators which are somewhat flexible but neither simple nor are intuitively or fool-proof to use.
The delegates of C++11 do the intended job but are also neither simple nor readable nor produce perfect code.
C/C++ does not support nested functions. MetaWare aka Synopsys however demonstrated a collection of compilers (C/C++/Pascal) which supported nested functions and iterators probably since 1985.
The committees seem to shy away from introducing simple things like nested functions because "nobody needs these and it can be emulated anyway", also, communities like BOOST seem to prefer hacks to clean solutions in the very same spirit. This is IMHO programming with pincers and hammers.

C# also features iterators but these can not delegate the yield to other routines nor can act recursively. Instead of doing what is needed in a straight-forward way, they do a monstrous job of code morphing, thereby generating a class and a state machine. Bizarre!
The delegates of C# do the intended job like the ones of C++ but are also too complicated.
There must be a reason why some containers such as lists offer a dedicated iterator method such as ForEach which is much faster and simpler than the foreach language feature…
C# does not support nested functions.

ISO Pascal features procedural parameters capable of transferring nested procedures for a long time. Also, Free Pascal supports nested procedures as parameters since version 2.6.0. Unfortunately Borland/Embarcadero did not implement them until now although their Turbo-Vision (1990) sketched iterators by using nested procedures in an unsafe way.

IMHO the solution was present almost from the very beginning but maybe the necessities were not really understood correctly. All that is needed is the ability to transfer a block of code with the meaning of a nested procedure; evidently this block should have access to all variables as usual.
The potential looping should not be thought to be the duty of the calling code, hence break is not an option. Also, there is no need to generate any object, capture any variable or to switch between contexts or even coroutines; simple function calls are sufficient; needless to say, the appropriate scope parameter must be provided as is usual when dealing with nested functions.

Several years ago I published a working emulation of iterators for Delphi on the Internet but got almost no feedback.
I also posted a feature request on Borland's site. Same result.
Nevertheless this emulation has become an essential tool in the code of the company I work for and proved to greatly simplify life by avoiding complicated alternatives.

What is your opinion? Please tell me, e-mail: info@sirrida.de

You may bookmark this page as http://programming.sirrida.de?programming.html.
Last change: 2017-09-04