# Programming topics

Back to main page

## Contents

This is a collection of programming hints and techniques. Although primarily dedicated to Pascal and/or C/C++ many ideas are applicable for other programming languages as well. In some cases also a compiler developer can profit.

## Hashing

### Intro

A hash function is essentially a mapping function which transforms a given key to a hash value which is a member of a relatively small number range 0…nr_slots-1.
In case you wonder why I have often abbreviated "number" to "nr": I could have also chosen "num" or even "cnt", however the often used "no" is misleading and thus unsuitable.
A hash container uses a hash value as an index to an array of subcontainers. Therefore each subcontainer contains only entries with the same hash value.
A good hash function generates each of the nr_slots possible hash values with about equal probability.
The idea is "divide and conquer", i.e. you can locate an element faster in a smaller container. Given that the hash function is evaluated fast enough and produces sufficiently different hash values, even a simple linear scan through the selected subcontainer suffices, although other strategies can be used as well such as binary trees or even using yet another hash container. For now we choose a linear list.
Obviously the method degenerates to a linear scan through all elements if nr_slots=1 or if the hash function always returns the same value. However for an evenly distributing hash function and when nr_slots is given a value of about the number of elements, we can locate an element with only one or two comparisons on average; also, we can very quickly prove whether an element is absent.
Decreasing the load factor α (which is the number of elements divided by nr_slots) makes the subcontainer smaller and thus the lookup faster but increases the memory consumption.
In the examples of this chapter the hash function takes the first character in ASCII which is masked with 7 using the and operator, i.e. the lowest 3 bits are extracted.

### Chained hashing

The most straight forward implementation of a hashing container goes like this:

Chained hashing

This container is easily filled and queried at runtime. However when the lists grow too large we need to increase nr_slots and to rehash (rebuild) the complete container.

### Hashing with collisions in-place

An alternative to using subcontainers is to put the elements directly into the array thereby circumventing an indirection. In case of a collision, i.e. the needed slot is already used, we can simply use the (circular) next free slot. To prove the absence of an element we need to scan the container until we find a hole or we are at the starting point again.
Obviously it is necessary to enlarge nr_slots at least to the number of elements because otherwise there is no free slot. To achieve reasonable speed, the number of elements should be a good deal smaller than nr_slots, i.e. the load factor α should be well below 1 (100%).
There are some methods to choose the next free slot. A linear scan is cache-friendly but often the performance is a disaster when the number of elements approach nr_slots. Slightly better is to let the increment increase as in quadratic probing. Yet other methods use a second hash function such as in double hashing or cuckoo hashing.

Hashing with probing
Here the magenta elements could not be placed in the correct place and were therefore placed in the next free slot. Empty slots are denoted gray.
When the table finally gets full, obviously the last inserted element will be placed on the last free space. In this example this is "theta" which needs 5 (!) comparisons to be found with linear probing:

Hashing with probing (full)
To prove the absence of any element we need to scan the whole container because of the way too high load factor 1 and thus the absence of empty slots. Chained hashing does not have this problem…

### Static hashing compacted

A container using linked lists has obviously a lot of pointers. For a static container (i.e. no additions and no deletions necessary) there is a compact representation which gets rid of all linkage pointers but needs nr_slots+1 pointers or better yet indexes:

Static hashing compacted
The number of elements of a subcontainer is given by the difference of the index and the next one. The gray indexes are equal to the next index. The arrows in the image point to the top of each element in order to better show the nature of the indexes.

### Static perfect hashing

If we have static container with a perfect hash function (all subcontainers have at most one element), things get simpler, also we can get rid of the loop for checking.

We can omit all next pointers but we still have to use pointers in the main container:

Perfect hashing without next pointers
We can put the elements directly into the array. Empty slots can be marked as such either by a special flag or by supplying any element with a different hash value:

Perfect hashing in-place
Similar to above we can use a compact representation using indexes. Empty slots refer to dummy elements:

Perfect hashing compacted with dummy entries

## Static hashing of integers

### Simple hashing

Let us now consider the case that we have got a constant set of unsigned integer numbers of the type t_base. These could be e.g. the cases of a switch/case statement or the result of an unmasked (raw) hash function of constant objects such as strings.
The main methods to narrow these numbers x to the needed number of slots nr_slots, i.e. 0…nr_slots-1 are

The constants m, s and d are must be chosen such that the result is < nr_slots. If each of the nr_slots slots shall be used, masking and shifting imply that nr_slots is a power of 2; the modulo function and division imply no restriction on nr_slots but are much slower. As a matter of fact, for unsigned numbers masking / shifting are special cases of modulo / division.

In order to enhance distribution, we can modify ("scramble") the numbers before the narrowing. Obviously this only makes sense if things get faster, therefore the scrambling must be very fast.
Here are some very cheap operations which scramble and narrow the raw number x (c is a constant); I assume that any overflows are ignored:

• rol(x, c) & m
• (rol(x, c) ^ x) & m
• (rol(x, c) + x) & m
• (rol(x, c) - x) & m
• ((t_base)(x * c)) >> s

Especially effective is the last one, since most modern CPUs can multiply quite fast (about 4 cycles) and we have lots of constants to choose from.
By supplying suitable constants, we can often significantly reduce the collision lists and in many cases even achieve perfect hashing which typically amortizes the scrambling effort.
For example, using 32 bit numbers, the set of all 32 powers of 2 can be perfectly (and minimally) mapped to 0…31 by a multiplication with e.g. 0x04d7651f giving a 32 bit temporary result and a logical shift right by 27 which extracts the uppermost 5 bits. By the way, this constant resembles a de Bruijn cycle.
Usually finding an optimal constant is difficult. In absence of a better algorithm one might simply try several constants and choose the best found; thousands or even millions of tries are feasible.
As usual a larger choice of nr_slots typically helps to find better constants and also gives a better chance to cheaper scrambling functions.
In Coding multiway branches using customized hash functions by H.G. Dietz (1992-07) this theme is applied to switch statements, although the multiplication was described as being too slow to be useful (8 cycles at that time); it also might be that the mentioned combination of a multiplication and a following shift was simply overlooked. It is about time that the tests are repeated.
Considering the age of that document, it is really a shame that almost no contemporary compiler makes use of the given results.

We have prepared a document which goes into more detail on this theme.

### Complex hashing

There are some methods which utilize more than one auxiliary hash function and lookup into additional arrays to create a perfect hash function.
Bob Jenkins presents one of the (at runtime) fastest methods using two auxiliary hash functions h1 and h2 such as the ones given above and one array a:

  h(x) = h1(x) ^ a[h2(x)]


Depending on the chosen auxiliary hash functions and context such a method can often outperform imperfect hashing.
Optionally even a minimal perfect hash function can be achieved, i.e. all given numbers are mapped to a contiguous range, albeit usually resulting in larger tables. Unfortunately there is yet no facility to optimize for a near-minimal perfect hash function.

You can download the sources of prototype of a perfect hashing library. This is currently an ugly mixture of C and C++ with only rudimentary documentation and tests. On the other hand it seems to work flawlessly for up to several thousands of entries.
I had an older version already running in a patch for the LLVM compiler suite. Unfortunately I can no longer find my patch online. Also, there has been only little response in the forum; some day I will give it another try…

### Switch and assignment

If all actions consist only of assigning a value to a variable, we evidently can replace the structure's jump addresses by the values and replace the jumps by the assignment. We effectively do this in Using a mapping container which also demonstrates possible realizations in C/C++.

## Dispatching on strings in theory

### Intro

Quite often we want to interpret strings such as parameters or key words which means that we need to recognize strings and do the related actions.
I mainly focus on creating dedicated code dealing with a small to moderate number (up to a few thousand) of constant strings.

The typical solution is to test a given string to the strings to be recognized one by one which is an else-if chain (an O(n) approach). However there are many ways to create faster code than this:

• Use a binary search tree. This is an O(log n) approach. We need a full string comparison (which yields "less", "equal" or "greater") for all non-leaf nodes where we can not profit from comparing the string lengths first, provided we use strings with a stored length. For leaf nodes the question is: "equal" or "not equal"? The more strings we have, the longer the string comparisons itself will take since the length of the common string beginnings will become longer, especially for comparisons near the leaf nodes. Admittedly, this usually becomes relevant only for very large string sets, however sets where many strings have quite long common prefixes are not this rare.
• Calculate a hash function which maps strings to a small range of numbers. This mapping usually needs two steps, from characters to one large number (pre-hashing) and then to the small range as described for integer numbers. Thereafter check the offending string constants with the same hash value (which should be very few) one by one with the usual else-if chain like described above. If these constants are disjunct, one single comparison suffices, i.e. the hash function is perfect. If done properly, hashing is an O(1) approach.

Perfect hashing seems to be the best approach, and thus this is the way I will carry further in this chapter.
Perfect hashing can be achieved e.g. by simple hashing (for very small sets) or Bob Jenkins' method mentioned above. I had good experiences trying both methods in my perfect hashing library.

### Selecting parts

In order to prepare a perfect string hash function you must take at least as much parts (i.e. characters) into account as are necessary to tell all strings of the set apart, and it makes sense to take as few as possible. The string length counts here as one part and usually is the cheapest element, provided one does not use null-terminated strings, however we typically need the length anyway. Unfortunately selecting the minimal/optimal choice is not easy. Many authors simply use a function which takes all characters into account. At least for small sets this seems to be overkill.
The method used by GPERF is to select the absolutely necessary ones first, then add one significant additional part after the other until the resulting excerpts are unique. Now that we have a sufficient selection, all superfluous parts are removed, and as an additional optimization every pair of chosen parts is tested whether it can be replaced by a single one.
As far as I can see, the most usable parts are

1. length (very cheap, probably by-product)
2. character at constant offset from start (needs length check)
3. character at constant offset from end (needs length check)
4. character at relative position (needs calculation using length)

It makes sense to test the actual length for the occurring minimum length min (or length range min..max) beforehand. Having done this, the first and last min characters can be easily accessed without further length checks. The rare case of an empty string being in the set should be treated specially.

The last method has some charm because is does not need conditional jumps and the extra amount of calculation is moderate (factor and shift are very small constants, and no out-of-bounds access is possible even if the multiplication overflows):

s[(unsigned) ((length-1) * factor) >> shift]


### Combining parts

The extracted parts must now be combined to a unique number. We can just try several more or less trivial combining functions (such as a+b, a^b, (a<<3)+b) until we find a suitable one which combines the parts but does not give up the "perfectness".
After combining one part after the other in this way we have the resulting pre-hash value which can be used as input for an perfect integer hashing function.
The resulting hash value can be used as an index for the string table (for comparison), and since perfect hashing was achieved, also for a jump table (switch) or a value table.

### Split by length

It makes sense to split larger string sets by length, i.e. use the length to dispatch to different hash functions. This costs several cycles. However, we then often need much fewer parts and only accesses to constant offsets without range checks; obviously, selecting parts by relative position (type 4) no longer makes sense. The resulting partial perfect hash functions are simpler than one which deals with the whole string set. Each character is read at most twice: One for hashing and one for the final comparison. All in all it may be profitable.

The string table is a concatenation of all partial string tables; an offset to the table index selects the relevant section and the partial hash value the offset within such a section. Optional other tables can be indexed with the same final hash value.

One might try to fill holes within the partial hash value ranges by using them for other partial hash values. In order to faciliate this, it might make sense to use (near) minimal partial perfect hash functions which plant holes at the end of the partial hash value ranges. It helps to place the partial ranges ordered descending by hole count.

If you completely keep apart all length cases (i.e. no hole filling), fixed length comparisons are possible, i.e. you can use cmpmem instead of cmpstr in C, and the comparison loop can even be unrolled since the lenth is a constant; for cases with only a single string, it even gets simpler. On the other hand, this complicates the resulting code for a small benefit.

As a matter of fact, GPERF usually produces code which dispatches by length, albeit only one hash function is generated, i.e. the cases are combined thereafter. This is done there by a switch statement with fall-through cases (ordered descending by length) which incrementally create the hash value.

### Miscellaneous

The described method works without problems with all character sets and all bit depths.

If you want to compare strings case insensitively you could convert all strings to upper or lower case, i.e. the string to be tested at runtime and the set at compile time.
An often working substitute (no umlauts or accented characters; possibly false positives) suitable for ACSII/ANSI/UTF-8/UTF-16 characters is to convert each part to upper case by and-ing them with ~0x20 and use a case-insensitive string comparison afterwards. By clever combining the parts you can often save some ands.

Unfortunately hashing is not really suited to match partial strings such as prefix or postfix strings. General pattern matching is even worse.
Prefix strings can easily be recognized by a decision tree as these are a special case of string ranges. Combining trees and hashing is subject to be explored.

The choice of the hash function also might influence the speed of the subsequent string comparison; if e.g. the hash function only returns the first character, the following string comparison will find that the first character matches.

It is worthwhile to calculate whether a perfect hash function derived by the described way is less expensive than a possibly much cheaper non-perfect one using a choice with only very few parts but taking some extra comparisons into account.

### Aftermath

Looking at the theme from a different angle a compiler should (optionally) try to recognize structures which can be realized by a switch/case construct, i.e. else-if chains, binary search trees, nested switch/case statements. Care should be taken as not to seriously slow down the code for the cases which are preferred in the original code, i.e. the cases which were tested first.
As usual the compiler or other tool should analyze some approaches and choose the best.
Evidently much of the discussion can be applied to other value types as well.

Example: We check for length in range 6..9 and select the character s[(unsigned) ((length-1) * 3) >> 4] (type 4). The resulting value range 70..117 (48 values, range check needed) is compacted into 0..7 (no range check necessary) by

(unsigned) (h * 0x2e5aaa76) >> 29


assuming that unsigned has 32 bit.

Offset Length Choice Hash Compacted
Monday 6M = 77 777
Tuesday 7u = 1171171
Wednesday9e = 1011012
Thursday 8h = 1041046
Friday 6F = 70 705
Saturday 8a = 97 974
Sunday 6S = 83 830
if (6<=length && length<=9) {
unsigned h = s[(unsigned) ((length-1) * 3) >> 4];
h = (unsigned) (h * 0x2e5aaa76) >> 29;
if (strings[h] == s) {  // compare to strings[8]
// Matching string found; now use h to index into table[8]
}
}


Example: We check for length in range 3..9, select the characters s[0] (type 2) and s[length-2] (type 3), and combine them using xor (+ does not work in this case). The combined value could be tested to be in the range 33..56 which are 24 values (range check needed). However we further compact this range into 0..15 (no range check necessary) by

(unsigned) (h * 0x889b53a9) >> 28

Offset Length Choice 1 Choice 2 Hash Compacted
January 7J = 74r = 1145614
February 8F = 70r = 1145211
March 5M = 77c = 9946 8
April 5A = 65i = 10540 5
May 3M = 77a = 9744 7
June 4J = 74n = 11036 3
July 4J = 74l = 10838 4
August 6A = 65s = 1155010
September9S = 83e = 1015413
October 7O = 79e = 10142 6
November 8N = 78e = 1014315
December 8D = 68e = 10133 9
if (3<=length && length<=9) {
unsigned h = (unsigned) s[0] ^ (unsigned) s[length-2];
if (33<=h && h<=56) {
if (strings[h] == s) {  // compare to strings[24]
// Matching string found; now use h-33 to index into table[24]
}
}
}

or:
if (3<=length && length<=9) {
unsigned h = (unsigned) s[0] ^ (unsigned) s[length-2];
h -= 33;
if (h <= 56-33) {
if (strings[h] == s) {  // compare to strings[24]
// Matching string found; now use h to index into table[24]
}
}
}

or (probably better):
if (3<=length && length<=9) {
unsigned h = (unsigned) s[0] ^ (unsigned) s[length-2];
h = (unsigned) (h * 0x889b53a9) >> 28;
if (strings[h] == s) {  // compare to strings[16]
// Matching string found; now use h to index into table[16]
}
}

or even:
unsigned h;
if (((unsigned) (length-3) <= 9-3) &&
(h = (unsigned) s[0] ^ (unsigned) s[length-2],
h = (unsigned) (h * 0x889b53a9) >> 28,
strings[h] == s)) {  // compare to strings[16]
// Matching string found; now use h to index into table[16]
}


Example: We use the same set as above but we split by length. As we can see, we have an additional switch which ought to be realized by a range check and a jump table.
The hole in the hash value range for length 8 and partial hash value 1 in the string table at offset 9 was reused for the single hash value for length 9. No range check on the final hash value is necessary since the table is used up to the last entry.

const string strings[12] = {
// table index length
"May",        //  0 = 0 + 0    3
"July",       //  1 = 0 + 1    4
"June",       //  2 = 1 + 1    4
"April",      //  3 = 0 + 3    5
"March",      //  4 = 1 + 3    5
"August",     //  5 = 0 + 5    6
"January",    //  6 = 0 + 6    7
"October",    //  7 = 1 + 6    7
"December",   //  8 = 0 + 8    8
"September",  //  9 = 1 + 8    9  reused hole
"February",   // 10 = 2 + 8    8
"November",   // 11 = 3 + 8    8
};

// Input: String to be compared (s) and its length (length)
unsigned h;
switch (length) {
case 3:
h = 0 + 0;  // 0..0
break;
case 4:
h = ((s[3] >> 2) & 1) + 1;  // 1..2
break;
case 5:
h = ((s[0] >> 2) & 1) + 3;  // 3..4
break;
case 6:
h = 0 + 5;  // 5..5
break;
case 7:
h = ((s[0] >> 2) & 1) + 6;  // 6..7
break;
case 8:
h = ((unsigned) (s[0] * 0x2a30f7f4) >> 30) + 8;  // 8..11; hole at 1+8
break;
case 9:
h = 0 + 9;  // 9..9, reusing a hole
break;
default:
goto string_not_found;
}
if (strings[h] == s) {
// Matching string found; now use h to index into table[12]
}
else {
string_not_found: ;
// Code for non-matching string
}


## Dispatching on strings in reality

### Pascal

Free Pascal knows how to dispatch on strings, but Delphi does not, see Speeding up case of string (Pascal).

### C / C++

C and C++ can not dispatch on strings, but with some precompiler magic can be taught to, see Switch on strings (C/C++).

### C# / .NET 1.x

Compilers for .NET 1.x (e.g. C#) implement a string switch facility by utilizing a global hash container for "interned strings".
Since all string constants are interned, when queried how the given string is interned, IsInterned returns null (not interned) or a unique constant object reference whose value is fixed and known at least at the first time of executing the string switch code. A dispatching on the references follows (ReferenceEquals). Hence this method effectively resembles the perfect hash method.

### C# / .NET >= 2.0

Compilers for .NET >= 2.0 (e.g. C#) do not (necessarily?) depend on "interned strings" (CompilerRelaxations.NoStringInterning) but feature an about equally fast solution. When the list of cases is short, an else-if chain is used. When the list is long enough, a local hash map is created which maps the string constants to consecutive numbers. Hence this method effectively resembles the hash map container method.

### Java

Since Java SE 7 one can switch on strings.
In Java string constants are always "interned" but the string switch does not really use this feature. I do not understand why there is no method like C#'s IsInterned.

The compiler effectively does only some simple minded code transformations, e.g. (taken from an article of Dawid Weiss)

switch (s) {
case "abc":
code_1();
break;
case "bde":
code_2();
break;
}


=>

String s_tmp = s;
byte index = -1;
switch (s_tmp.hashCode()) {  // lookupswitch, O(log n)
case 96354:  // hash code of "abc"
if (s_tmp.equals("abc"))
index = 0;
break;
case 97379:  // hash code of "bde"
if (s_tmp.equals("bde"))
index = 1;
break;
}
switch (index) {  // tableswitch, O(1); sometimes even lookupswitch
case 0:
code_1();
break;
case 1:
code_2();
break;
}


This is far from optimum! What use is the second switch but to simplify fall-through?! Bizarre!

## Speeding up case of string (Pascal)

The implementation of Free Pascal's case for strings is not more than an else-if chain where the strings are compared one by one; this effectively is only "syntactic sugar". This applies at least to versions up to 2.7.1 (2013-01). It is a shame that Free Pascal creates such poor code.
It is even more a shame that Delphi has no string-case at all, at least up to the version XE3 (2013-01-24) although this language feature was demanded many times.
As you can see in Dispatching on strings in theory, the compiler could do much better.

Where I work we often have quite large string-case constructs. As written above Delphi does not know about a string-case. Therefore as a kludge to speed up our Delphi programs I often pre-dispatch on the first character and then use an else-if chain. This is very simple and reasonably fast (at least considering the simplicity), but still quite readable and maintainable. The goto statements are somewhat ugly but there is no real alternative. This transformation is essentially an example of what the compiler should have done.
I have special cased the empty string because the empty string is represented by a nil pointer; access to any character of it typically results in an access violation. This is true for all dynamic string types.

label _ELSE;
…
if s='' then  // very fast, preventing AV
…
else begin
case s[1] of  // s is non-empty, medium fast
'a': begin
if false then
else if s='abc' then  …
else if s='axx' then  …
else goto _ELSE;
end;
'z': begin
if false then
else if s='zzz' then  …
else goto _ELSE;
end;
else begin
_ELSE:
…
end;
end;
end;


Using a precompiler could also be a solution to problems like this but being forced to always apply an extra compilation step is probably overkill.
Unfortunately Pascal typically has no such thing as a preprocessor.

## Switch on strings (C/C++)

### Intro

Closely related to the previous chapters Dispatching on strings in theory, and Speeding up case of string (Pascal) is the corresponding implementation in C resp. C++ which simulates a switch statement which dispatches on string constants. Basically I want to enhance the usual else-if structure by supplying a easily readable and manageable substitute ("syntactic sugar") and provide a faster execution as well. I do not want to separate the string constants from their usage as case labels or even duplicate the strings. Unfortunately there seems to be no real alternative to using macros or (even worse) a precompiler.

I will concentrate on the following syntax where there can be any number of CASE and one optional DEFAULT at the end.
A C-like break statement is assumed to be implicit and should be left out.
A continue statement is forbidden!
The macros SWITCH, CASE and DEFAULT must be placed on different lines.
This SWITCH construct should be nestable. I have prepared simple test programs testcase.c / testcase.cpp which feature the different methods presented here.

SWITCH(selector)
CASE(const1) statements
CASE(const2) statements
…
DEFAULT statements
END


If multiple CASE should be handled with one statement list, you can use a goto statement or the following macro sequence which is allowed to be placed on one line:

CASE_MULTI(const1)
CASE_OR(const2)
CASE_OR(const3)
…
CASE_DO
statements


Nothing else but CASE_OR is allowed between CASE_MULTI(const) and CASE_DO.
CASE_MULTI(const) can be replaced by CASE_MULTI0 CASE_OR(const).

The built-in preprocessor of C / C++ can not do all of the needed magic of directly creating a program transformation such as the one I have manually done above for Pascal without introducing a considerable amount of overhead and/or limitations on usability, however we can come quite close.
See case_ifc.c and case_ifs.cpp for a trivial else-if implementation.

### Perfect hash by constexpr

If we select a hash function which creates disjunct values for all given strings of a switch statement, we can directly dispatch on the hash value with an integer switch statement and just add one if per case. We might utilize the C++0x feature constexpr to calculate a manifest hash constant from a string constant since unfortunately a simple macro is not sufficient. We could however calculate the hash constant manually … But do you really want to manage that?
The resulting code typically has to deal with a large sparsely filled number range which inhibits a jump table; the best the compiler can do is to use a tree-like jump cascade or possibly use hashing (once more) too.
The main problem here is that you have to guarantee that the hash values are disjunct, i.e. that there are no conflicts; this is called a perfect hash. Fortunately the compiler complains if this condition is not fulfilled; in this case you must choose a different hash function. If you have several hash functions you have to pay attention that you use the same one for all case expressions of one switch.
See case_ce.cpp.

### Using a mapping container

We can collect the string constants during the first run during runtime over the simulated switch statement and use the map created that way in all subsequent runs to feed an integer switch statement. This is possible by using a static local variable.
One problem I see here is to let the compiler generate indexes which span a tight enough number range to allow and suggest the compiler to create a jump table. Obviously we could supply the indexes manually but this is inelegant, error-prone and tedious.
Unfortunately the macro processor is too stupid to enable the creation of an incrementing macro which can be used from within an other macro. See the free boost library for the effort it takes to even create a counting facility using an include file in boost/preprocessor/slot/counter.hpp. The macro stuff of e.g. NASM does not pose such problems…
As a poor substitute we might use __LINE__ and have to bite some bullets: The numbers do not start at 0 (we can compensate this), we must have each CASE on a different line, and we will get holes in the number range if we do not put each CASE in a single line. Almost always however this approach is good enough to get the job done.
BTW: Should Microsoft's MVS complain that __LINE__ is not constant, according to a forum post you should change the compiler setting from "Program Database for Edit and Continue" to "Program Database". Bizarre!
See case_hm.cpp for a solution using an STL container.

The interesting part of this approach is that - once we have gathered all string constants - we can put some effort in optimizing the mapping algorithm by selecting the best available (i.e. fastest and eventually perfect) hash function or even by using a completely different mapping algorithm. This is possible and makes sense because it will be evaluated at most once per program run and the effort typically is amortized in subsequent executions of the SWITCH, provided that the SWITCH is executed often enough. By doing this we can easily outperform the solution which uses the STL container.
Some precautions should be taken to protect this first pass from being executed simultaneously multiple times in a multi-threaded program, see Singletons / Do once only.
The load factor α of the hash container currently is 0.5…1.0 (50%…100%) which I think is a reasonable choice. For a perfectly randomizing hash function the number of expected string comparisons is about 1+α/2, and thus 1.25…1.5.
See case_map.c / case_map.cpp for a self-made container featuring a hash optimizer. The variant case_ms.cpp uses STL strings instead of 0 terminated character arrays. This can improve most hash functions since the string lengths are very "cheap" to access; also more very simple hash functions can be created where e.g. the length itself and/or the last character(s) are used.

Unfortunately we can not really expect that a compiler can optimize all the things of the first pass we now do at runtime into a static structure containing the optimized hash map and that the call to the selected hash function becomes static or even inlined…

### Using lambdas

We also could abandon with the integer switch statement as the implementation base completely by putting the statements of the CASE bodies directly into the mapping container using the C++0x feature lambda. This is very similar to the approach above.
The boost library defines such a construct in boost\lambda\switch.hpp.
Here you can find another implementation using a similar variant.
The approach theoretically looks nice on first sight but I was disappointed when I saw the resulting code. Also each body is a separate function thereby forbidding the goto statements mentioned above needed to pipe multiple cases into one statement list. For these reasons I will not carry this further.

### Afterword

Similar as for Pascal using a precompiler could also be a solution to problems like this but being forced to always apply an extra compilation step is probably overkill.
The kludge using a self-made hash container given by case_map.c / case_map.cpp / case_ms.cpp is a reasonable solution to the string switch problem.
All the overhead performed during runtime including eventually needed locking mechanisms could easily be eliminated and done at compile time instead, if string-switches were a built-in language feature. Unfortunately even properly implemented strings are missing…

## Counting loops in C/C++

Have you ever tried to program a loop in C which works correctly in all cases, especially including empty ranges and the complete range of a type such as 0…255 for unsigned char without using a bigger type?
Your first attempt probably resulted in an endless loop such as this:

unsigned char i;
for (i=0; i<=255; ++i) {…}


Here obviously i<=255 is always true.
The problem here is that C's for loop essentially is only a while loop with an additional initialization and an incrementing part, i.e. only syntactic sugar.
While being quite flexible this construction is much more clumsy and error-prone than Pascal's and Modula's for statement: The looping variable must be written at least 3 times and you must pay attention to use the correct relation operator which must match the incrementing or decrementing step.
As far as I can see there is no general solution using only one simple conditional expression; also a simple while loop is unable to iterate over the whole range.

I assume that the ending value is constant for the whole loop. The looping variable should be considered read-only in the loop body and undefined elsewhere. We do not want to duplicate the loop body.
One valid general solution for a loop on v using type t over values from s to e is the following:

t e = …
t s = …
if (s <= e) {
t v=s;
while (true) {
…
if (v == e)
break;
++v;
}
}


It would be nice if we could find a macro which does the stuff. Unfortunately a single macro forces us to pack the statement list into one macro parameter, i.e.

FOR(t,v,s,e,
statement_list
)


This not only looks strange, i.e. ending a statement list with ")", but also trashes line numbering for these statements if multiple lines are used. In case of a syntax error the compiler would not give us a helpful line number. Also this effectively disables the some of the macros mentioned in Switch on strings (C/C++) where line numbers are part of the inner magic.

A pair of matching macros

FOR1(t,v,s,e)
statement_list
FOR2(v)


obviously does the job:

#define FOR1(t,v,s,e) \
{ \
t v,_e_; \
v = (s); \
_e_ = (e); \
if (v <= _e_) { \
while (true) { \
{

#define FOR2(v) \
} \
if (v == _e_) \
break; \
++v; \
} \
} \
}


This however is ugly because we have to provide the looping variable twice. We also must pay attention that these macros match, especially when we want to provide macros for incrementing and decrementing loops.

Much better is a looping macro and a generalized END macro which does not need any parameters, e.g.:

FOR(unsigned char, v, 0, 255)
statement_list
END


This can be realized as follows:

#define FOR(t,v,s,e) \
{ \
t _v_, _e_; \
_v_ = (s); \
_e_ = (e); \
if (_v_ <= _e_) { \
switch (0) { \
while (_v_ != _e_) { \
++_v_; \
default: \
{ \
const t v = _v_;

#define END }}}}}


Here I give up some simplicity and possibly some speed to achieve the goal.

You can find this macro and other related macros for Modula-style control flow statements in modula.h and a small test program for the loop macros in testloop.c.
By the way: The many seemingly superfluous braces are needed in order to be able to use the same END macro for this FOR macro, the string switch macros and all other related macros. You will find their formal grammar in modula.txt.

## Dealing with endian

For data exchange purposes it is necessary to be explicit about the endian (i.e. byte order) of stored values. I have often seen the usage of converting routines which do a byte swap if needed; as far as I can see this is error-prone and not very elegant.
To cope with the problem in an easy way we often use special objects/records/structs/classes which make the endian explicit; the access thereof is done by methods which perform the byte swaps; doing so makes it nearly impossible to use them in a wrong way. Additionally the source code directly shows the endian in the definition of the transfer record.
Several such records (such as a 3 byte Motorola type for scanner usage, a reminiscent of SCSI) can easily be combined into byte packed records which can be directly exchanged with the file I/O system. Obviously it is often necessary to force byte-wise structure packing with a compiler directive.
You should always use the same interface for all endian types. It might make sense to force the exclusive usage of such types in exchange records as a company policy.

You can find a simple example in endian.*.

Free Pascal defines one of these symbols:

• ENDIAN_LITTLE Defined when the Free Pascal target is a little-endian processor (80x86, Alpha, ARM).
• ENDIAN_BIG Defined when the Free Pascal target is a big-endian processor (680x0, PowerPC, SPARC, MIPS).

Obviously this method to divide the storage from the access can easily be extended to other uses as well such as dealing with uncommon floating point formats.

## Singletons / Do once only

If you want to acquire a singleton or need to build a structure only once as the hash maps in Switch on strings (C/C++), the simplest solution is this:

bool ready = false;
…
// Acquire singleton / build structure
}


However this is not at all thread-safe. Instead you may use the following locking strategy (the first if should be as fast as possible):

volatile bool ready = false;
…
// Enter critical section
// Acquire singleton / build structure
// Set a store barrier
}
// Leave critical section
}


This is the fastest method I could find. Unfortunately this pattern is not without problems, especially on multi-processor machines, which are related to memory access ordering and caching.
volatile is necessary to force ready to be always be fetched from memory, because otherwise other threads have no chance to see a change in the second if, when ready is cached in a register; moreover the second if could be optimized away.
The critical section and the second if is necessary to be thread-safe.
The memory barriers are necessary when there are multiple processors which use out-of-order memory accesses. For many systems these barriers are not necessary.

## Manipulating rightmost bits

Assuming that numbers are processed in two's complement representation (i.e. -1 = ~0) several simple operations are possible which look for the rightmost clear / set bit.
Unfortunately there is no similar set of operations for the leftmost bits. The (yet?) only x86 instruction which can mirror bits is AMD's new XOP instruction VPPERM.
As an example the source bit pattern …x01… stands for an arbitrary number of bits (possibly none) followed by a 0 bit followed a contiguous row of 1 bits (possibly none); this means that we have to look for the rightmost 0 bit.
There is e.g. an operation …001… which zeros out all the x bits [x & ~(x+1)].
Another operation …~11… inverts all x bits and toggles the rightmost 0 bit [~x | (~x-1)].
For some of the non-trivial operations there are special instructions for newer x86 processors which are from the sets BMI1 ("bit modification instructions") and TBM ("trailing bit modification", currently AMD only).
The given x86 instruction sequences modify the eax register.
Please note that the table sports hints when you move your mouse over the fields.

To illustrate a concrete example we take the binary number 1011 (decimal 11) and apply the INC instruction (…x01… => …x10…).
We locate the rightmost 0 bit (1011) and then invert this 0 and all 1 bits to the right of it; we get 1100 (decimal 12).
Let us repeat this. We locate the rightmost 0 bit 1100 and then invert this 0 and all 1 bits to the right of it (there are none); we get 1101 (decimal 13).
Indeed, this function simply increments the given number.

Source Target Mode Instruction Set Formula(s) Implementation in x86 assembler 0 -1
…x01…

Search
for
lowest
0 bit
…000…0x00 MOV   0 xor eax,eax / mov eax,0 00
…001…0x01     x & ~(x+1); x & (~x-1) lea edx,[eax+1]; not edx; and eax,edx 0-1
…010…0x02 BLCIC TBM ~x & (x+1) lea edx,[eax+1]; not eax; and eax,edx 10
…011…0x03 BLCMSK TBM x ^ (x+1) lea edx,[eax+1]; xor eax,edx 1-1
…100…0x04     ~(x ^ (x+1)); x ^ ~(x+1) lea edx,[eax+1]; xor eax,edx; not eax -20
…101…0x05 BLCI TBM x | ~(x+1); x | (~x-1) lea edx,[eax+1]; not edx; or eax,edx -2-1
…110…0x06 T1MSKC TBM ~x | (x+1) lea edx,[eax+1]; not eax; or eax,edx -10
…111…0x07 MOV   -1 or eax,-1 / mov eax,-1 -1-1
…x00…0x08 BLCFILL TBM x & (x+1) lea edx,[eax+1]; and eax,edx 00
…x01…0x09 -   x - 0-1
…x10…0x0a INC   x+1 inc eax 10
…x11…0x0b BLCS TBM x | (x+1) lea edx,[eax+1]; or eax,edx 1-1
…~00…0x0c     ~(x | (x+1)); ~x & ~(x+1) lea edx,[eax+1]; or eax,edx; not eax -20
…~01…0x0d     ~x-1; ~(x+1); -2-x not eax; dec eax -2-1
…~10…0x0e NOT   ~x not eax -10
…~11…0x0f     ~(x & (x+1)); ~x | ~(x+1) lea edx,[eax+1]; and eax,edx; not eax -1-1
…x10…

Search
for
lowest
1 bit
…000…0x10 MOV   0 xor eax,eax / mov eax,0 00
…001…0x11 TZMSK TBM ~x & (x-1); ~(x | -x); (x & -x)-1 lea edx,[eax-1]; not eax; and eax,edx -10
…010…0x12 BLSI BMI1 x & -x; x & (~x+1) mov edx,eax; neg eax; and eax,edx 01
…011…0x13 BLSMSK BMI1 x ^ (x-1); -x ^ ~x lea edx,[eax-1]; xor eax,edx -11
…100…0x14     x ^ -x mov edx,eax; neg eax; xor eax,edx 0-2
…101…0x15 BLSIC TBM ~x | (x-1) lea edx,[eax-1]; not eax; or eax,edx -1-2
…110…0x16     x | -x mov edx,eax; neg eax; or eax,edx 0-1
…111…0x17 MOV   -1 or eax,-1 / mov eax,-1 -1-1
…x00…0x18 BLSR BMI1 x & (x-1) lea edx,[eax-1]; and eax,edx 0-2
…x01…0x19 DEC   x-1 dec eax -1-2
…x10…0x1a -   x - 0-1
…x11…0x1b BLSFILL TBM x | (x-1) lea edx,[eax-1]; or eax,edx -1-1
…~00…0x1c     ~(x | (x-1)); -x & ~x lea edx,[eax-1]; or eax,edx; not eax 00
…~01…0x1d NOT   ~x not eax -10
…~10…0x1e NEG   -x neg eax 01
…~11…0x1f     ~(x & (x-1)); -x | ~x lea edx,[eax-1]; and eax,edx; not eax -11

I assigned the operation modes in order to demonstrate additional properties. Let us define the function tbm(x, mode) where mode is the operation mode. You will find an implementation in perm_bas.*.

Generally the following holds:

tbm(x, mode) = tbm(x, mode & 0x11) | tbm(x, mode & 0x12) | tbm(x, mode & 0x1c)


For (mode & 0x08) == 0 (upper bits are filled with 0 or 1) the following holds:

tbm(x, mode) = ~tbm(x, mode ^ 0x07)
tbm(x, mode) = tbm(~x, mode ^ 0x10)


For (mode & 0x08) != 0 (upper bits are kept or inverted) the following holds:

tbm(x, mode) = ~tbm(~x, mode ^ 0x13)
tbm(x, mode) = tbm(~x, mode ^ 0x14)
tbm(x, mode) = tbm(x+q, mode ^ 0x10)


Here q = 1 for (mode & 0x10) == 0 and q = -1 for (mode & 0x10) != 0.

## Iterators

Would not it be nice to be able to externalize the logic of looping or other "framing routines"?
There is a solution: iterators! It is a shame that this simple mechanism has not caught on but instead is usually mimicked by inferior and/or complicated constructions; try to implement an iterator which steps a tree.
Life could be so simple. Caution: Sarcasm present in this section.

Once upon a time there were some clever folks who designed ALGOL 58 aka IAL (1958). This language proposed a special parameter passing method which was called pass-by-name and allowed to evaluate parameters in the context of the caller whenever the called procedure wanted to do so. At least in ALGOL-60 (1960) this feature really worked. Some gurus such as Donald E. Knuth liked it but it seems that most programmers have not understood its striking value. As it is very simple to abuse this feature, especially when using it to also write to parameters, it became discredited and consequently was more or less banned from subsequent languages...

A different approach specialized for loops was implemented in Alphard and a bit later in CLU (1975). The idea was to provide a special loop statement where most of the loop control was performed by an "iterator".

The C++ STL features iterators which are somewhat flexible but neither simple nor are intuitively or fool-proof to use.
The delegates of C++11 do the intended job but are also neither simple nor readable nor produce perfect code.
C/C++ does not support nested functions. MetaWare aka Synopsys however demonstrated a collection of compilers (C/C++/Pascal) which supported nested functions and iterators probably since 1985.
The committees seem to shy away from introducing simple things like nested functions because "nobody needs these and it can be emulated anyway", also, communities like BOOST seem to prefer hacks to clean solutions in the very same spirit. This is IMHO programming with pincers and hammers.

C# also features iterators but these can not delegate the yield to other routines nor can act recursively. Instead of doing what is needed in a straight-forward way, they do a monstrous job of code morphing, thereby generating a class and a state machine. Bizarre!
The delegates of C# do the intended job like the ones of C++ but are also too complicated.
There must be a reason why some containers such as lists offer a dedicated iterator method such as ForEach which is much faster and simpler than the foreach language feature...
C# does not support nested functions.

ISO Pascal features procedural parameters capable of transferring nested procedures for a long time. Also, Free Pascal supports nested procedures as parameters since version 2.6.0. Unfortunately Borland/Embarcadero did not implement them until now although their Turbo-Vision (1990) sketched iterators by using nested procedures in an unsafe way.

IMHO the solution was present almost from the very beginning but maybe the necessities were not really understood correctly. All that is needed is the ability to transfer a block of code with the meaning of a nested procedure; evidently this block should have access to all variables as usual.
The potential looping should not be thought to be the duty of the calling code, hence break is not an option. Also, there is no need to generate any object, capture any variable or to switch between contexts or even coroutines; simple function calls are sufficient; needless to say, the appropriate scope parameter must be provided as is usual when dealing with nested functions.

Several years ago I published a working emulation of iterators for Delphi on the Internet but got almost no feedback.
I also posted a feature request on Borland's site. Same result.
Nevertheless this emulation has become an essential tool in code of the company I work for and proved to greatly simplify life by avoiding complicated alternatives.