Dec 27, 2020

Instruction compression dates back at least to MIPS16 (1996) [1]. Macro-op fusion dates back at least to the Pentium M (2003) [2]. These are old ideas.

[1] https://en.wikipedia.org/wiki/MIPS_architecture

[2] https://www.agner.org/optimize/microarchitecture.pdf

Jun 24, 2020

You need to look at the disassembly of the generated binary to make sense of this sort of performance variation (paying attention to cache line boundaries for code and data), and even so, it is highly non-trivial. The performance counters found in modern processors sometimes help (https://en.wikipedia.org/wiki/Hardware_performance_counter).

https://www.agner.org/optimize/microarchitecture.pdf contains the sort of information you need to have absorbed before you even start investigating. In most cases, it's not worth acquiring the expertise for a 5% difference one way or the other in micro-benchmarks. If those 5% matter to you, you shouldn't be programming in C in the first place.

And then there is this anecdote:

My job is to make tools to detect subtle undefined behaviors in C programs. I once had the opportunity to report a signed arithmetic overflow in a library that its authors considered, rightly or wrongly, to be performance-critical. My suggestion was:

… this is not one of the subtle undefined behaviors that we are the only ones to detect, UBSan would also have told you that the library was doing something wrong with “x + y” where x and y are ints. The good news is that you can write “(int)((unsigned)x + y)”, this is defined and it behaves exactly like you expected “x + y” to behave (but had no right to).

And the answer was “Ah, no, sorry, we can't apply this change, I ran the benchmarks and the library was 2% slower with it. It's a no, I'm afraid”.

The thing is, I am pretty sure that any modern optimizing C compiler (the interlocutor was using Clang) has been generating the exact same binary code for the two constructs for years (unless it applies an optimization that relies on the addition not overflowing in the “x + y” case, but then the authors would have noticed). I would bet a house that the binary that was 2% slower in benchmarks was byte-identical to the reference one.
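
For concreteness, here is the pattern as a self-contained sketch (the function names are mine, not from the original report):

    #include <limits.h>
    #include <stdio.h>

    /* Undefined behavior whenever the mathematical result of x + y
       does not fit in an int. */
    int add_ub(int x, int y) {
        return x + y;
    }

    /* Defined for all inputs: the unsigned addition wraps modulo 2^N,
       and converting the out-of-range result back to int is
       implementation-defined (it wraps on GCC and Clang), not undefined. */
    int add_wrap(int x, int y) {
        return (int)((unsigned)x + y);
    }

    int main(void) {
        printf("%d\n", add_wrap(INT_MAX, 1)); /* prints -2147483648 on GCC/Clang */
        return 0;
    }

At -O2, GCC and Clang compile both functions to the same single addition on x86-64, which is exactly why a byte-identical binary was the expected outcome.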

May 29, 2020

> setcc actually does depend on the previous value of the register, because it only comes in the low-byte variants

You should be able to break the dependency and avoid the partial-register stall by zero-extending the byte afterwards: movzx eax, al

See: https://stackoverflow.com/questions/41573502/why-doesnt-gcc-... ; https://software.intel.com/en-us/forums/intel-isa-extensions... ; https://www.agner.org/optimize/microarchitecture.pdf (section 6.8)
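
As a minimal illustration in C (the function is my own example, not from the thread):

    /* Returns 1 if x < y, else 0.  At -O2 on x86-64, GCC lowers this to
     *     xor   eax, eax      ; zero the whole register first,
     *     cmp   edi, esi
     *     setl  al            ; then write only the low byte
     * while Clang emits
     *     cmp   edi, esi
     *     setl  al
     *     movzx eax, al       ; zero-extend, breaking the dependency
     * Either idiom avoids a false dependency on the old value of eax. */
    int less_than(int x, int y) {
        return x < y;
    }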

Apr 25, 2020

Great, I'll take a look in a bit, although it might take me until tomorrow to have time to do much with it.

In the meantime, I'll mention that my first quick discovery is that clang seems to be significantly faster than gcc on the standard C code. The ratio changes with different versions and compilation options, but on Skylake with "-Ofast -march=native" I find clang-6.0 to be almost twice as fast as gcc-8. So if you have clang installed, check and see if it might be a better baseline.

Also, what system are you running? This shouldn't make a difference to execution speed, but it will make it easier to suggest tools. If you are running some sort of Linux, now would be a good time to get familiar with 'perf record'!

Edit:

> Intel(R) Pentium(R) CPU N3700 @ 1.60GHz

Hmm, that's a "Braswell" part, which unfortunately isn't covered in Agner's standard guide to instruction timings (https://www.agner.org/optimize/microarchitecture.pdf), and I'm not familiar with its characteristics. This might make profiling a little more approximate.