Interesting stuff, and I see that Nathan commented there as well, where he observed "what matters is that the store-to-load forwarding does not try to execute in the same cycle." The Agner Fog paper on Intel microarchitecture is probably the most relevant in terms of puzzling it out.
Again, the reason I suspect the branch predictor in this sort of case is that when the loops are essentially 100% inside the cache, practically the only thing that varies the actual execution rate is whether or not a branch is mispredicted. That said, the flow through these sorts of pipelined execution units is anything but clear.
They are decent, but don't seem quite as good as the ones on AMD's (or even more Intel's) processors.
Qualcomm Centriq 2434 [https://www.nextplatform.com/2017/11/08/qualcomms-amberwing-...]:
- 40 cores (no SMT)
- 2.5 GHz peak
- 4 uops/instructions per cycle [https://www.qualcomm.com/media/documents/files/qualcomm-cent...]
- 110 W TDP
- 10 Guops/s/core
- 0.011 Guops/s/core/$
- 400 Guops/s
- 0.45 Guops/s/$
- 3.63 Guops/s/W
AMD Epyc 7401P [https://en.wikipedia.org/wiki/Epyc]:
- 24 cores (2x SMT)
- 2.8 GHz all-core boost
- 6 uops per cycle [http://www.agner.org/optimize/microarchitecture.pdf]
- 170 W TDP
- 16.8 Guops/s/core
- 0.016 Guops/s/core/$
- 403 Guops/s
- 0.37 Guops/s/$
- 2.37 Guops/s/W
So based on this the AMD processor has roughly 170% of the Qualcomm's per-core performance, is equal on total throughput, and has 83% of Qualcomm's total throughput per $ and 65% of Qualcomm's total throughput per W.
Note that the AMD CPU has SMT, which improves utilization, while the Qualcomm doesn't, and AMD's components are probably faster (due to higher TDP and more experience making CPUs), so it looks like the AMD CPUs are likely to be strictly better in practice except possibly on performance/watt.
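For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the per-core, total, and per-watt figures from the listed core counts, clocks, uops per cycle, and TDPs. The `Cpu` class and its property names are made up for illustration; prices are left out because the list only gives the derived per-$ numbers, not the prices themselves.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Hypothetical helper holding the spec-sheet numbers quoted above."""
    name: str
    cores: int
    ghz: float
    uops_per_cycle: int
    tdp_watts: float

    @property
    def guops_per_core(self) -> float:
        # Peak uops per second per core, in units of 1e9.
        return self.ghz * self.uops_per_cycle

    @property
    def guops_total(self) -> float:
        return self.cores * self.guops_per_core

    @property
    def guops_per_watt(self) -> float:
        return self.guops_total / self.tdp_watts


centriq = Cpu("Qualcomm Centriq 2434", cores=40, ghz=2.5, uops_per_cycle=4, tdp_watts=110)
epyc = Cpu("AMD Epyc 7401P", cores=24, ghz=2.8, uops_per_cycle=6, tdp_watts=170)

for cpu in (centriq, epyc):
    print(f"{cpu.name}: {cpu.guops_per_core:.1f} Guops/s/core, "
          f"{cpu.guops_total:.0f} Guops/s, {cpu.guops_per_watt:.2f} Guops/s/W")

# AMD relative to Qualcomm, as fractions.
per_core_ratio = epyc.guops_per_core / centriq.guops_per_core
per_watt_ratio = epyc.guops_per_watt / centriq.guops_per_watt
print(f"per-core ratio: {per_core_ratio:.2f}, per-watt ratio: {per_watt_ratio:.2f}")
```

These are peak issue-width numbers only, so they say nothing about how well either core keeps its pipeline fed on real code.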
Also, with AMD/Intel, albeit at much lower performance/$, you can have 32-core instead of 24-core CPUs and there is up to 4/8-way SMP support that Qualcomm doesn't mention.
Many if not most HPC/sci. comp applications are memory bound (or actually their implementations are). [ref missing, but google around and you'll find plenty]
More and more applications are drifting into the memory-bound regime, especially with the wider SIMD instruction sets increasing arithmetic throughput while memory throughput lags behind.
My back-of-the-envelope calculation (with a guesstimated AVX512 clock) gives about 12 FLOPS/byte for a big Skylake chip like the 8176, while this was around 9 FLOPS/byte for Broadwell. I'm not entirely sure about the instruction throughput of Zen, but it looks like the 7601 should be around 4-5 FLOPS/byte (that's worst case with a mixed FMA+ADD workload, based on what I checked in Agner Fog's manual, IIUC).
Of course this does not consider NUMA and other effects, but given the above a lot of applications will benefit from the great bandwidth advantage of EPYC.
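A rough sketch of the kind of back-of-the-envelope FLOPS/byte estimate mentioned above: peak floating-point rate divided by peak memory bandwidth. Every input below is my own assumption and not necessarily what the commenter used: the core counts, the guessed sustained clocks, the double-precision FLOPs per cycle per core, and the DDR4-2666 channel counts. The function name is hypothetical too.

```python
def flops_per_byte(cores, ghz, flops_per_cycle, mem_channels, mega_transfers_per_s,
                   bytes_per_transfer=8):
    """Peak FLOPS divided by peak DRAM bandwidth (machine balance)."""
    peak_gflops = cores * ghz * flops_per_cycle
    # One 64-bit channel moves 8 bytes per transfer; result in GB/s.
    peak_gbytes_per_s = mem_channels * mega_transfers_per_s * bytes_per_transfer / 1000.0
    return peak_gflops / peak_gbytes_per_s


# Assumed inputs: 28 cores, a guessed 1.7 GHz AVX-512 clock,
# 32 DP FLOPs/cycle (two 512-bit FMA units), 6 channels of DDR4-2666.
skylake_8176 = flops_per_byte(28, 1.7, 32, 6, 2666)

# Assumed inputs: 32 cores, a guessed 2.7 GHz all-core clock,
# 8 DP FLOPs/cycle (two 128-bit FMA units), 8 channels of DDR4-2666.
epyc_7601 = flops_per_byte(32, 2.7, 8, 8, 2666)

print(f"Skylake 8176 (assumed inputs): {skylake_8176:.1f} FLOPS/byte")
print(f"EPYC 7601 (assumed inputs):    {epyc_7601:.1f} FLOPS/byte")
```

A higher number means more arithmetic is available per byte of memory traffic, so a memory-bound code leaves more of the chip idle.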
Agner on microarchitecture, page 213, with another mention as a bottleneck on page 216:
The Ryzen supports the AVX2 instruction set. 256-bit AVX and AVX2 instructions are split into two µops that do 128 bits each.
AVX2 increased register width from 128-bit (AVX) to 256-bit, yet Ryzen cores can only process them 128-bit at a time. There is more to AVX2 than just width but that means in comparison to Intel processors, which can do the full 256-bit in a µop, the Ryzen throughput will suffer in tests that heavily emphasize AVX2 instructions (think video encoding).
600% faster is absolutely unbelievable. According to Agner's observations, this should not be true at all: http://www.agner.org/optimize/microarchitecture.pdf.
The Ryzen architecture apparently has half throughput for 256-bit vector instructions compared to 128-bit vector instructions, while Skylake (no data for Kaby Lake yet) has full throughput for 256-bit vectors. Since the 1800X has twice the cores, it should be about the same throughput if all cores are used, so if it isn't multi-threaded, worst case should be half, no?
This guy also seems a bit biased »it has to throttle itself down below the base clock to prevent itself from imploding -- just like their GPUs!«.
There is also a benefit to using 256-bit instructions, contrary to what he says: They're more dense than having 2 separate 128-bit instructions.
I wouldn't call it a bold-faced lie either, it's not like they said »Using 256-bit vector instructions is much faster than using 128-bit vector instructions!«.
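The worst-case estimate above (half the per-core 256-bit throughput, twice the cores) can be written out as a tiny Python sketch. It assumes equal clocks and perfectly vector-bound code, and the core counts of 8 and 4 are placeholders standing in for "twice the cores"; the function name is hypothetical.

```python
def relative_256bit_throughput(cores, per_core_rate, threads=None):
    """Aggregate 256-bit vector throughput in arbitrary units.

    per_core_rate is 1.0 for a core that executes a 256-bit op as one uop
    and 0.5 for a core that splits it into two 128-bit uops.
    """
    active_cores = cores if threads is None else min(cores, threads)
    return active_cores * per_core_rate


# Placeholder core counts: 8 split-uop cores vs. 4 full-width cores.
ryzen_all = relative_256bit_throughput(8, 0.5)
intel_all = relative_256bit_throughput(4, 1.0)
ryzen_single = relative_256bit_throughput(8, 0.5, threads=1)
intel_single = relative_256bit_throughput(4, 1.0, threads=1)

print("all cores, Ryzen/Intel:", ryzen_all / intel_all)
print("single thread, Ryzen/Intel:", ryzen_single / intel_single)
```

Under these assumptions the ratio never drops below one half, which is why a 600% gap is hard to explain by vector width alone.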
I think that's because (at least on some CPUs) it takes three macro-operations. http://www.agner.org/optimize/microarchitecture.pdf (section 17.4, page 188):
"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:
    ; Example 17.1. AMD instruction breakdown
    xchg eax, ebx ; Vector path, 3 ops
    nop           ; Direct path, 1 op
    xchg ecx, edx ; Vector path, 3 ops
    nop           ; Direct path, 1 op
This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."
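To make the "must decode alone" point concrete, here is a toy Python model of the decode step. It is a deliberate simplification and not Agner's model: it assumes each vector path instruction takes a decode cycle by itself, and that up to three consecutive direct path instructions share a cycle (the width of 3 is an assumption). The function name is hypothetical.

```python
def decode_cycles(paths, width=3):
    """Count decode cycles for a sequence of 'vector' / 'direct' instructions.

    Toy rule: a vector path instruction occupies a whole cycle alone;
    consecutive direct path instructions are grouped up to `width` per cycle.
    """
    cycles = 0
    pending_direct = 0  # direct path instructions waiting in the current group
    for path in paths:
        if path == "vector":
            if pending_direct:
                cycles += 1  # flush the partially filled direct path group
                pending_direct = 0
            cycles += 1      # the vector path instruction decodes alone
        else:
            pending_direct += 1
            if pending_direct == width:
                cycles += 1  # group is full
                pending_direct = 0
    if pending_direct:
        cycles += 1
    return cycles


# xchg, nop, xchg, nop from Example 17.1
interleaved = decode_cycles(["vector", "direct", "vector", "direct"])
# Same instructions with the two nops placed next to each other
grouped = decode_cycles(["vector", "vector", "direct", "direct"])
print(interleaved, grouped)
```

With the interleaved order every instruction ends up in its own cycle, which matches the 4 cycles in the quote; under this toy rule, putting the two nops together would save one cycle.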