There have been a lot of Ryzen 3 or EPYC 2 discussions recently, so I figured that the community may be interested in the official documentation for how to optimize these processors.
The lingo is a bit hard to get into if this is your first optimization manual, but it's terse and a relatively short read. If any "beginners" wish to tackle this text, I suggest Agner Fog's tutorials as starter material: https://www.agner.org/optimize/microarchitecture.pdf
My feeling is that the quality of software engineering has gone up dramatically in the past decade.
> It feels like a lot of engineers nowadays don't seem to have a good CS background. They don't seem to understand things like cache, paging, virtual memory, CPU pipelines, algorithms or other things pretty fundamental to CS.
Pipelines and caches and paging and virtual memory are stupidly complicated in modern processors. If you claim to understand these things and you don't either work at the company or have an NDA with the company so that you can implement drivers, you're probably full of shit.
What I can't stand are the "highly-ranked" schools that introduce students to a very basic and abstract (and outdated) notion of these topics, so that students enter the workforce overconfident that they have understood the topic. You haven't understood the topic, and having some rough notion of the topic can often be worse than not knowing anything at all.
tl;dr: Modern processors are proprietary IP and you should be skeptical of anyone who claims to deeply understand them but doesn't work for the company making them. You do not need to understand how one works to be a great software engineer.
I think the tech press still tells us what they can, and stuff like execution ports, reorder windows, etc. is still publicly disclosed. AT talked about what was publicly said about Zen 2 (https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...) and Sunny Cove (https://www.anandtech.com/show/14514/examining-intels-ice-la...). And their reviews do try to report the top observable results (memory latencies, relative performance on different kinds of task, power/clock info) and all that's arguably of more practical importance to lots of folks anyway.
There's also just the trend of modern designs being tricky enough it's harder to infer as much about them and harder to write accessibly about what you do know; it's not super easy to figure out and describe, say, modern branch predictors simply because they're all layering a lot of strategies on each other.
For example, from Haswell on, Agner Fog essentially said Intel's large-core branch predictors are good at lots of things but there's not much he can say about how they work (p29 at https://www.agner.org/optimize/microarchitecture.pdf). Writing code to beat Cortex-A76 prefetchers, AT's Andrei Frumusanu had difficulty fooling them with anything other than essentially-random access patterns and compared them to "black magic" (https://twitter.com/andreif7/status/1102230575522430977). These aren't just random folks saying "wow, CPUs are complicated"; they successfully figured out lots of stuff about past generations of CPU.
AMD did reference the TAGE family of branch predictors, which there's lots about in public literature. There might be some broadly interesting stuff in the vendors' contributions to gcc/LLVM (machine models and arch-specific optimizations).
Maybe ARM implementors talk a little more about their stuff? That might have something to do with the dynamics of the relatively open/diverse market for ARM SoCs versus the long-running one-on-one-ish x86 rivalry.
Hard to boil all that down to a single point, but if AMD and Intel want to talk more about the guts of their products, I'm sure plenty of grateful wonks would lap it up. :)
Agner Fog's optimization guide is worth a very close reading: https://www.agner.org/optimize/microarchitecture.pdf
If you want smaller chunks, Denis Bakhvalov is starting to hit his stride with a nice series of blog posts: https://easyperf.net/notes/
If that's not enough, Ingve (who submitted this) has made literally thousands of submissions, and a surprising number are excellent articles on optimization: https://news.ycombinator.com/submitted?id=ingve
It's not usually a big difference, but on modern Intel there is some difference in performance between the different CMP/JCC options. The more common ones will "fuse" with the CMP and be executed as a single µop, but the rarer ones (like JO and JS) do not fuse with CMP, and thus can add a cycle of delay (and have the overhead associated with executing another µop). The optimization is called "macro-op fusion". Details here https://en.wikichip.org/wiki/macro-operation_fusion and here https://www.agner.org/optimize/microarchitecture.pdf (pages 108 and 125).
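To make the CMP/JCC rule above concrete, here's a rough Python sketch of it as a lookup table. The function names are made up for illustration. Only JO and JS come from the comment above; the rest of the table is my assumption based on how the fusion rules are usually described for Sandy Bridge and later Intel cores, so check the linked references for your specific microarchitecture. It also ignores the other conditions fusion depends on (the two instructions being adjacent, operand forms, and so on).

```python
# Hypothetical helpers: the names and the table are illustrative and do not
# come from any vendor tool. Assumption: Sandy Bridge-style rules for CMP.

# Jcc mnemonics assumed to fuse with a preceding CMP (zero, carry, and the
# signed-comparison conditions, including their aliases).
FUSES_WITH_CMP = {
    "JE", "JZ", "JNE", "JNZ",
    "JB", "JC", "JNAE", "JAE", "JNB", "JNC",
    "JBE", "JNA", "JA", "JNBE",
    "JL", "JNGE", "JGE", "JNL",
    "JLE", "JNG", "JG", "JNLE",
}

# Jcc mnemonics assumed NOT to fuse with CMP (overflow, sign, parity).
DOES_NOT_FUSE_WITH_CMP = {"JO", "JNO", "JS", "JNS", "JP", "JPE", "JNP", "JPO"}


def cmp_fuses_with(jcc: str) -> bool:
    """Return True if CMP followed by this Jcc is assumed to macro-fuse."""
    name = jcc.upper()
    if name in FUSES_WITH_CMP:
        return True
    if name in DOES_NOT_FUSE_WITH_CMP:
        return False
    raise ValueError(f"unknown conditional jump: {jcc!r}")


def uops_for_cmp_jcc(jccs) -> int:
    """Count µops for a sequence of CMP+Jcc pairs: 1 if fused, else 2."""
    return sum(1 if cmp_fuses_with(j) else 2 for j in jccs)


if __name__ == "__main__":
    for jcc in ("JE", "JL", "JO", "JS"):
        print(jcc, "fuses with CMP" if cmp_fuses_with(jcc) else "does not fuse")
```

This only counts µops; it says nothing about actual cycle counts, which depend on everything else going on in the pipeline.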