This is a poor performance comparison. All of these networks are CNNs, and quite old architectures at that. They are all probably memory-bandwidth bound, which is why you see the consistent ~50% improvement in FP32 performance.
It is also not clear what batch sizes were used for any of the tests. If you switch to FP16 training, you must increase the batch size to keep the Tensor Cores properly utilized.
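Beyond raw batch size, Tensor Cores on these cards are only fully engaged when the GEMM dimensions (batch, hidden size, sequence length) are multiples of 8 in FP16. A minimal sketch, where `pad_to_multiple` is a hypothetical helper, not part of any framework:

```python
def pad_to_multiple(n: int, multiple: int = 8) -> int:
    """Round n up to the nearest multiple (8 for FP16 Tensor Core GEMMs)."""
    return ((n + multiple - 1) // multiple) * multiple

# A batch of 100 would be padded up to 104 to stay Tensor Core friendly;
# 64 is already aligned and passes through unchanged.
print(pad_to_multiple(100))  # -> 104
print(pad_to_multiple(64))   # -> 64
```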
If you compare these cards at FP16 on large language models (think GPT-style, with a large model dimension), I am confident you will see the Titan RTX outperform the 3090. The former has 130 TF/s of FP16 (FP32-accumulate) Tensor Core throughput, while the latter has only 70 TF/s.
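A back-of-envelope sketch of what those peak numbers imply for one large GEMM (dimensions chosen for illustration; real utilization will be well below peak):

```python
# Ideal time for a single M=N=K=8192 FP16 GEMM at the peak Tensor Core
# throughputs quoted above (130 and 70 TF/s with FP32 accumulate).
M = N = K = 8192
flops = 2 * M * N * K  # each multiply-accumulate counts as 2 FLOPs

for name, tflops in [("Titan RTX", 130), ("RTX 3090", 70)]:
    t_ms = flops / (tflops * 1e12) * 1e3
    print(f"{name}: {t_ms:.2f} ms at peak")
```

At peak, that works out to the Titan RTX finishing the same GEMM roughly 1.9x faster, which is the gap I'd expect to show up on compute-bound transformer workloads.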