This is some really great analysis, very impressive!
The "AMD spike" is very interesting. It's possible that their prefetch policy varies based on the delta between cache line accesses:
- At stride 16, we have a cache line delta of 1 (basically every access is on an adjacent cache line)
- At stride 32, we have a cache line delta of 2
Between these, it might just be using a simple prefetch scheme that always fetches adjacent cache lines - so you would end up with many wasted prefetches.
Beyond stride 32, when misses never land on adjacent cache lines (delta >= 2), they might switch to a more advanced prefetch scheme (stride-based, etc.) - that could improve the runtime.
They talk a little about prefetching in this document - https://docs.amd.com/v/u/en-US/58455_1.00
(This still doesn't convincingly explain what I described - I'm guessing as well.)
By the way, have you considered trying to disable the hardware prefetcher? (I think you can do it from the BIOS settings) That might give more predictable results.
Another interesting result could come from trying the __builtin_prefetch() GNU extension (https://www.daemon-systems.org/man/__builtin_prefetch.3.html) - I don't know if this would be better than the "warmup" runs that you have tried.
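Something like this is what I had in mind - a strided loop that prefetches a few iterations ahead. The loop, the PREFETCH_DISTANCE value, and the function name are all made up for illustration; the right distance would depend on the hardware:

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_DISTANCE 4

long strided_sum(const int *data, size_t n, size_t stride) {
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_DISTANCE * stride;
        if (ahead < n)
            /* args: address, rw (0 = read), temporal locality (0 = low) */
            __builtin_prefetch(&data[ahead], 0, 0);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so correctness is unaffected either way - which at least makes it cheap to experiment with.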
But great work overall! Had fun thinking through this.
Thanks! We did notice the cache control instructions, but since this article was about suggesting improvements to common programming paradigms, we thought it best to stick to scenarios that could plausibly occur in general codebases. Prefetch instructions would probably be too difficult to incorporate into complex programs (especially accounting for different hardware).
About the Zen5, your hypothesis is interesting, and it just goes to show how complex CPUs are today!
This is fascinating! Thank you 🙏