Discussion about this post

Bharath Suresh

This is some really great analysis, very impressive!

The "AMD spike" is very interesting. It's possible that their prefetch policy varies based on the delta between cache line accesses:

- At stride 16, we have a cache line delta of 1 (basically every access is on an adjacent cache line)

- At stride 32, we have a cache line delta of 2

Between these strides, it might just be using a simple prefetch scheme that always fetches the adjacent cache line - so you would end up with many wasted prefetches.

Beyond 32, when the misses are never seen on adjacent cache lines (delta >= 2), they might be using a more advanced prefetch scheme (stride based, etc.), which might improve the runtime.
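To make the stride-to-delta arithmetic above concrete, here is a minimal sketch in C. It assumes 4-byte elements and 64-byte cache lines, which is what makes stride 16 land exactly one line apart; neither value is confirmed by the post, and the names ELEM_SIZE, CACHE_LINE_SIZE and cache_line_delta are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, not taken from the post: 4-byte elements, 64-byte cache lines. */
enum { ELEM_SIZE = 4, CACHE_LINE_SIZE = 64 };

/* Number of cache lines between two consecutive accesses for a stride given
 * in elements. A result of 0 means consecutive accesses can share a line. */
static size_t cache_line_delta(size_t stride_elems)
{
    assert(stride_elems > 0); /* a zero stride is not a meaningful access pattern */
    return (stride_elems * ELEM_SIZE) / CACHE_LINE_SIZE;
}
```

Under these assumptions, cache_line_delta(16) is 1 and cache_line_delta(32) is 2, matching the two bullet points above.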

They talk a little about prefetching in this document - https://docs.amd.com/v/u/en-US/58455_1.00

(It doesn't conclusively confirm what I described, though - I'm guessing here as well)

By the way, have you considered trying to disable the hardware prefetcher? (I think you can do it from the BIOS settings) That might give more predictable results.

Another interesting experiment could be to try the __builtin_prefetch() GNU extension (https://www.daemon-systems.org/man/__builtin_prefetch.3.html) - I don't know if this would be better than the "warmup" runs that you have tried.
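For what it's worth, a rough sketch of how that could look in a strided loop is below. It is not the post's benchmark: the function name strided_sum_prefetch and the PREFETCH_AHEAD distance are hypothetical, and the right distance would have to be found by measuring. It needs GCC or Clang, since __builtin_prefetch is a compiler builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many strided accesses ahead to prefetch. */
enum { PREFETCH_AHEAD = 8 };

/* Sums every stride-th element of data[0..n), issuing a software prefetch
 * for the element that will be read PREFETCH_AHEAD iterations later. */
static long strided_sum_prefetch(const int *data, size_t n, size_t stride)
{
    assert(stride > 0);
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
        if (ahead < n) {
            /* Arguments: address, rw = 0 (read), locality = 3 (high temporal locality). */
            __builtin_prefetch(&data[ahead], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing might change, and on some strides the hardware prefetcher may already be doing the same job.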

But great work overall! Had fun thinking through this.

Babbage

This is fascinating! Thank you 🙏
