Why Does -march=native Cause Crashes on My Zen2 Chip?

  • Thread starter: Vanadium 50
AI Thread Summary
The discussion concerns the behavior of GCC's -march and -mtune options on a Zen2 chip. -mtune appears to have little effect on performance when compiling with aggressive optimization flags such as -O3 and -Ofast, which may mean the code is already about as optimized as it will get. The default -march setting works, and -march=znver1 also works but is no faster. -march=native and -march=znver2, by contrast, both hang at pthread_create, even though -march=native should never produce code the CPU cannot run. The inquiry started from the observation that the slowest block of code contains multiply-and-add operations that might benefit from FMA instructions, which prompted the experiment with compiler switches.
Vanadium 50
I don't understand the -march/-mtune behavior.

This is Linux GCC, on a Zen2 chip.

-mtune seems to do nothing much. OK, sometimes your code is as tuned as it's going to get out of the box. I am compiling with -O3 and -Ofast, so maybe it is so optimized there is little to tune.

The default -march works fine. -march=znver1 works fine, but again, no faster. OK, again, maybe there's little to be done. -march=native and -march=znver2 should do the same thing. I guess they do. They both hang at pthread_create.
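To narrow down whether the hang is in pthread_create itself or somewhere in the surrounding program, a stripped-down test built with the same flags can help. The following is only a sketch: the names worker and run_one_thread are hypothetical and are not taken from the original program, and the STANDALONE macro is an assumption used here to switch the demo main on.

```c
/* Minimal pthread_create/pthread_join check (hypothetical helper names).
 * Build with the same flags as the real program, for example:
 *   gcc -O3 -march=znver2 -pthread -DSTANDALONE repro.c -o repro
 */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Worker: writes a marker value so the caller can confirm the thread ran. */
static void *worker(void *arg)
{
    int *flag = arg;
    *flag = 42;
    return NULL;
}

/* Creates one thread and joins it.
 * Returns 0 on success, or the pthread error code on failure. */
int run_one_thread(int *flag)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker, flag);
    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}

#ifdef STANDALONE
int main(void)
{
    int flag = 0;
    int rc = run_one_thread(&flag);
    printf("rc=%d flag=%d\n", rc, flag);
    assert(rc == 0 && flag == 42);
    return rc;
}
#endif
```

If this small program runs to completion with -march=znver2 while the full program still hangs, the problem is more likely in code that runs around thread creation than in pthread_create itself.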

This is not really a problem - I don't really need to squeeze the last bit of performance out of the code - but it sure seems mysterious.

gcc -v gives

Code:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-host-pie --enable-host-bind-now --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugs.almalinux.org/ --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --enable-initfini-array --without-isl --enable-multilib --with-linker-hash-style=gnu --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_64=x86-64-v2 --with-arch_32=x86-64 --build=x86_64-redhat-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC)
 
Vanadium 50 said:
-mtune seems to do nothing much. OK, sometimes your code is as tuned as it's going to get out of the box. I am compiling with -O3 and -Ofast, so maybe it is so optimized there is little to tune.
If you leave out those optimization options, does -mtune seem to do more?
 
I have not played with that. I am more interested in my -march crashes. The setting -march=native should never generate code that the CPU cannot handle.
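One way to test that assumption is to compare what the compiler was allowed to emit with what the CPU reports at run time. The sketch below assumes GCC (or a compatible compiler) on x86: __FMA__ is the macro GCC predefines when FMA code generation is enabled, and __builtin_cpu_supports is a GCC builtin. The function names and the STANDALONE macro are hypothetical.

```c
/* Compare compile-time FMA assumptions with the CPU's run-time report.
 * Example build:
 *   gcc -O3 -march=native -DSTANDALONE featcheck.c -o featcheck
 */
#include <assert.h>
#include <stdio.h>

/* 1 if the compiler was permitted to emit FMA instructions, else 0. */
int compiled_with_fma(void)
{
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

/* 1 if the running CPU reports FMA, 0 if it does not,
 * -1 if the check is unavailable (non-x86 target or other compiler). */
int cpu_reports_fma(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") ? 1 : 0;
#else
    return -1;
#endif
}

/* 1 if the build assumes FMA but the CPU does not report it. */
int fma_mismatch(void)
{
    return compiled_with_fma() == 1 && cpu_reports_fma() == 0;
}

#ifdef STANDALONE
int main(void)
{
    printf("compiled_with_fma=%d cpu_reports_fma=%d mismatch=%d\n",
           compiled_with_fma(), cpu_reports_fma(), fma_mismatch());
    return fma_mismatch();
}
#endif
```

A reported mismatch would mean the build assumes an instruction the CPU does not advertise. No mismatch would suggest the hang has a different cause. Running gcc -march=native -Q --help=target shows which -march value and feature flags GCC actually resolved for the machine.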

The way I got into this rabbit hole was noticing that the block of code that takes the longest has some multiply-and-adds. This was to see if the compiler could speed it up by using FMA instructions. What could be simpler than throwing a compiler switch?
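For illustration, a multiply-and-add loop of the kind described might look like the sketch below. The function dot and the STANDALONE macro are hypothetical stand-ins, not the original code. The gcc -v output above shows a default of x86-64-v2, a level that does not include FMA, so the default build cannot use FMA instructions, whereas -march=znver1 and -march=znver2 both enable them.

```c
/* A multiply-and-add loop that the compiler may turn into FMA instructions.
 * To inspect the generated code (file name is an example):
 *   gcc -O3 -march=znver1 -S dot.c -o dot.s
 * and look for instructions whose names start with vfmadd.
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Accumulates a[i] * b[i]. When the target supports FMA and floating-point
 * contraction is allowed, each step can become one fused instruction. */
double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

#ifdef STANDALONE
int main(void)
{
    const double a[] = {1.0, 2.0, 3.0};
    const double b[] = {4.0, 5.0, 6.0};
    double r = dot(a, b, 3);
    printf("dot = %g\n", r);
    assert(r == 32.0); /* small integers are exact with or without FMA */
    return 0;
}
#endif
```

Because a fused multiply-add rounds once instead of twice, results on general floating-point data can differ in the last bits between FMA and non-FMA builds. The small integer inputs above are exact either way.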
 