Besides changing optimization flags (e.g. -O3) and enabling LTO, the most
important thing you can do to optimize C++ programs is to use profile-guided
optimization (PGO), also known as feedback-directed optimization (FDO). In
2025 I think the best option is AutoFDO, which I need to blog about because the
existing documentation is lacking. As a result, it seems like only companies
with LLVM developers on staff (e.g. FAANG companies) have the expertise to
actually use AutoFDO. Without divulging too much proprietary
information, using AutoFDO makes code at work nearly 10% faster compared to
just compiling with -O3 and ThinLTO. This is a huge optimization.

The short version of how FDO works is that you set up a workflow that uses
perf-record(1) (or potentially perf_event_open(2) if you're adventurous) to
periodically sample branches taken. You can even run this in production (say
0.1% of the time) and then merge the sample data. These merged samples get fed into
an LLVM pipeline that lets LLVM know what code paths are hot and cold. This
information can then be used during codegen for executables or libraries. There
are potentially numerous ways branch information can be used, but one of the
main ones is laying out branches so that hot paths are adjacent (i.e. they all
or mostly fall through) in the generated code. This increases icache hit rates
which can decrease stalls in the CPU pipeline and increase instructions per
cycle (IPC).
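
To make the hot/cold distinction concrete, here is a minimal C++ sketch of the kind of code where branch samples matter. The names `process` and `handle_rare` and the 1-in-1000 ratio are made up for illustration. The commands in the trailing comment are one plausible way to drive the pipeline, assuming Linux perf with branch sampling support and the `llvm-profgen` and `llvm-profdata` tools that ship with LLVM; treat them as a starting point rather than a definitive recipe.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical cold path: taken for roughly 1 in 1000 inputs.
// noinline keeps it a separate function so it shows up as its own
// entry in a profile.
__attribute__((noinline)) std::uint64_t handle_rare(std::uint64_t x) {
  return x * 31 + 7;
}

// Hypothetical hot loop. Branch samples would show that the `else` arm
// runs on almost every iteration, which is what lets the compiler lay it
// out as the fall-through path.
std::uint64_t process(const std::vector<std::uint64_t>& values) {
  std::uint64_t sum = 0;
  for (std::uint64_t x : values) {
    if (x % 1000 == 0) {
      sum += handle_rare(x);  // cold
    } else {
      sum += x;  // hot
    }
  }
  return sum;
}

// One possible workflow (binary and file names are placeholders):
//   clang++ -O3 -g app.cc -o app
//   perf record -b ./app
//   llvm-profgen --binary=./app --perfdata=perf.data --output=app.prof
//   llvm-profdata merge --sample app.prof other.prof -o merged.prof
//   clang++ -O3 -g -fprofile-sample-use=merged.prof app.cc -o app
```

The source code is identical in both builds; only the profile passed to the second compile tells the compiler which arm of the branch is the common one.
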

For this mechanism to work at all, the LLVM pipeline needs a way to map program
counter addresses back to source lines of code. This is what DWARF does, and you
already have this information, including information about inlined calls, if you
compile your code with full debug information using -g. However, compiling with
full debug information can lead to huge executables and significantly increase
compile times. There are mitigation techniques for this (like split debug
information), but they are complicated and compiling with full debug information
still takes more time and uses more space. An alternative is compiling code with
-gmlt (which is an alias for -gline-tables-only). In this mode the DWARF
information for the executable contains just enough information to map any
program counter address back to the source line of code it was generated from.
Using -gmlt lets you generate PGO or FDO profiles for code with only a modest
increase in executable size, since the line table information is encoded in a
very compact way and contains a lot less information than a full debug build
would have. Using -gmlt is also very useful if you want a mechanism to
automatically print stack traces on crashes, since the signal handler code can
walk back the stack using frame pointers and then attribute each frame in the
stack to a specific file and line of code. If you are using a stripped binary
without debug information, the best a stack trace printer can do is print
which exported functions are in the stack trace, without line numbers or file
names.
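
As a mental model for what a line table gives you, the sketch below maps a program counter to a file and line by binary search over sorted address rows. This is a toy under stated assumptions: `LineRow` and `lookup` are hypothetical names, and real DWARF line tables are stored in a compact encoded form that tools decode for you, so this shows only the shape of the lookup, not the actual format.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for one row of a line table: instructions starting at
// `address` (up to the next row's address) came from `file`:`line`.
struct LineRow {
  std::uint64_t address;
  std::string file;
  unsigned line;
};

// Returns the row covering `pc`, or nullptr if `pc` lies outside
// [rows.front().address, end). `rows` must be sorted by address.
const LineRow* lookup(const std::vector<LineRow>& rows, std::uint64_t end,
                      std::uint64_t pc) {
  if (rows.empty() || pc < rows.front().address || pc >= end) return nullptr;
  // Find the first row whose start address is greater than pc; the row
  // just before it is the one that covers pc.
  auto it = std::upper_bound(
      rows.begin(), rows.end(), pc,
      [](std::uint64_t value, const LineRow& row) {
        return value < row.address;
      });
  return &*(it - 1);
}
```

A crash handler or profile generator would perform a lookup like this for each sampled or return address. In practice you would rely on an existing symbolizer rather than writing your own.
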

I recently learned about a fairly obscure Clang flag that is critical for
generating profiles for code built with -gmlt. The flag is called
-fdebug-info-for-profiling. The Clang documentation mentions this flag in
passing, but in my opinion doesn't do enough to call out how important it is.
Essentially, when compiling with -gmlt you get file and line numbers for all
instruction addresses, but that's it. A single line of code may have many
function calls, especially after inlining. The -fdebug-info-for-profiling
option augments the function and line tables from -gmlt so they also include
full information about which specific function any instruction was generated
from, which lets profiling tools see through inlining. The original LLVM commit
shows that adding this option only increases the size of DWARF data by a small
amount compared to -gmlt alone. Using this option in conjunction with -gmlt
makes generated PGO or FDO profiles better, especially with respect to inlined
code.
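
A small illustration of why line numbers alone can be ambiguous: in the hypothetical code below, the single `return` line in `score` contains three call sites, two of them to the same helper. Once the helpers are inlined, file and line alone cannot distinguish the two copies of `clamp_low`. This is the kind of ambiguity the flag is meant to resolve. The build commands in the comment are a suggestion for comparing the emitted debug info yourself, with placeholder file names.

```cpp
// Hypothetical helpers, small enough that an optimizing compiler will
// likely inline them.
inline int clamp_low(int x) { return x < 0 ? 0 : x; }
inline int scale(int x) { return x > 100 ? x / 2 : x * 3; }

// One source line, three call sites, two of them to the same function.
int score(int a, int b) {
  return clamp_low(a) + scale(b) + clamp_low(b - a);
}

// Possible builds to compare (file names are placeholders):
//   clang++ -O3 -gmlt -c score.cc -o score_mlt.o
//   clang++ -O3 -gmlt -fdebug-info-for-profiling -c score.cc -o score_prof.o
//   llvm-dwarfdump --debug-info --debug-line score_prof.o
```
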

To be clear, you only need this option if you're using -gmlt for the purpose
of profile generation. If instead you're using -gmlt just to print stack
traces on crashes, you don't need this option.