(Almost) Always use -fdebug-info-for-profiling with -gmlt

September 20, 2025

Besides changing optimization flags (e.g. -O3) and enabling LTO, the most important thing you can do to optimize C++ programs is to use PGO (profile-guided optimization) or FDO (feedback-directed optimization). In 2025 I think the best option is AutoFDO, which I need to blog about because the existing documentation is lacking. As a result, it seems like only companies with LLVM developers on staff (i.e. FAANG companies) have the expertise to actually use AutoFDO. Without divulging too much proprietary information, using AutoFDO makes code at work nearly 10% faster compared to just compiling with -O3 and ThinLTO. This is a huge improvement.

The short version of how FDO works is that you have a workflow that uses perf-record(1) (or potentially perf_event_open(2) if you're adventurous) to periodically sample the branches taken. You can even run this in production (say 0.1% of the time) and then merge the sample data. The merged samples get fed into an LLVM pipeline that tells LLVM which code paths are hot and which are cold. This information can then be used during codegen for executables or libraries. There are potentially numerous ways branch information can be used, but one of the main ones is laying out branches so that hot paths are adjacent (i.e. they all or mostly fall through) in the generated code. This increases icache hit rates, which can decrease stalls in the CPU pipeline and increase instructions per cycle (IPC).
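To make the branch layout idea concrete, here is a minimal sketch of giving the compiler the same kind of hint by hand. The function sum_digits is a hypothetical example, and __builtin_expect is a GCC/Clang builtin rather than standard C++. With FDO you would not write the hint yourself; the compiler would infer the branch weights from the sampled profile.

```cpp
#include <string>

// Hypothetical example: returns the sum of the decimal digits in `text`,
// or -1 if any character is not a digit. We assume nearly all input is valid.
long sum_digits(const std::string& text) {
  long total = 0;
  for (char c : text) {
    // Tell the compiler this condition is expected to be false. A sample
    // profile would supply the same information from measured data.
    if (__builtin_expect(c < '0' || c > '9', 0)) {
      return -1;  // cold path: can be laid out away from the loop body
    }
    total += c - '0';  // hot path: ideally a straight fall-through
  }
  return total;
}
```

Hand-written hints like this only cover the branches you think to annotate, and they go stale as workloads change. A profile covers every sampled branch in the program.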

For this mechanism to work at all, the LLVM pipeline needs a way to map program counter addresses back to source lines of code. This is what DWARF does, and you already have this information, including information about inlined calls, if you compile your code with full debug information using -g. However, compiling with full debug information can lead to huge executables and significantly longer compile times. There are mitigation techniques for this (like split debug information), but they are complicated, and compiling with full debug information still takes more time and uses more space. An alternative is compiling code with -gmlt (which is an alias for -gline-tables-only). In this mode the DWARF information for the executable contains just enough information to map any program counter address back to the source line of code it was generated from. Using -gmlt lets you generate PGO or FDO profiles for code with only a modest increase in executable size, since the line table information is encoded in a very compact way and contains a lot less information than a full debug build would have.

Using -gmlt is also very useful if you want a mechanism to automatically print stack traces on crashes, since the signal handler code can walk back up the stack using frame pointers and then attribute each frame in the stack to a specific file and line of code. If you are using a stripped binary without debug information, the best a stack trace printer can do is print which exported functions are in the stack trace, without file names or line numbers.

I recently learned about a fairly obscure Clang flag that is critical for generating profiles for code built with -gmlt. The flag is called -fdebug-info-for-profiling. The Clang documentation mentions this flag in passing, but in my opinion doesn't do enough to call out how important it is. Essentially, when compiling with -gmlt you get file and line numbers for all instruction addresses, but that's it. A single line of code may contain many function calls, especially after inlining, so a file and line number alone can be ambiguous. The -fdebug-info-for-profiling option augments the function and line tables from -gmlt so they also include full information about which specific function any instruction was generated from, and so that the profile tooling can see through inlining. The original LLVM commit shows that adding this option only increases the size of the DWARF data by a small amount compared to -gmlt alone. Using this option in conjunction with -gmlt makes generated PGO or FDO profiles better, especially with respect to inlined code.
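As an illustration of why line numbers alone can be ambiguous, here is a small sketch in which three calls share one source line. All function names are hypothetical. The build steps in the comment roughly follow the sampling profiler workflow described in the Clang users manual; the file names, the main.cc driver, and the availability of create_llvm_prof (from the separate AutoFDO tools) are assumptions about your setup.

```cpp
// Possible build steps (pixel.cc is this file; main.cc is a hypothetical
// driver that calls adjust_pixel on representative input):
//   clang++ -O2 -gmlt -fdebug-info-for-profiling pixel.cc main.cc -o pixel
//   perf record -b ./pixel
//   create_llvm_prof --binary=./pixel --out=pixel.prof
//   clang++ -O2 -gmlt -fdebug-info-for-profiling \
//       -fprofile-sample-use=pixel.prof pixel.cc main.cc -o pixel

// Small helpers that an optimizing compiler is likely to inline.
static inline int scale(int v, int factor) { return v * factor; }
static inline int bias(int v) { return v + 16; }
static inline int clamp_to_byte(int v) {
  return v < 0 ? 0 : (v > 255 ? 255 : v);
}

int adjust_pixel(int v, int factor) {
  // Three calls share this one source line. A file and line number alone
  // cannot say which of the calls a sampled instruction belongs to.
  return clamp_to_byte(bias(scale(v, factor)));
}
```

If the samples landing on that line cannot be separated by call, the profile cannot tell the optimizer which of the three helpers is actually hot.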

To be clear, you only need this option if you're using -gmlt for the purpose of profile generation. If you're using -gmlt just to print stack traces on crashes, you don't need it.