Alex Reinking

I Finished my Ph.D.!

2022-12-22T00:12:00-08:00

It took six and a half years, but I'm happy to announce that I finally got my Ph.D. in Computer Science. Hooray! ~~As I write this, I'm starting a short-term post-doc at MIT to wrap up a few research projects, but I'm actively applying to jobs. If you have a role that aligns with my skill set, please let me know!~~

Update: as of Feb 21, 2023, I am now working at Qualcomm AI Research!

My thesis is titled The Design and Implementation of User-Schedulable Languages. A copy is available from UC Berkeley's website, but I will update it here if the need for errata should ever arise.

I'm happy to have substantially more time than I did during my thesis crunch. If you've been following my work at all, please reach out! My contact info is in the sidebar. I'd love to talk about my research, potential collaborations, software engineering, or whatever else.

I also plan to start blogging somewhat consistently; I would love to expand the "practical considerations for DSL design in the C ecosystem" chapter of my thesis, and this blog seems like a good place to do it. These plans include my somewhat popular "CMake Without the Agonizing Pain" series.

Finally, I want to take a moment to thank the many people who helped me throughout grad school: first of all, my advisor Jonathan Ragan-Kelley. I'm honored to be your first graduating student! I'd also like to thank my amazing colleagues and collaborators: Gilbert Bernstein, Yuka Ikarashi, Hasan Genc, Daan Leijen, Ningning Xie, Leonardo de Moura, Dougal Maclaurin, Adam Paszke, Alexey Radul, Ankush Desai, Shaz Qadeer, and Ruzica Piskac.

I especially want to thank the Halide team, most notably Steven Johnson, Andrew Adams, Shoaib Kamil, and Zalman Stern, for entrusting me with so much of the project.

And of course, a big thanks to all my friends and family in Berkeley, Minneapolis, Boston, and beyond who kept me sane all these years. You know who you are and that you mean the world to me 🙂.

Until next time, happy holidays and a happy new year!

Exocompilation for Productive Programming of Hardware Accelerators

2022-06-09T12:00:00-07:00

Published at PLDI 2022.

Link to paper

High-performance kernel libraries are critical to exploiting accelerators and specialized instructions in many applications. Because compilers are difficult to extend to support diverse and rapidly-evolving hardware targets, and automatic optimization is often insufficient to guarantee state-of-the-art performance, these libraries are commonly still coded and optimized by hand, at great expense, in low-level C and assembly. To better support development of high-performance libraries for specialized hardware, we propose a new programming language, Exo, based on the principle of exocompilation: externalizing target-specific code generation support and optimization policies to user-level code. Exo allows custom hardware instructions, specialized memories, and accelerator configuration state to be defined in user libraries. It builds on the idea of user scheduling to externalize hardware mapping and optimization decisions. Schedules are defined as composable rewrites within the language, and we develop a set of effect analyses which guarantee program equivalence and memory safety through these transformations. We show that Exo enables rapid development of state-of-the-art matrix-matrix multiply and convolutional neural network kernels, for both an embedded neural accelerator and x86 with AVX-512 extensions, in a few dozen lines of code each.

How to Use CMake Without the Agonizing Pain - Part 2

2021-05-31T21:37:00-07:00

Welcome back to Part 2 of this series! I was very happy to see the warm reception Part 1 got over on /r/cpp. Before we get started, I thought I would take this opportunity to clarify a couple of points about this series.

First, this series is not a tutorial, at least not in the traditional sense. My hope with this project is to show you how to reason about CMake so that it feels intuitive. I want readers to see the big picture and to develop a taste for quality build code. Still, there will be some space dedicated to exploring specific effective practices and pointing out common mistakes, superseded features, and so on, but all with an eye towards understanding why.

Second, while I complained about the ocean of bad CMake resources, I forgot to recognize the handful of good resources that have taught me well. I have added a list of these resources to the end of Part 1.

Today, I'd like to talk about what you should expect from a CMake build, and some common pitfalls that violate these expectations. Not every CMake project you encounter will meet these criteria. I would encourage you to begin a friendly dialogue with the maintainers of non-conforming projects to see if they can be fixed (and, in the spirit of open source, try opening a PR!).

Expect vanilla builds to work

I'm going to make a bold claim, here: it should be possible to build any CMake project using any generator with the following sequence of commands, assuming all its dependencies are installed to system locations:

# For a single-configuration generator:
$ cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
$ cmake --build build
$ cmake --install build --prefix /path/to/wherever

# For a multi-configuration generator:
$ cmake -S . -B build
$ cmake --build build --config Release
$ cmake --install build --config Release --prefix /path/to/wherever

Furthermore, if the code is standards-compliant and platform-independent, this sequence should work with any compiler on any operating system.

Pitfall: unnecessary flags and settings

Obviously, if you're building a Linux-only tool that depends on GNU extensions, then you will need GCC or Clang. Unfortunately, many CMake builds assume too much about the environment or toolchain and inject optional, compiler-specific flags into their builds. Often, they provide no way to disable them. Such projects might needlessly fail on a different compiler or even a different version of the same compiler used by the author.

The most common example is adding -Werror unconditionally. The meaning of -Wall changes across compiler versions, so while this code might work for you today, it is at high risk of bit-rotting:

# BAD: don't do this!
target_compile_options(target PRIVATE -Wall -Werror)

For a subtler example, both GCC and Clang provide warning flags for missing uses of the C++11 override keyword. On GCC 5.1 and newer, it's -Wsuggest-override and on Clang 10 and below the check is split between two flags: -Winconsistent-missing-destructor-override and -Winconsistent-missing-override. Providing a Clang-only flag to GCC will throw an error, and providing the GCC-only flag to Clang will produce a warning that may be upgraded to an error if -Werror is also specified. Thus, if you naively write

# BAD: don't do this!
target_compile_options(target PRIVATE -Winconsistent-missing-override)

then your build will break with GCC! If you add -Wsuggest-override like this, then your build will break with -Werror on Clang 10! Ask yourself: do you really want to track warning flag compatibility across compiler vendors and versions? Is that a good use of your time?

I'm here to tell you that you don't want to, and that it's a waste of time. You can save yourself a lot of hassle if you only include firm build requirements in the CMakeLists.txt. Your code will build without any warnings enabled, so they don't belong there. In the past, you would have needed to create a toolchain file or at least guard these settings with the appropriate checks and option() calls to disable them. However, since CMake 3.19, you can add these to a preset. Create a file named CMakePresets.json next to your CMakeLists.txt with these contents:

{
  "version": 1,
  "cmakeMinimumRequired": {
    "major": 3,
    "minor": 19,
    "patch": 0
  },
  "configurePresets": [
    {
      "name": "gcc",
      "displayName": "GCC",
      "description": "Default build options for GCC",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build",
      "cacheVariables": {
        "CMAKE_CXX_FLAGS": "-Wsuggest-override"
      }
    },
    {
      "name": "clang",
      "displayName": "Clang",
      "description": "Default build options for Clang",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build",
      "cacheVariables": {
        "CMAKE_CXX_FLAGS": "-Winconsistent-missing-override -Winconsistent-missing-destructor-override"
      }
    }
  ]
}

Then someone (an end-user, CI, you) can use your preset like so:

$ cmake --preset=gcc -DCMAKE_BUILD_TYPE=Release
$ cmake --build build

Presets will fundamentally change the way people work with CMake and share their optional (but desired) build settings with users. They also significantly reduce the risk of your build breaking with a different compiler or version. Remember: it is much easier to write a correct build by keeping your CMakeLists.txt minimal and writing an opt-in preset than by checking all the relevant factors (compiler vendor, version, active language, etc.) before adding a flag.

To really drive this point home, this code safely adds -Wsuggest-override. It should burn your eyeballs:

# My eyes! The goggles do nothing!
option(MyProj_ENABLE_WARNINGS "Compile MyProj with warnings used by upstream" OFF)
if (MyProj_ENABLE_WARNINGS)
   # keep line width low
   set(is_clang "$<COMPILE_LANG_AND_ID:CXX,Clang>")
   set(is_gcc "$<COMPILE_LANG_AND_ID:CXX,GNU>")
   set(ver "$<CXX_COMPILER_VERSION>")

   target_compile_options(
     target
     PRIVATE
       "$<$<AND:${is_clang},$<VERSION_GREATER_EQUAL:${ver},11>>:-Wsuggest-override>"
       "$<$<AND:${is_gcc},$<VERSION_GREATER_EQUAL:${ver},5>>:-Wsuggest-override>"
   )
endif ()

This sort of thing does not scale. If a preset doesn't work for an end-user, they can override it piecemeal at the command line. On the other hand, incorrect CMake code inflicts an error with no recourse but to patch your build.

Don't forget that other people besides your core development team will use your build. Package maintainers, consumers of your library (if applicable), and power users looking to be on the cutting edge will all want to build your package with a slightly different set of flags, compilers, versions, and operating systems. The path of least resistance (using presets) both makes your CMakeLists.txt easy to maintain for you and easy to consume for all your users.

Pitfall: bad dependency management

If you use only well-behaved CMake packages with find_package, this will largely take care of itself. Unfortunately, many CMake packages are not well-behaved. To keep this article focused, strategies for wrangling bad CMake (and non-CMake) dependencies will be covered in Part 3.

Expect incremental builds to work

Some particularly pathological projects require you to run CMake twice up front in order to get a correct build. This should never be the case and is covered by the one-configure build recipe above. It's also fairly uncommon.

However, disappointingly many projects require you to manually re-run CMake before any incremental build. The whole point of CMake is to generate faithful implementations of the abstract build model. One configure step ought to be all you need. After the first run, the build tool (e.g. make) should know when it needs to re-run CMake.

The technical term here is "idempotence": running the CMake configure step twice with the same inputs should be no different from running it once. Any other behavior is unfriendly to developers and should be considered a bug with the project. (Note: Xcode has some architectural limitations that make this impossible; see this Discourse discussion for more details)

Pitfall: terrifying cache behavior

There are several ways you can unintentionally break idempotence. If you use set(CACHE), there's a good chance your build is broken. Here's an example from a bug report I filed recently. If you were wondering what the "agonizing pain" I've been talking about is, look no further. This is the kind of thing nobody should ever need to know in the first place. Suppose you have the following:

cmake_minimum_required(VERSION 3.20)
project(test LANGUAGES NONE)

set(var 1)
set(var 2 CACHE STRING "")

message(STATUS "var = ${var}")

What does it print? Let's see:

$ cmake -S . -B build
-- var = 2

What happened here? Really take a minute to think about what the underlying rule could be. Now let's try running the same command again, without changing absolutely anything:

$ cmake -S . -B build
-- var = 1

Whatever you thought the rule was, I bet you did not expect this. Why is it 1, now?

I'll fill you in: when CMake runs, it loads the cache into a special, global scope. When set(CACHE) runs, it checks to see if there is already an entry in the cache. If not, then it creates one and deletes the normal variable binding to expose the newly cached value. Otherwise, it won't do anything at all (unless FORCE is specified). Don't ask me how it works if there are multiple variables of the same name in nested directory or function scopes. I'm not sure I even want to know.
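One way to watch this rule in action is the `$CACHE{var}` syntax (available since CMake 3.13), which reads the cache entry directly and ignores any normal variable. The sketch below extends the example above; the two message lines and their labels are additions for illustration and are not part of the bug report.

```cmake
cmake_minimum_required(VERSION 3.20)
project(test LANGUAGES NONE)

set(var 1)                  # normal variable in the current scope
set(var 2 CACHE STRING "")  # creates the cache entry only if it is missing

# ${var} prefers a normal variable and falls back to the cache entry.
message(STATUS "var        = ${var}")
# $CACHE{var} always reads the cache entry, bypassing normal variables.
message(STATUS "cache(var) = $CACHE{var}")
```

Going by the rule described above, a fresh build directory should report the cached value on both lines, while a second configure should show the normal variable shadowing the cache entry on the first line only.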

Now let's try to set the cache variable at the command line:

$ cmake -S . -B build -Dvar=3
-- var = 3

What happened here?! Neither value mattered! The normal variable won before, but now set(CACHE) overwrote it? Why? Do command-line variables have their own, special, innermost scope? Are they immutable?

Well, here's the answer: setting a cache variable at the command line with no type deletes the type that was already established (what?), and so set(CACHE) will add the type when it runs (ok...), and when this happens it will also delete the normal binding as if the variable did not exist at all (what?!), and that isn't even documented behavior (WHAT?!). If you use -Dvar:STRING=3 instead, then it will print 1.

Here's what the docs do have to say about this:

If the cache entry does not exist prior to the call or the FORCE option is given then the cache entry will be set to the given value. Furthermore, any normal variable binding in the current scope will be removed to expose the newly cached value to any immediately following evaluation.

It is possible for the cache entry to exist prior to the call but have no type set if it was created on the cmake(1) command line by a user through the -D<var>=<value> option without specifying a type. In this case the set command will add the type.

Nowhere in that last sentence does it say it will delete the normal variable binding when the type is not set. This whole behavior is downright byzantine.

Thankfully, the devs have implemented a policy fix that will ship in CMake 3.21! With CMP0126 enabled, the set command will not touch normal variables, meaning that they always "win". This is how option() works and is yet another reason to use the newest CMake. Until then, I believe the best practice is to only cache an existing normal variable, guarded by a check if it already exists:

if (NOT DEFINED var)
  # compute default value for var
  set(var "${var}" CACHE <TYPE> "doc string")
endif ()

This ensures that the value of var is consistent no matter the state of the cache. After CMake 3.21, you may safely set the cache variable directly and to any default value.
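For comparison, here is a minimal sketch of the earlier example with the new policy turned on. It assumes CMake 3.21 or newer is installed; requiring 3.21 as the minimum version is what sets CMP0126 to NEW here.

```cmake
cmake_minimum_required(VERSION 3.21)  # 3.21 or newer sets CMP0126 to NEW
project(test LANGUAGES NONE)

set(var 1)
# Under CMP0126 NEW, this creates the cache entry if it is missing
# but leaves the normal variable above untouched.
set(var 2 CACHE STRING "")

message(STATUS "var = ${var}")
```

With the policy enabled, the normal variable should win on the first configure and on every later one, so repeated runs agree with each other.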

Pitfall: configure-step dependencies

Another common cause of build issues is to fail to declare a dependency for the configure-step. If your project makes heavy use of execute_process or otherwise reads and writes files during the configure step, those files should be added to the CMAKE_CONFIGURE_DEPENDS directory property, like so:

# both `.` and `file` are relative to current source directory
set_property(DIRECTORY . APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "file")

This will cause the generated build to check those files and re-run CMake if they have changed. Some commands, like configure_file, are smart enough to update this property automatically. Others, like file(COPY), are not; prefer configure_file over other "equivalent" commands when you can. Check the documentation (or better yet, write a test case) if you are ever unsure.
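As a concrete sketch of the pattern, suppose the configure step reads a version number from a text file. The file name `version.txt` and the variable names are hypothetical, chosen only for illustration.

```cmake
# Hypothetical input file living next to this CMakeLists.txt.
set(version_file "${CMAKE_CURRENT_SOURCE_DIR}/version.txt")

# Read the first line of the file during the configure step.
file(STRINGS "${version_file}" my_version LIMIT_COUNT 1)

# Tell the generated build to re-run CMake whenever the file changes.
set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${version_file}")

message(STATUS "Configuring version ${my_version}")
```

Without the set_property call, editing version.txt would leave the build using the stale value until someone re-ran CMake by hand.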

Pitfall: file globbing

This same problem also affects globbing for source files in CMake:

# WARNING: this code breaks idempotence
file(GLOB sources "*.cpp")
add_executable(my_app ${sources})

If you have this code, then adding a new .cpp file to the directory will not trigger a re-configure in an incremental build. As we discussed above, this is bad behavior because it forces a developer to re-run CMake as opposed to just the build tool.

One solution is to use CONFIGURE_DEPENDS, which will cause the generated build to re-evaluate the globs and re-configure if anything changes. This code correctly sets dependencies.

# This code is fine, but with caveats.
file(GLOB sources CONFIGURE_DEPENDS "*.cpp")
add_executable(my_app ${sources})

However, the developers do not promise that it will work on every generator. Here's what the documentation says:

Note: We do not recommend using GLOB to collect a list of source files from your source tree. If no CMakeLists.txt file changes when a source is added or removed then the generated build system cannot know when to ask CMake to regenerate. The CONFIGURE_DEPENDS flag may not work reliably on all generators, or if a new generator is added in the future that cannot support it, projects using it will be stuck. Even if CONFIGURE_DEPENDS works reliably, there is still a cost to perform the check on every rebuild.

This is not a theoretical concern: the immensely popular Ninja generator has a bug until 1.10.2 (which at time of writing is the newest one). Here is a link to a GitHub issue about this.

I understand this is controversial, but given that the CMake maintainers are so explicit about not globbing, I strongly believe the best thing to do is to list source files explicitly. In general, it is a good idea to avoid doing things that are explicitly unsupported because when you run into problems, the maintainers will simply tell you to fix your code.

Besides, manually listing source files is typically only annoying at the start of a new project, when the code structure is much more fluid. In the steady state, file lists change only occasionally, and the pain of updating a file list is not very great. You can (and should) typically split up your file lists using target_sources and add_subdirectory. That way no one CMakeLists.txt gets too long.
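Here is a rough sketch of what that split might look like. The directory and file names are hypothetical, and the two files are shown in one listing separated by comments.

```cmake
# --- CMakeLists.txt (top level) ---
cmake_minimum_required(VERSION 3.20)
project(my_app LANGUAGES CXX)

add_executable(my_app)                   # sources are attached below
target_sources(my_app PRIVATE main.cpp)
add_subdirectory(parser)                 # hypothetical subdirectory

# --- parser/CMakeLists.txt ---
# Relative paths are resolved against this directory
# (policy CMP0076, CMake 3.13 and newer).
target_sources(my_app PRIVATE lexer.cpp parser.cpp)
```

Each directory owns its own short file list, so adding a source file means editing only the nearest CMakeLists.txt, which in turn triggers the re-configure automatically.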

Update: a note on performance

An earlier version of this article repeated the old saw that globs are slow. In response to the discussion on Reddit, I ran some tests myself and got a mixed bag. Here's a table of my results:

Disk	Filesystem	OS	Generator	N	Time (s)
Samsung SSD 970 EVO	ext4 (WSL)	Ubuntu 20.04 (WSL)	Ninja	1000	0.0069
SanDisk SDSSDHII	ext4	Ubuntu 20.04	Ninja	1000	0.0162
SanDisk SDSSDHII	NTFS	Windows 10	Ninja	1000	0.0364
Samsung SSD 970 EVO	ext4 (WSL)	Ubuntu 20.04 (WSL)	Ninja	10000	0.0481
SanDisk SDSSDHII	ext4	Ubuntu 20.04	Ninja	10000	0.0594
SanDisk SDSSDHII	NTFS	Windows 10	VS 2019	1000	0.0731
Samsung SSD 970 EVO	NTFS	Windows 10	Ninja	1000	0.0832
Samsung SSD 970 EVO	NTFS	Windows 10	VS 2019	1000	0.1012
Samsung SSD 970 EVO	NTFS (3g)	Ubuntu 20.04	Ninja	1000	0.1146
SanDisk SDSSDHII	NTFS (3g)	Ubuntu 20.04	Ninja	1000	0.1170
SanDisk SDSSDHII	NTFS (9p)	Ubuntu 20.04 (WSL)	Ninja	100	0.2062
Samsung SSD 970 EVO	NTFS (9p)	Ubuntu 20.04 (WSL)	Ninja	100	0.2268
SanDisk SDSSDHII	NTFS	Windows 10	Ninja	10000	0.2743
Samsung SSD 970 EVO	ext4 (WSL)	Ubuntu 20.04 (WSL)	Ninja	100000	0.3712
SanDisk SDSSDHII	ext4	Ubuntu 20.04	Ninja	100000	0.4383
SanDisk SDSSDHII	NTFS	Windows 10	VS 2019	10000	0.4710
Samsung SSD 970 EVO	NTFS	Windows 10	Ninja	10000	0.5616
Samsung SSD 970 EVO	NTFS	Windows 10	VS 2019	10000	0.8158
SanDisk SDSSDHII	NTFS (3g)	Ubuntu 20.04	Ninja	10000	1.1119
Samsung SSD 970 EVO	NTFS (3g)	Ubuntu 20.04	Ninja	10000	1.4825
SanDisk SDSSDHII	NTFS (9p)	Ubuntu 20.04 (WSL)	Ninja	1000	1.9585
Samsung SSD 970 EVO	NTFS (9p)	Ubuntu 20.04 (WSL)	Ninja	1000	2.1879

From my testing, it seems ext4 is a remarkably resilient filesystem. I think there is no performance argument to be made against globbing on ext4. It's also pretty clear that you should not use ntfs-3g, or especially the WSL2 NTFS 9p FUSE drivers. Build on ext4 and copy the outputs to an NTFS volume if need be. VS 2019 is slower than Ninja, but even at 10000 source files the scan took under a second, so this is likely not a problem in absolute terms.

For some strange reason, NTFS was slower on my NVMe drive than on my SATA drive. I tested both drives with winsat disk -drive X, and it showed my NVMe drive is significantly faster. Maybe there's some driver weirdness here since the fastest result for N=1000 was (virtualized!) ext4 on that drive.

I have published the Python script I used for testing this here. There's a GitHub Actions workflow that runs the script on Windows, macOS, and Linux for N=1000. I expected the virtualized disks on GitHub Actions to be slow, but they were actually plenty fast, with results very similar to what I reported above.

I am curious to hear reports from readers and from the Meson and Ninja developers to see if they have more data on why globs are too slow for their systems.

Expect standard CMake variables to be honored

A great number of variables in CMake are designed to be set externally. Perhaps the most famous of these is CMAKE_CXX_FLAGS and its configuration-specific variants CMAKE_CXX_FLAGS_DEBUG, CMAKE_CXX_FLAGS_RELEASE, etc. Do not touch these variables!

As a baseline, do not touch any standard variables if they are already defined when your build runs. Move your preferred defaults to presets or use the techniques above to update the cache safely. On older CMake versions, they may be set in a toolchain file as an alternative to presets. A full list of variables may be found in the documentation, but most start with CMAKE_. Notable exceptions include BUILD_SHARED_LIBS and <PackageName>_ROOT.
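If a project default really must live in the CMakeLists.txt, one way to respect that baseline is to guard it, as in this sketch. The choice of CMAKE_CXX_VISIBILITY_PRESET and the value `hidden` are only an illustration, not a recommendation from this series.

```cmake
# Only supply a project default when the user, a preset, or a toolchain
# file has not already chosen a value for this standard variable.
if (NOT DEFINED CMAKE_CXX_VISIBILITY_PRESET)
  set(CMAKE_CXX_VISIBILITY_PRESET hidden)
endif ()
```

The check passes for both normal and cache variables, so a value given with -D at the command line or in a preset is left alone.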

In many cases, there are better ways to set a build requirement than through clobbering a reserved variable. For instance, if you want to set the C++ version then you should use target features, rather than setting CMAKE_CXX_STANDARD or (gasp!) editing CMAKE_CXX_FLAGS.

target_compile_features(my_exe PRIVATE cxx_std_14)
target_compile_features(my_lib PUBLIC cxx_std_17)  # PUBLIC so that linkees use >= C++17

Setting the standard requirement as a PUBLIC (really INTERFACE) property on a library will propagate this to linkees even after exporting my_lib for use in a find_package module. We'll talk more about packaging and being a good dependency in a few weeks.

Some libraries (like abseil) change their ABI depending on the active standard version. If you have to do this, then you can encode the requirement by checking CMAKE_CXX_STANDARD to pick the correct cxx_std_N feature to act as a usage requirement:

# C++14 or greater is required for my_lib
if (CMAKE_CXX_STANDARD GREATER 14)
  target_compile_features(my_lib PUBLIC cxx_std_${CMAKE_CXX_STANDARD})
else ()
  target_compile_features(my_lib PUBLIC cxx_std_14)
endif ()

Either way, your users can set a higher CMAKE_CXX_STANDARD value at the command line. This empowers your users to ensure ABI compatibility when using experimental support for draft C++ standards when building from source. If you set CMAKE_CXX_STANDARD unconditionally, you take this control away from your users.

Conclusion

This is what you should take away from this post:

Your CMakeLists.txt file should be minimal and include only firm build requirements; everything else should be opt-in (preferably in a preset). Warning flags are not firm requirements.
The configure step of your build should never need to run twice in a row with the same settings, and incremental builds should not require the user to manually re-run CMake. This means using CONFIGURE_DEPENDS on globs or, better yet, avoiding them.
Be careful when setting a cache variable, even without FORCE, as it might remove a normal variable unpredictably. Before CMake 3.21 (unreleased), don't set(CACHE) without confirming the variable does not exist.
Avoid touching standard CMake variables; prefer target properties or move such settings to the presets (at least make your edits opt-in somehow). Stop thinking in terms of flags and start thinking in terms of goals. It's very common for novice (or even adept) CMake programmers to work themselves into an XY problem and try to shoehorn in a compiler-specific setting that has already been abstracted.

Next time, we'll talk about the target model and how to manage dependencies in modern CMake. Until then, join the conversation here on Reddit!

How to Use CMake Without the Agonizing Pain - Part 1

2021-05-22T21:37:00-07:00

When age fell upon the world, and wonder went out of the minds of men; when grey cities reared to smoky skies tall towers grim and ugly, in whose shadow none might dream of the sun or of spring's flowering meads; when learning stripped earth of her mantle of beauty, and poets sang no more save of twisted phantoms seen with bleared and inward-looking eyes; when these things had come to pass, and childish hopes had gone away forever, there was a man who travelled out of life on a quest into the spaces whither the world's dreams had fled. — H.P. Lovecraft

I spent the better part of my off-hours last year rewriting Halide's CMake build.

I knew CMake had a polarizing reputation, but I needed to make Halide work easily on Windows. The existing build didn't work right in CLion, it couldn't find its dependencies (except on CI, somehow), and it didn't produce usable packages. I figured I'd roll up my sleeves and get to work, and so I started where anyone else would: by Googling "CMake tutorial".

I was nearly stricken blind.

There is so much bad information about CMake out there. It's pervasive. It's high in the search results. Just about every StackOverflow answer is out of date, wrong, or both. Heeding any of this advice will send you and your project careening down a road to madness, paved into the earth by the sweat and tears of those who have tried to port a project that hard-codes a library path.

If you don't want your builds to break, and your crops to die, you should learn to use CMake properly. This is the first in a series of blog posts that will attempt to teach you to use CMake effectively. My earlier post about whether CMake is a build system could be considered part 0 of this series.

So without further ado, let's talk about the most basic decision to make: what version of CMake to use in the first place.

Picking a CMake Version

If you're writing an open source project, you most likely want to make your code available to as many users as possible. So you might assume that you want to use a very old CMake version to build your project. This is nonsense. Recent versions of CMake are available absolutely everywhere. Your build's users are technical: C++ developers, not laypeople. They can upgrade CMake if for some reason they haven't yet. For every major platform, there are easy ways to get a recent CMake version installed and kept up to date. Don't believe me? See the table below.

OS	Arch	Source	Version	Update Process
Windows 10	x86, amd64	Visual Studio 2019	3.19	Updated occasionally through VS installer
Windows 10	x86, amd64	Chocolatey	newest	`choco upgrade`
Windows 10	x86, amd64	Kitware MSI	newest	Manual
Windows 10	x86, amd64	Kitware ZIP	newest	Manual (no installer)
macOS 10.14+	universal	Homebrew	newest	`brew upgrade`
macOS 10.10+	universal	Kitware DMG	newest	Manual
macOS 10.10+	universal	Kitware TGZ	newest	Manual (no installer)
Ubuntu 16.04+, many other distros	x86, amd64, aarch64, armhf, ppc64el, s390x	snap	newest	Fully automatic
Ubuntu 16.04+	x86, amd64	Kitware APT	newest	`sudo apt upgrade`
Ubuntu 20.04+	x86, amd64, aarch64, armhf	Kitware APT	newest	`sudo apt upgrade`
Ubuntu 20.04 LTS	x86, amd64, aarch64, armhf, ppc64el, s390x	Ubuntu APT	3.16.3	`sudo apt upgrade` (security only)
Linux (Generic)	amd64, aarch64	Kitware TGZ	newest	Manual (no installer), only depends on glibc6
ALL	x86, amd64, aarch64, armhf, ppc64el, s390x	pip	newest	`pip install -U cmake`

I can't stress this enough: Kitware's portable tarballs and shell script installers do not require administrator access. CMake is perfectly happy to run as the current user out of your downloads directory if that's where you want to keep it. Even more impressive, the CMake binaries in the tarballs are statically linked and require only libc6 as a dependency. Glibc has been ABI-stable since 1997. It will work on your system.

We on the Halide team use the CMake 3.20.2 tarballs from Kitware on a variety of aging and new ARM hardware for our build infrastructure. We used to build CMake from scratch, which was a little painful, but since upstream started providing ARM binaries, it's been trivial.

There are good reasons for using modern CMake versions, too. Beyond broader compiler and platform compatibility, newer CMake versions offer many more features to help keep your builds simple and expressive. One of the best examples is CMake's CUDA support. It has gone through several evolutions from a find module to a full first-class language. Working with CUDA prior to CMake 3.17 is about as much fun as eating glass. The move away from package variables to targets with transitive, propagating properties has turned ugly, error-prone build scripts into simple, declarative build descriptions. We will touch on many of these features in the next few parts.

So there is no problem with taking a minimum version of 3.20 (the latest at time of writing). Maybe it's worth taking a minimum of 3.16 just because Ubuntu 20.04 LTS is such a hold-out, but anything earlier than that is plain masochism.

Another hard requirement is that you must never use a version of CMake older than your compiler. Older versions of CMake won't somehow know how to work with a compiler that was released later in time, and the command line defaults for GCC, Clang, and other major compilers change frequently. The most basic example of this is the default language version and set of supported language versions. Other changes include the wording of errors and warnings that CMake matches to detect compiler capabilities.

Thus, if you intend to use C++17 on Linux, you will need at least Clang 5 (released Sep 7, 2017) or GCC 7 (released May 2, 2017). You therefore cannot use a minimum CMake version prior to 3.9.3 (released Sep 20, 2017). Versions prior to 3.8 (released April 10, 2017) didn't even understand 17 as a possible value of the CXX_STANDARD target property, so there was no correct way to enable it. Rather than doing this tedious and ultimately pointless work of determining the oldest potentially compatible versions, just use the newest.

Validating Your Minimum Version

No matter what minimum version you pick for whatever reason, it would be a major mistake to simply set cmake_minimum_required(VERSION 3.X) and call it a day. You must also test with the actual CMake 3.X release on your local development machine and on CI.

Why? Simply because the policy mechanism ensures backwards compatibility, not forwards compatibility. If you use a more recent CMake version, nothing will stop you from using a feature that is too new for the declared minimum version. This is very, very common, too. Here are three examples off the top of my head that have bitten me:

You might use a generator expression that was not in the old CMake version. CMake will not even try to warn you about this, and many common and useful generator expressions were introduced later than you think.
You might rely on newer features of commands unintentionally. In particular, CMake 3.18+ searches lib64 directories when using HINTS arguments to find_library, but older ones don't. So code for old versions has to check CMAKE_SIZEOF_VOID_P and add those paths to HINTS manually. I don't think this is documented; I bisected to find that version number.
CMake's find modules change behavior pretty frequently. Old versions might not understand a newer library version's package layout.

So another basic rule is to never declare a minimum version lower than the one you actually test your build against. I have seen projects in the wild that claim compatibility with ridiculously old versions of CMake, like 2.6. Not only is it extremely unlikely that those builds actually work with 2.6, newer versions of CMake are soon dropping compatibility with versions before 2.8.12. So this "increased" compatibility will in fact cost you users who are doing the right thing by keeping up to date.

If you're setting up a CI pipeline, you should test your build with both the absolute newest version of CMake, and the minimum required version. This will allow you to very quickly catch backwards compatibility bugs and make upgrading the version a breeze. I do this on GitHub Actions with the jwlawson/actions-setup-cmake action. You can see such a workflow here on tinyxml2, whose CMake build I recently helped modernize.

Conclusion

These are the most important lessons from this post:

Use the most recent CMake version. It is trivial to install and keep up to date. If you must pick an older version, do it for a logical reason, not because you're copying some ancient StackOverflow answer that set 3.5 as a minimum.
Use a version of CMake at least as recent as your compiler version.
Always test your build with the actual CMake version you're taking as a minimum.

In part 2, we'll talk about the contract between a CMake build and its many consumers.

Until then, join the conversation here on Reddit!

Addendum

Distribution minimum versions

Since publishing this article, I have heard from several readers that they cannot upgrade their minimum versions because some particular Linux distribution (e.g. Ubuntu 18.04 LTS, RHEL 7, etc.) packages an older version of CMake, and so they must accept that version.

I stand by what I wrote. On the one hand, if the maintainers independently want to include your package, then it's up to them to figure out how to use a newer CMake version in their build process. If that means bootstrapping a newer CMake version from source, so be it.

On the other hand, if you want to ask the maintainers to include your package, and they won't let you use a newer version, you should instead ask yourself why your package needs to be included in the base distribution. There are many viable distribution methods on Linux these days. You could host your own APT or RPM repository; you could release your package on pip or snap. Agreeing to downgrade for the sake of one distro harms all of your users.

Lowest common denominator thinking is toxic to the progress of the C and C++ ecosystems. Distribution maintainers should periodically update CMake, even on LTS releases. CMake is incredibly backwards compatible, but when there are issues, there are also many recourses for a distribution maintainer: they can package multiple CMake versions, they can patch a problematic package (and maybe upstream the patch, which is better for everyone), or they can patch their distribution of CMake. The vcpkg team occasionally has to rewrite entire build systems for projects.

Resources

At the beginning of the article, I complained that there are no good learning resources. Fortunately, this isn't quite true. So far as I know, the best places to get high quality advice for writing CMake code are these:

The CMake Discourse forum. The actual developers hang out and answer questions here.
The #cmake channel on the CppLang Slack. This is a very friendly community for CMake users to think through build issues, ask beginner to intermediate level questions, and share wisdom.
The book "Professional CMake" by Craig Scott. Craig is a volunteer maintainer of CMake, and he sells his book for $30 through his consulting business. This is the most comprehensive and clearly written reference guide for CMake. Even better, your purchase also includes updates to new editions as the book is updated (and it is updated frequently). This is a must-have if CMake is part of your job; you should convince your employer to purchase copies for your team.

If you don't want to buy Professional CMake or can't afford it, here are some good free resources on the web:

Craig gave a talk at CppCon 2019, "Deep CMake for Library Authors" that covers issues including symbol visibility, library versioning, writing install rules, and RPATH pitfalls.
Deniz Bahadir gave a pair of talks called "More Modern CMake" and "Oh No! More Modern CMake" at Meeting C++ 2018 and 2019, respectively. These talks use CMake 3.12-3.14, so there are some things that are out of date, but his explanation of the modern CMake targets system is very good. We'll talk about dependencies soon, but I disagree with the approach here, which sets IMPORTED_GLOBAL on targets created by find_package calls. These talks are particularly valuable for showing the old, painful way of doing things next to the new(er), less-painful way.
Robert Schumacher is the lead developer of vcpkg and has a lot of experience with dealing with every type of problematic build system. He's also a great presenter and generally a smart guy, so I wholeheartedly recommend his talks:
1. "Don't Package Your Libraries, Write Packagable Libraries! Part 1".
2. "Don't Package Your Libraries, Write Packagable Libraries! Part 2". Note: I disagree with his use of globbing in CMake, but his point about projects being globbable is good.
3. "How to Herd 1,000 Libraries"

Building a Faster Triangular Solver than MKL

2021-03-20T21:37:00-07:00

A significant part of my research involves investigating algorithms with interesting properties and then trying to optimize them to fully understand how they work. One recent, and fairly successful, exploration was into triangular substitution solvers. In this blog post, I'm going to explain the algorithm and an unconventional recursive approach that broadly abstracts the design space for possible optimizations.

The end result is a lower-triangular (forward substitution) linear equation solver that beats MKL, at least on a (not too) simplified version of the problem. If you just want the sources and none of the story, they are available on GitHub.

For the rest of this article, I'm going to assume you know entry level matrix computations, i.e. how to multiply matrices and vectors, and how to do Gaussian Elimination.

BLAS and triangular solvers

(If you are already familiar with BLAS or with strsv you can skip this section.)

BLAS (Basic Linear Algebra Subprograms) is a specification of an interface of common linear algebra operations, such as (most famously) matrix multiplication, vector fused multiply-adds, and triangular substitution solvers. Most commonly, BLAS-es are implemented in Fortran because of its superior aliasing semantics (the restrict keyword is not necessary), but they are also available in C/C++ through the standard "cblas" interface.

The idea is that by specifying the interface and providing a reference implementation, hardware vendors can produce optimized libraries for each of their architectures. And boy, do they ever! There's a wealth of academic literature and millions of dollars of commercial effort put into optimizing these routines. Among the more notable implementations are OpenBLAS (which is based on GotoBLAS), ATLAS (which tries to automatically tune itself to your hardware), the AMD Optimizing CPU Libraries (aka AOCL), NVIDIA's GPU-based cuBLAS and NVBLAS (which are ridiculously fast), and most famously, there's MKL, which is widely regarded as the gold standard of x86 CPU BLAS implementations (well, at least Intel's x86, though this might be changing). This list isn't exhaustive; supercomputing vendors like Cray supply BLAS-es that are tuned to their hardware.

Now, as I explain in my CppCon 2020 talk, BLAS and high-performance libraries like it are fundamentally limited because they have to synchronize with main memory in between each library call. There's no effective way to fuse computations across stages. There are some C++ libraries, like Eigen and armadillo, that use templates to get some amount of fusion. However, their results are less consistent, and their optimizations are less dramatic (local fusion is no match for global reorganization) than using a full DSL designed for the task, like the Halide language I work on. More on Halide in a future post!

Still, most BLAS-es do a very good job of optimizing their routines. Matrix multiplication in particular is an excellent exercise for anyone interested in understanding machine performance because there are $O(n^3)$ floating-point operations (FLOPS) to schedule against only $O(n^2)$ data. This endows the problem with a very rich design space. In fact, here at UC Berkeley, it is the first assignment in our graduate parallel computing course. If you're interested, the homework materials are here. (By the way, I'm proud to say that while writing this article I learned that my work as a teaching assistant on the Spring 2020 edition of the course earned me an "Outstanding GSI Award" from the EECS department.)

The API that we're discussing today is strsv. The problem it solves is the matrix equation $Ax=b$ where $A$ is a square $n \times n$ matrix, and $b$ is a vector of length $n$. $A$ is assumed to be triangular, which allows fast solving because simple, direct substitution may be used. Here's a quick example; suppose we have the following equation:

$Ax = \begin{pmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 4 & 2 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = b$

Because $A$ is lower-triangular, we can immediately tell that $x_1 = 1$. We can very quickly eliminate $x_1$ from the other rows by multiplying $x_1$ by the first-column coefficient of each row and subtracting the result from the corresponding entry of $b$. So we'll subtract $3$ and $4$ from the second and third entries to get:

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 2 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \\ -3 \end{pmatrix}$

For a quick sketch of a proof of why this works, notice that each row operation is equivalent to a matrix multiplication. In this case, the matrices $R_1, R_2$ (below), applied to both sides of the equation (on the left, since matrix multiplication is not commutative), give us the equation we have above.

$R_1 = \begin{pmatrix} 1 & 0 & 0 \\ -3 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \; R_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -4 & 0 & 1 \end{pmatrix}$

That is, the equation $Ax = b$ has the same solution as $(R_2 R_1) A x = (R_2 R_1) b$.

In the final step, we eliminate the second column:

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}$

We can check this answer, too:

$Ax = \begin{pmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 4 & 2 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = b$

Hooray! In the next section we'll go over the algorithm in the abstract and write a naive implementation.

Solver algorithm and interface

So what does this look like as a formal algorithm? Well, what did we do on paper? We started by going across the columns, and then within each column, using the newly solved value in $x$ to update the unsolved part. As a "plain" English algorithm, it looks like this:

1. Solving: $Lx = b$
2. Set $x = b$
3. For each column $j$ of $L$:
   3.1. Set $x_j \leftarrow x_j / L_{jj}$.
   3.2. For each row $i$ in the column $j$, starting with $j+1$:
      3.2.1. Update $x_i \leftarrow x_i - x_j \cdot L_{ij}$

Now how do we turn this into code? For the sake of space (and my sanity writing and optimizing this stuff), we'll make the following simplifying assumptions:

The matrix $L$ is lower triangular.
The matrix $L$ has all $1$s on its diagonal. This lets us skip the division on line (3.1) above.
The matrix $L$ is stored in column-major order.
The matrix $L$ is stored in a large, dense array in natural order; the upper half might contain useful information (like an upper triangular matrix), so we cannot overwrite it or assume it to be zero.
The vector $b$ is stored in a normal array and may be overwritten with the solution $x$.
We're running on a single CPU core.

The naive translation under these assumptions into plain C is this:

void naive_solver(int n, float* L, float* x) {
  for (int j = 0; j < n; ++j) {
    for (int i = j + 1; i < n; ++i) {
      x[i] -= x[j] * L[i + n * j];
    } 
  } 
}
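As a quick sanity check, the $3 \times 3$ example from the previous section can be pushed through this routine. The sketch below repeats the naive solver so that it compiles on its own; the helper name solve_example is made up for illustration, and the matrix is written out column by column to match the column-major assumption.

```c
#include <assert.h>

/* Same loop nest as above: L is unit lower triangular and column-major,
   and x holds b on entry and the solution on exit. */
void naive_solver(int n, const float *L, float *x) {
  assert(n >= 0);
  for (int j = 0; j < n; ++j) {
    for (int i = j + 1; i < n; ++i) {
      x[i] -= x[j] * L[i + n * j];
    }
  }
}

/* Hypothetical helper: solves the worked 3x3 example and writes x to out. */
void solve_example(float out[3]) {
  /* Column-major storage: each row of this initializer is one COLUMN of A. */
  const float A[9] = {
      1.f, 3.f, 4.f, /* column 0 */
      0.f, 1.f, 2.f, /* column 1 */
      0.f, 0.f, 1.f, /* column 2 */
  };
  out[0] = out[1] = out[2] = 1.f; /* b = (1, 1, 1) */
  naive_solver(3, A, out);
}
```

After the call, out holds the same $(1, -2, 1)$ we found by hand.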

These assumptions are so common that the BLAS API for this takes extra arguments to inform the implementation when these are the case. Here's the full signature in C:

enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};
enum CBLAS_TRANSPOSE {CblasNoTrans=111, CblasTrans=112, CblasConjTrans=113};
enum CBLAS_UPLO {CblasUpper=121, CblasLower=122};
enum CBLAS_DIAG {CblasNonUnit=131, CblasUnit=132};

void cblas_strsv(const enum CBLAS_ORDER order, const enum CBLAS_UPLO Uplo,
                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_DIAG Diag,
                 const int N, const float *A, const int lda, float *X,
                 const int incX);

The name strsv encodes a few facts about the function. The leading s stands for "single-precision" and the trailing v stands for "vector". The base name of the function is therefore trs, which is short for "triangular solve". Thus, the function solves a triangular matrix-vector equation in single precision (i.e. float).

The order argument determines whether the input matrix will be treated as row major or column major. To be column-major simply means that adding 1 to the pointer into the matrix will move down one row (i.e. within the current column); similarly, row-major means that adding 1 moves to the right one column. The Uplo argument tells the implementation whether we're giving it a lower or upper triangular matrix. The TransA argument allows the user to ask that BLAS implicitly transpose the matrix (or conjugate transpose it in the case of complex values) while solving. Finally, the Diag argument tells strsv whether the main diagonal is all 1s.
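To make the layout distinction concrete, here is a minimal sketch of the index arithmetic behind the two storage orders. The helper names are made up for illustration and are not part of CBLAS; lda plays the role of the leading dimension from the signature above, i.e. the distance in elements between consecutive columns (column-major) or consecutive rows (row-major).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of element (row, col) in column-major storage. */
size_t col_major_index(size_t row, size_t col, size_t lda) {
  assert(row < lda);      /* lda must be at least the number of rows */
  return row + col * lda; /* adding 1 moves down one row */
}

/* Hypothetical helper: offset of element (row, col) in row-major storage. */
size_t row_major_index(size_t row, size_t col, size_t lda) {
  assert(col < lda);      /* lda must be at least the number of columns */
  return row * lda + col; /* adding 1 moves right one column */
}
```

One consequence is that reading a column-major buffer as if it were row-major yields the transpose: col_major_index(i, j, lda) equals row_major_index(j, i, lda).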

So we can implement a function with the same signature and contract as above using the BLAS library like so:

void blas_solver(int n, float* L, float* x) {
  cblas_strsv(CblasColMajor, CblasLower, CblasNoTrans, CblasUnit,
              n, L, n, x, 1);
}

Now is a good time to benchmark these two implementations to get some idea of how far off we are.

Benchmarking setup

First things first: we need to understand how much work we're doing. It's pretty clear that we're doing $O(n^2)$ operations, but it's easy enough to get an exact count. If we look at the naive algorithm, we'll notice that the innermost update consists of two floating-point operations: (1) the multiplication between x[j] and L[i + n * j], and (2) the subtraction of the resulting value from x[i]. Then the inner loop runs between $j+1$ (inclusive) and $n$ (exclusive). That's $n-j-1$ iterations in total. The outer loop runs from $j=0$ to $j=n-1$. In math terms, the total number of FLOPS is:

$2 \cdot \sum_{j=0}^{n-1} (n-j-1) = 2 \cdot \sum_{j=0}^{n-1} j = 2 \cdot \frac{n \cdot (n-1)}{2} = n \cdot (n-1)$

So to solve an instance with an $n\times n$ matrix, we must perform $n\cdot (n-1)$ floating-point operations.
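The count is easy to double-check by brute force. The following sketch tallies two operations per inner-loop iteration and compares that against the closed form; the gflops helper is a hypothetical illustration of how a measured wall-clock time could be scaled into a throughput number, not the code used for the benchmarks in this post.

```c
#include <assert.h>

/* Counts the floating-point operations of the naive loop nest directly. */
long long count_flops(int n) {
  long long flops = 0;
  for (int j = 0; j < n; ++j) {
    for (int i = j + 1; i < n; ++i) {
      flops += 2; /* one multiply and one subtract per inner iteration */
    }
  }
  return flops;
}

/* Closed form derived above: n * (n - 1). */
long long flops_closed_form(int n) {
  return (long long)n * (n - 1);
}

/* Hypothetical helper: converts a measured solve time into GFLOP/s. */
double gflops(int n, double seconds) {
  assert(seconds > 0.0);
  return (double)flops_closed_form(n) / seconds / 1e9;
}
```

For $n = 4096$ this comes to 16,773,120 operations per solve.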

We're going to use AVX2 to optimize this routine because it's still a bit more widely available than AVX-512 (and because it doesn't have quite so extreme CPU frequency offsets). I have benchmarking set up on GitHub Actions. At time of writing, the cloud runners have Xeon 8171M CPUs clocked down to 2.3GHz. I also tested locally on my i9-7900X workstation. Both CPUs are Skylake, so I compile with -march=skylake on GCC.

We're going to test against both OpenBLAS and MKL. By default, both BLAS-es dispatch the APIs to hardware-specific implementations by sniffing CPU flags. Since the GitHub Actions runners support AVX-512, this would pose a challenge. Fortunately, both BLAS-es offer ways to override this. When compiling OpenBLAS, we may set -DTARGET=HASWELL on the CMake command line. For MKL, we can run export MKL_ENABLE_INSTRUCTIONS=AVX2. To keep things on one core, we can export OPENBLAS_NUM_THREADS=1 and link to the sequential MKL library.

To get a full picture of performance, we'll test on a variety of matrix sizes so that we can see how we perform when the data fits inside L1, L2, or L3 cache, plus when it spills out into RAM. The L3 cache of the GitHub Actions chips is 35.75MB in size. Without getting too much into the math, there's 4 bytes per float and less than $n^2$ data in our working set. So using matrices at least as large as $3000\times 3000$ will exceed L3. To be safe, we'll use $n=4096$ as the upper bound.

Finally, I'll use Google Benchmark to compute performance numbers and use the formula we derived above to scale raw time into FLOPS.

So here's our baseline:

Keeping in mind that GCC has already auto-vectorized the naive implementation, there doesn't seem to be a lot of headroom here. Roughly speaking, it looks like the naive solution runs at about 8 GFLOPS, while MKL runs around 12 GFLOPS or 50% faster. OpenBLAS is generally slower, but seems to do slightly better than MKL when the size of the matrix is just about to escape the computer's L3 cache. Naturally, once we hit RAM, the $O(n^2)$ work just isn't enough to hide the latency of the $O(n^2)$ memory. This is in stark contrast to matrix multiplication, which has $O(n^3)$ work to do.

A curious recursion

While analyzing the algorithm, I made one key observation: at the start of the inner loop on iteration $j$ of the outer loop, all the values of $x_i$ for $i \leq j$ are finalized. Thus, we can reformulate the problem into a recursive algorithm that solves the top $k \times k$ triangle first, then uses the first $k$ entries of $x$ along with the $(n-k) \times k$ rectangle below that triangle to update the remaining $n-k$ entries of $x$. Finally, we can solve the right $(n-k) \times (n-k)$ triangle with the updated bottom part of $x$.

This is a sort of divide and conquer approach to this problem. When I came up with it, I had never seen it before, but when I started poking around, I found some recent work by Elmar Peise and Paolo Bientinesi: "Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection". On the one hand, this was disappointing because my idea wasn't actually novel (hence, a blog post rather than a research paper); on the other hand, this was encouraging because it meant I was on the right track. Such is life.

Anyway, the next insight is that the way you combine the lower rectangle with the solved part of $x$ is to compute a matrix-vector product between them and subtract the result from the unsolved part of $x$. To see that, look at the computation we're doing:

x[i] = x[i] - L[i, j] * x[j]

Now $j$ ranges over $[0, k)$, because we already handled the top triangle. We also know that $i$ ranges from $k$ to $n$. This code then becomes the following, in numpy-esque vector notation:

x[k:n] = x[k:n] - L[k:n, 0:k] * x[0:k]

Very helpfully, the BLAS contains an operation, sgemv, that does exactly this. So the lazy way to implement the recursive algorithm is to reduce it to sgemv like this:

void solve_dnc(int n, float *L, int lda, float *x) {
  if (n <= BASE_CASE_LIMIT) {         // Naive algorithm.
    for (int i = 0; i < n; ++i) {     // GCC happens to generate better code
      for (int j = 0; j < i; ++j) {   // for this loop order. Don't know why.
        x[i] -= x[j] * L[i + j * lda];
      }
    }
  } else {
    int k = n / 2; // WOAH - this one line determines the algorithm

    // Upper triangle -- reads L(:k,:k), x(:k); writes x(:k)
    solve_dnc(k, L, lda, x);

    // Rectangle -- reads L(k:,:k), x(:k); writes x(k:)
    cblas_sgemv(CblasColMajor, CblasNoTrans, n - k, k, -1.f, L + k, lda,
                x, 1, 1.f, x + k, 1);

    // Right triangle -- reads L(k:,k:), x(k:); writes x(k:)
    solve_dnc(n - k, L + lda * k + k, lda, x + k);
  }
}

This version takes an extra parameter, lda, to manage the distance between columns independently of the logical dimension. The neat thing about this characterization is that it lets us explore the space of optimizations entirely by varying the function that calculates $k$ . In this case, we chose a recursive approach, but we could also set it to, say, BASE_CASE_LIMIT to proceed in blocks of columns (spoiler alert!), or to n - BASE_CASE_LIMIT to proceed in blocks of rows. Various hybrid approaches could be designed off of this, too, all by varying that one line of code.
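To illustrate that last point, here is a minimal, self-contained sketch in which the choice of $k$ is passed in as a function. Everything here is an assumption made for illustration: the names solve_split, split_fn, and matches_naive are invented, BASE_CASE_LIMIT is fixed at 4, and the sgemv call is replaced by plain loops so the code compiles without a BLAS library. The test matrix uses small integer entries so that every strategy produces bit-identical results regardless of summation order.

```c
#include <assert.h>

enum { BASE_CASE_LIMIT = 4 }; /* assumed value, chosen only for this sketch */

typedef int (*split_fn)(int n); /* hypothetical: picks k for a size-n problem */

int split_half(int n)    { return n / 2; }                   /* recursive */
int split_columns(int n) { (void)n; return BASE_CASE_LIMIT; } /* column blocks */
int split_rows(int n)    { return n - BASE_CASE_LIMIT; }      /* row blocks */

/* Same structure as solve_dnc, with the split policy as a parameter. */
void solve_split(int n, const float *L, int lda, float *x, split_fn split) {
  if (n <= BASE_CASE_LIMIT) { /* naive base case */
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < i; ++j)
        x[i] -= x[j] * L[i + j * lda];
    return;
  }
  int k = split(n);
  assert(0 < k && k < n);

  solve_split(k, L, lda, x, split); /* top-left triangle */
  for (int j = 0; j < k; ++j)       /* rectangle: plain-loop stand-in for sgemv */
    for (int i = k; i < n; ++i)
      x[i] -= x[j] * L[i + j * lda];
  solve_split(n - k, L + lda * k + k, lda, x + k, split); /* right triangle */
}

/* Returns 1 if the strategy reproduces the naive column-by-column solve. */
int matches_naive(int n, split_fn split) {
  enum { MAX_N = 16 };
  float L[MAX_N * MAX_N], expected[MAX_N], actual[MAX_N];
  assert(0 < n && n <= MAX_N);
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i) /* entries in {-1, 0, 1}; upper half is junk */
      L[i + j * n] = (i == j) ? 1.f : (float)((i + 2 * j) % 3 - 1);
  for (int i = 0; i < n; ++i)
    expected[i] = actual[i] = (float)(i % 2 + 1);
  for (int j = 0; j < n; ++j)
    for (int i = j + 1; i < n; ++i)
      expected[i] -= expected[j] * L[i + n * j];
  solve_split(n, L, n, actual, split);
  for (int i = 0; i < n; ++i)
    if (actual[i] != expected[i]) return 0;
  return 1;
}
```

Calling matches_naive(16, split_half), matches_naive(16, split_columns), or matches_naive(16, split_rows) exercises the three policies named above against the same reference.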

There are some clear disadvantages here. This isn't tail recursive, so it will take some extra stack space and cost some function call overhead. The compiler also can't inline strsv since it's squirreled away in a shared library and is very, very proprietary (so no-go on LTO). Still, this exercise has clearly exposed our best vectorization opportunity. It would be very difficult to vectorize a small triangle, but maybe we can get away with only doing $O(n)$ serial triangles, and $O(n^2)$ easier-to-vectorize rectangles.

Since I know you're curious, this is how well the $n/2$ divide and conquer approach works.

It's surprisingly in the ballpark when using OpenBLAS's sgemv. What's interesting is that for at least one matrix size, it ever-so-slightly edges out MKL, despite being built from OpenBLAS. This could be a fluke, but I bet there's an even better optimization than any in this article that we just haven't found yet.

Lower-level optimization

I played around with the divide-and-conquer approach for a bit and settled on a split "function" of simply $k=8$ . That corresponds to looping over 8-wide block columns of the matrix, solving the $8\times 8$ triangle at the top and then the whole rectangle beneath it. It seemed to perform best on my workstation, and so I set out to "inline" everything and get it cleaned up. Here it is, chunk by chunk.

Note, for simplicity, I'm specializing this code to multiple-of-8 matrix sizes. Extending it to other matrix sizes only requires dealing with a small leftover rectangle at the bottom of each block column. It's just another code path, and the same basic strategies apply. It's a good exercise, but too much for a blog post. Also, as you'll see, the resulting code is so much faster that MKL could get a boost just by testing the matrix size and then dispatching to this solver if it fits. That one branch up front would cost next to nothing.

First, we'll declare the function and start looping:

void update_blocked(int n, const float *L, int lda, float *x) {
  while (true) {

So why are we using an infinite loop here rather than a for loop over the block columns? Well, remember that we're going to solve a triangle, then a rectangle, then a triangle, and so on until we hit the rightmost triangle, which has no rectangle underneath it. So we want to exit the loop right away without testing the conditions for the would-be for loop or for the rectangle code again. Here's the code for the triangle and early stopping:

    // Handle triangle at top of block column
    for (int j = 0; j < 8; ++j) {
      for (int i = j + 1; i < 8; ++i) {
        x[i] -= x[j] * L[i + lda * j];
      }
    }

    n -= 8; // Last iteration doesn't have a rectangle
    if (n <= 0) { return; }

At this point, we have solved the first 8 values of $x$. We subtract 8 from $n$ right away since the following code operates on the shorter rectangle. Now we're going to take those 8 values we just computed and broadcast them into 8 vector registers. We first create a typedef to use GCC's vector types feature,

// Vector of 8 single-precision floats
typedef float v8sf __attribute__((vector_size(32), aligned(1)));

and then create an array of these with the broadcast values:

    v8sf x_solved[8];
    for (int i = 0; i < 8; i++) {
      x_solved[i] = _mm256_broadcast_ss(&x[i]);
    }

Because we're using GCC's vector types and its own intrinsics, it is smart enough to compile this into exactly 8 instructions that load the values into registers. So there's no overhead from the loop or from the array. We load these values into registers now because they're involved in every computation in the rectangle, so we don't want to constantly reload them from memory. We broadcast them so that we can load individual columns into vectors from inside the block column. For example, we can take a vector from the first column in the block, multiply it by x_solved[0] and then subtract it from the corresponding portion of x.

To set this up, we'll advance L to point to the top of the rectangle and advance x to point to the first unsolved portion and then enter the loop:

    L += 8;
    x += 8;

    for (int i = 0; i < n; i += 8) {

The first order of business is to load a vector's worth of the unsolved chunk of x. We have to do an unaligned load (loadu) because alignment wasn't in our assumptions and because aligning it would take too long (remember, $O(n^2)$ on both operations and memory).

      v8sf x_i = _mm256_loadu_ps(&x[i]);

Then we'll load an $8 \times 8$ patch of $L$ into vectors using the same trick as above.

      v8sf L_patch[8];
      for (int j = 0; j < 8; j++) {
        L_patch[j] = _mm256_loadu_ps(&L[i + lda * j]);
      }

Finally, we update the unsolved vector using that patch of values from the matrix. We write the vector back to x and advance L to the tip of the next triangle, ready to repeat the process.

      for (int j = 0; j < 8; j++) {
        x_i -= x_solved[j] * L_patch[j];
      }

      _mm256_storeu_ps(&x[i], x_i);
    } // for i

    L += lda * 8;
  } // while true
}
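When experimenting with a routine like this, it helps to have a portable reference with the same block-column structure to compare against. The sketch below is one possible version under the same multiple-of-8 assumption: plain loops stand in for the AVX2 vector operations, and the names update_blocked_reference, entry, and blocked_matches_naive are invented for illustration. The test matrix is sparse with small integer entries so that the comparison with the naive solver can be exact.

```c
#include <assert.h>

/* Scalar stand-in for the blocked solver; n must be a positive multiple of 8. */
void update_blocked_reference(int n, const float *L, int lda, float *x) {
  assert(n > 0 && n % 8 == 0);
  while (1) {
    /* Triangle at the top of the block column. */
    for (int j = 0; j < 8; ++j)
      for (int i = j + 1; i < 8; ++i)
        x[i] -= x[j] * L[i + lda * j];

    n -= 8; /* last block column has no rectangle */
    if (n <= 0) return;

    const float *x_solved = x; /* the 8 values that were just finalized */
    L += 8;
    x += 8;

    /* Rectangle: each group of 8 rows plays the role of one vector. */
    for (int i = 0; i < n; i += 8)
      for (int j = 0; j < 8; ++j)
        for (int lane = 0; lane < 8; ++lane)
          x[i + lane] -= x_solved[j] * L[i + lane + lda * j];

    L += lda * 8; /* tip of the next triangle */
  }
}

/* Hypothetical test matrix with small integers so float math stays exact. */
float entry(int i, int j) {
  if (i < j) return 7.f;  /* upper triangle: junk that must never be read */
  if (i == j) return 1.f; /* unit diagonal */
  if (i - j == 1) return -1.f;
  if (i - j == 3) return 1.f;
  return ((i - j) % 8 == 0) ? 1.f : 0.f;
}

/* Returns 1 if the blocked reference agrees with the naive solver. */
int blocked_matches_naive(int n) {
  enum { MAX_N = 24 };
  float L[MAX_N * MAX_N], expected[MAX_N], actual[MAX_N];
  assert(n <= MAX_N);
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
      L[i + n * j] = entry(i, j);
  for (int i = 0; i < n; ++i)
    expected[i] = actual[i] = (float)(i % 2 + 1);
  for (int j = 0; j < n; ++j)
    for (int i = j + 1; i < n; ++i)
      expected[i] -= expected[j] * L[i + n * j];
  update_blocked_reference(n, L, n, actual);
  for (int i = 0; i < n; ++i)
    if (actual[i] != expected[i]) return 0;
  return 1;
}
```

Running the same comparison against the vectorized update_blocked on a few multiple-of-8 sizes is a cheap way to catch indexing mistakes.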

The assembly generated for the rectangle loop is as short as can be. Just thirteen instructions, almost all vectorized. You can see the full assembly on Godbolt, here: https://godbolt.org/z/YGWfoz9fs.

.L5: vmovups ymm0, YMMWORD PTR [r8+32+rax*4]
     vfnmadd213ps    ymm0, ymm8, YMMWORD PTR [rsi+32+rax*4]
     vfnmadd231ps    ymm0, ymm7, YMMWORD PTR [r15+32+rax*4]
     vfnmadd231ps    ymm0, ymm6, YMMWORD PTR [r14+32+rax*4]
     vfnmadd231ps    ymm0, ymm5, YMMWORD PTR [r13+32+rax*4]
     vfnmadd231ps    ymm0, ymm4, YMMWORD PTR [r12+32+rax*4]
     vfnmadd231ps    ymm0, ymm3, YMMWORD PTR [rbx+32+rax*4]
     vfnmadd231ps    ymm0, ymm2, YMMWORD PTR [rdi+32+rax*4]
     vfnmadd231ps    ymm0, ymm1, YMMWORD PTR [rcx+32+rax*4]
     vmovups YMMWORD PTR [rsi+32+rax*4], ymm0
     add     rax, 8
     cmp     r9d, eax
     jg      .L5

The beauty of this is how it minimizes memory traffic. We're streaming memory in from $L$ exactly the one time we need it, as part of the instruction that needs it. In the assembly above, ymm0 stores the unsolved vector from $x$ , while ymm1-8 store the broadcast solved values.

The code for the triangle is messy and mostly scalar, but I stopped trying to optimize once I saw this:

At least on GitHub Actions, this blocked solver is never slower than MKL. At peak, it's nearly twice the speed of the naive solver and 50% faster than MKL, roughly. This is why I said I didn't want to bother with non-multiple-of-8 sizes earlier. The dispatch would be totally lost in the gap.

Conclusion

The triangular solver routine must not get a lot of love in BLAS implementations. Judging by the performance of my divide and conquer solver, I wouldn't be surprised if MKL and OpenBLAS were just using (an inlined version of) their own sgemv routines without giving this one any special attention. Still, the results-to-effort ratio here is pretty striking.

It would be an interesting exercise to build a full-strength solver that handles all matrix sizes, row-major layouts, double precision, etc. but that's too much for one blog post (and too much for my purposes of understanding the design space of this algorithm better).

CMake IS a Build System

2021-03-13T21:37:00-08:00

One of the most common things you'll hear when learning CMake is that "CMake is not a build system". This is technically correct, depending on one's definition of a "build system". However, this statement alone is meaningless on a practical level as it doesn't communicate anything actionable regarding how to approach CMake. It just invites semantics games. The slightly clickbait headline aside, my goal in this article is to unpack what CMake really is in a way you can hopefully use to understand CMake better.

Still, I do understand why people like saying this: technically correct is the best kind of correct, after all.

What is a build system?

To even have this discussion, we'll have to pin down a definition of what a build system is. Let's ask Jeff Atwood, co-creator of the venerable StackExchange:

The value of a build script is manifold. Once you have a build script together, you've created a form of living documentation: here's how you build this crazy thing. And naturally this artifact is checked into source control, right alongside the files necessary to build it (and even the database necessary to run it, too). From there, you can begin to think about having that script run on a neutral build server to avoid the "Works On My Machine" syndrome. [...]

This is from his blog post "The F5 Key Is Not a Build Process". This was written a while ago, in 2007, somewhat before CMake became wildly popular. It was also written in the context of C#, which is more tolerant of "just click 'build' in Visual Studio" workflows than C++, which isn't managed.

Still, it touches on a very important point, which is that a build system serves as a source of truth for how to build your software. If that's the essence of what a build system is, then CMake fits the bill.

Maybe you don't believe Jeff. After all, he says "build process" rather than "build system", so maybe he's talking about something else. Let's ask academia. The 2018 paper, "Build Systems à la Carte" by Andrey Mokhov, Neil Mitchell, and Simon Peyton Jones (of Haskell fame), gives a rigorous definition:

Keys, values, and the store. The goal of any build system is to bring up to date a store that implements a mapping from keys to values. In software build systems the store is the file system, the keys are file names, and the values are file contents. [...]
Task description. Any build system requires the user to specify how to compute the new value for one key, using the (up to date) values of its dependencies. We call this specification the task description. For example, [...] in Make the rules in the makefile are the task description.
Build system. A build system takes a task description, a target key, and a store, and returns a new store in which the target key and all its dependencies have an up to date value.
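To make these three definitions concrete, here is a toy model in C. It is only a sketch of the vocabulary, not code from the paper: keys are array indices, values are integers, and the names Store, Task, build, and build_b2 are all invented for illustration. It also ignores how real build systems decide what is out of date; each key is simply computed at most once per store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_KEYS = 4, MAX_DEPS = 2 };

/* Store: a mapping from keys (array indices here) to values. */
typedef struct {
  int value[NUM_KEYS];
  bool up_to_date[NUM_KEYS];
} Store;

/* Task description: how to compute one key from its dependencies. */
typedef struct {
  int num_deps;
  int deps[MAX_DEPS];
  int (*compute)(const int *dep_values); /* NULL marks an input key */
} Task;

int add(const int *v) { return v[0] + v[1]; }
int twice(const int *v) { return 2 * v[0]; }

/* Hypothetical spreadsheet-like example: A1 and A2 are inputs,
   B1 = A1 + A2, and B2 = 2 * B1. */
enum { A1, A2, B1, B2 };
const Task example_tasks[NUM_KEYS] = {
    [A1] = {0, {0}, NULL},
    [A2] = {0, {0}, NULL},
    [B1] = {2, {A1, A2}, add},
    [B2] = {1, {B1}, twice},
};

/* Build system: brings target and all its dependencies up to date. */
void build(const Task *tasks, int target, Store *store) {
  if (store->up_to_date[target]) return;
  const Task *t = &tasks[target];
  if (t->compute != NULL) {
    int dep_values[MAX_DEPS];
    for (int d = 0; d < t->num_deps; ++d) {
      build(tasks, t->deps[d], store); /* dependencies first */
      dep_values[d] = store->value[t->deps[d]];
    }
    store->value[target] = t->compute(dep_values);
  }
  store->up_to_date[target] = true;
}

/* Hypothetical driver: sets the inputs, builds B2, and returns its value. */
int build_b2(int a1, int a2) {
  Store store = {{0}, {false}};
  store.value[A1] = a1;
  store.value[A2] = a2;
  build(example_tasks, B2, &store);
  return store.value[B2];
}
```

In this model, build is the only part that actually runs tasks, and example_tasks is the task description.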

According to this definition, technically, CMake is not a build system because it isn't responsible for running your tasks, so it can't bring the store "up to date", but it does have a full task description language which assumes dependencies on files and their time stamps.

On the other hand, this is a build system:

#!/bin/bash -e
cd "$(dirname "${BASH_SOURCE[0]}")"
cmake -G Ninja -S . -B _build "$@"
cmake --build _build

The keys, values, and store are the same as they are for every conventional build system: the file-system contents. The task description is now firmly the CMakeLists.txt and the build system is this script. The fact that it calls Ninja is an implementation detail. This is also technically correct.

Mokhov et al. is a fascinating paper, and you should absolutely read it (did you know that Microsoft Excel is a build system?); but the purpose of their research is to taxonomize the ways various build systems model tasks and dependencies, and then carry out execution plans over those dependencies. It's not about pragmatic questions concerning the software lifecycle, but about the design space of certain tools that serve a particular purpose therein.

The descriptivist definition of "build system" would be much closer to what Jeff has in mind. When most people think about build systems, they aren't narrowly constraining themselves to the actual tool that invokes the compiler. For their purposes, the meta/non-meta distinction doesn't affect how they interact with CMake.

A compiler's look at CMake

So why do people bother to draw this distinction? What do people think is actually meaningful about CMake being a "meta build system" or a "build system generator" rather than a plain "build system"? There isn't similar controversy about the GNU Build System (i.e., Autotools), and it also has separate configuration and build invocation steps. Heck, it popularized that process. Ever see this?

$ ./configure && make && make install

Yes, the configure script isn't a build system on its own, but you always run make afterwards. Autotools and CMake both call themselves build systems. Are they wrong? Sure, but only technically.

In the most common case, both CMake and Autotools are the single source of truth for building their respective projects. In order to build such software, you have to go through CMake (resp. Autotools) first. You get your pick of execution engine, but it's semantically irrelevant (ideally). There is a bug either in your CMake code or in CMake itself if you get different results from one backend versus another.

In 2018, David Chisnall wrote an article in ACM Queue titled "C Is Not a Low-level Language". The tagline, "Your computer is not a fast PDP-11", distills the central point of the article: that thinking about C programming in tandem with your target architecture is incorrect, because C targets an abstract machine which has its own semantics that the compiler is responsible for mapping to the target assembly language. There are some fascinating pitfalls detailed in the article, like how undefined behavior and pointer provenance can disable "obvious" optimizations (like loop unswitching) and delete null checks.

By analogy, CMake is not Make. Nor is it Visual Studio, or Ninja, or any of its many target backends. If the CMake generator is the architecture, then CMake code is C, the abstract build model it creates is the abstract machine, and targets with generator expressions are its IR. It is accurate to say that CMake is a domain-specific language for metaprogramming an abstract build model, which is assembled into input files for a build execution engine (*ahem* build system) of your choice.

When you search for "CMake is not a build system", this reminder appears in a few different contexts. Sometimes it's cited as an advantage, for example, when JetBrains says:

Yet another benefit, is that CMake is not a build system in the general meaning and doesn't lock its users on one particular build system: users are free to use make/Ninja/etc to actually build the products; and that's a huge advantage since neither build tool is suitable in all situations.

Other times, it shows up to explain why something doesn't work how you'd expect in CMake. Several other blog posts make this claim to explain why globbing for sources is discouraged in CMake.

In these cases, I think it's much more useful to be precise and say "CMake is not Make" as shorthand for the full truth: CMake's abstract build model must trade off between being a leaky abstraction and constraining itself to the least common denominator among its targets. Just because you can glob for sources in GNU Make doesn't mean that it's appropriate to do in CMake (I could write a whole article on just this point; maybe I will). The reason for this isn't that "CMake is not a build system"; it's that globbing happens during metaprogramming and doesn't make it into the final program (with 3.12+ there's CONFIGURE_DEPENDS, but it's unreliable, and the devs still discourage it).
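
To make the globbing point concrete, here is a minimal sketch that contrasts an explicit source list with a glob. The target names and the src/ layout are hypothetical, chosen only for illustration.

```cmake
cmake_minimum_required(VERSION 3.19)
project(glob_example LANGUAGES CXX)

# Explicit list: these file names are part of the CMake program itself.
# Adding a source means editing this file, and the generated build re-runs
# CMake whenever this file changes.
add_library(mylib_explicit src/a.cpp src/b.cpp)

# Glob: the pattern is expanded while CMake runs (the "metaprogramming"
# step). The generated build only receives the resulting list of files.
# CONFIGURE_DEPENDS (CMake 3.12+) asks the generated build to re-check the
# glob on every build, with the caveats mentioned above.
file(GLOB mylib_sources CONFIGURE_DEPENDS
     "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp")
add_library(mylib_globbed ${mylib_sources})
```

Without CONFIGURE_DEPENDS, dropping a new c.cpp into src/ changes nothing until CMake is re-run by hand, because the build execution engine never saw the pattern, only its expansion.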

There are certainly deficiencies in CMake's abstract model and (especially) its metaprogramming / scripting language. I think it's more productive to talk about those things in clear language than it is to wave your hands and say "CMake is not a build system".

Building a Dual Shared and Static Library with CMake

2021-03-06T21:37:00-08:00

When packaging software libraries, it is a common requirement to deploy both a static and a shared version. However, CMake library targets are always either one or the other. How do we make it easy for our users to choose which one they want to link to, and why is this difficult to begin with?

In this article we're going to design a CMake build and find_package script that enables library users to easily choose and switch between the two library types. This also serves as a basic project template for a modern CMake library build. The main thing it's missing is handling dependencies.

TLDR: See this GitHub repo with the full code, complete with GitHub Actions testing.

Design Philosophy

So why is it tricky to provide both a static and shared version of a library in CMake? The core issue is that a CMake library target models the build and usage requirements for a single library configuration. When you import SomeLib::SomeLib from a package, the library type is already determined by the time you link another target to it. On the build side, this means that a single library target corresponds to a single physical library on the system.

Static and shared libraries are typically produced from the same set of sources, too, so new CMake users sometimes expect that a single call to add_library will provide whatever mix of types they want. However, this is fundamentally incompatible with CMake's model of linking, which admits no properties on the link itself. It would also make it harder to make independent decisions about position-independent code. Although most desktop systems (especially Linux) favor PIC for its security benefits (see: ASLR), many embedded systems with slow CPUs and strict power budgets either don't want or can't afford the overhead and prefer to link statically. This often means that static and shared libraries cannot share object files.

There's also no good guidance inside CMake for solving this problem from the find_package side. Some modules, like FindCUDAToolkit, use separate targets for each type. Others, like FindHDF5 and FindOpenSSL, use variables with no common convention: HDF5 uses HDF5_USE_STATIC_LIBRARIES while OpenSSL uses OPENSSL_USE_STATIC_LIBS.
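
As a rough illustration of how inconsistent this is for a consumer, the following sketch requests static libraries from each of the three modules mentioned above. The app target is hypothetical, and the snippet assumes all three packages are installed with static libraries available.

```cmake
cmake_minimum_required(VERSION 3.19)
project(conventions_example LANGUAGES C CXX)

add_executable(app main.cpp)  # hypothetical consumer target

# OpenSSL: a variable, set before the search, selects the library type.
set(OPENSSL_USE_STATIC_LIBS TRUE)
find_package(OpenSSL REQUIRED)
target_link_libraries(app PRIVATE OpenSSL::SSL)

# HDF5: the same idea with a differently spelled variable.
set(HDF5_USE_STATIC_LIBRARIES TRUE)
find_package(HDF5 REQUIRED)
target_link_libraries(app PRIVATE hdf5::hdf5)  # imported target, CMake 3.19+

# CUDA toolkit: no variable at all; the type is encoded in the target name.
find_package(CUDAToolkit REQUIRED)
target_link_libraries(app PRIVATE CUDA::cudart_static)  # or CUDA::cudart
```

Three packages, three different spellings for the same decision, and none of them consults BUILD_SHARED_LIBS.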

So instead of copying a convention that doesn't exist, we will follow a few guiding principles while trying to establish a new convention:

The build interface should match the install interface. It is increasingly common to directly integrate third-party builds with the primary build using add_subdirectory or FetchContent. The end-user experience should not change when switching between these options and find_package.
Only strict build requirements belong in CMakeLists.txt. Anything that isn't absolutely necessary inevitably becomes an imposition on the end user. For a common example, if the end user compiles with -Werror and you compile with -Wall, then their compiler might throw a warning your compiler didn't. Such settings belong in toolchain files or presets files (CMake 3.19+).
A single project will not mix both shared and static versions of a library. Certainly for a single target, it is totally illegal to link to both at the same time. This means we don't need to support mixing both types in a single directory.

The bar for clean CMake code is significantly higher for a library than for an application because the CMake code itself affects end users. For an application, some ugliness is tolerable because it doesn't propagate through the dependency tree (you don't typically link to executables). If an application does not provide a CMake package or if the package it provides is broken, it is easy enough to call find_program and have everything you need. On the other hand, a bad CMake build for a library might require complete replacement by a package maintainer. This is a surprisingly common scenario in vcpkg and is the ultimate condemnation of the upstream build. Don't write builds that have to be thrown out like this.

A Common but Flawed Solution

On the build side, a common solution is to create one target for each library type and give them separate names, like so:

set(sources ...)
add_library(SomeLib_static STATIC ${sources})
add_library(SomeLib_shared SHARED ${sources})

Unfortunately, this fails to meet our design criteria.

Most users who invoke the build directly need only one of the two types, so this approach doubles the compilation time for them. Using an object library doesn't help since it would force position independent code on the static library. Although users who directly include the build may use EXCLUDE_FROM_ALL to build only what is needed, this is a relatively obscure feature and requires extra code in the FetchContent case.
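
For reference, here is a sketch of what the EXCLUDE_FROM_ALL workaround looks like from the consumer's side. The repository URL, tag, and main.cpp are placeholders, and the snippet is written against CMake 3.19, where FetchContent_MakeAvailable offers no way to pass the flag.

```cmake
cmake_minimum_required(VERSION 3.19)
project(consumer LANGUAGES CXX)

include(FetchContent)
FetchContent_Declare(
    somelib
    GIT_REPOSITORY https://example.com/somelib.git  # hypothetical URL
    GIT_TAG v1.0.0)                                 # hypothetical tag

# FetchContent_MakeAvailable(somelib) would add every target to the default
# build. To pass EXCLUDE_FROM_ALL, populate and add the directory manually.
FetchContent_GetProperties(somelib)
if (NOT somelib_POPULATED)
    FetchContent_Populate(somelib)
    add_subdirectory("${somelib_SOURCE_DIR}" "${somelib_BINARY_DIR}"
                     EXCLUDE_FROM_ALL)
endif ()

add_executable(main main.cpp)
# Only the target that is actually linked gets built.
target_link_libraries(main PRIVATE SomeLib_static)
```

With a plain add_subdirectory the flag is a single extra keyword, but as the sketch shows, the FetchContent case needs the manual populate dance. As far as I know, CMake 3.28 and later accept EXCLUDE_FROM_ALL directly in FetchContent_Declare, which is no help to users pinned to older versions.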

If your package exports just these targets, it forces the user to make an up-front decision about whether to link statically or dynamically and then propagates that decision transitively. Often, the decision whether to use static or dynamic libraries belongs to the package maintainer. For instance, Linux distributions generally require their packages to not have statically linked dependencies and prefer libraries to dynamically link to system packages. It has to be possible to create and install only one of these libraries, without patching your build or your users' builds.

Robert Schumacher, a lead developer for vcpkg, cautions against this exact practice in both his CppCon 2018 and CppCon 2019 talks. In another talk, he explains that vcpkg is sometimes forced to inject code that redirects the static target to the shared one (or vice versa) when only one was built and installed.

The Ideal User Experience

So what should we do? Let's start by examining the ideal user experience for using our library.

cmake_minimum_required(VERSION 3.19)
project(example)

find_package(SomeLib REQUIRED)

add_executable(main main.cpp)
target_link_libraries(main PRIVATE SomeLib::SomeLib)

This looks great, but... there's nothing in there that says whether SomeLib::SomeLib should be shared or static! How does this solve anything?

Normally, the user sets SomeLib_ROOT or CMAKE_PREFIX_PATH to a path that contains exactly the one version of SomeLib at configure time. We need to keep supporting that pattern in our solution, but we also need to support a distribution that contains both versions.

Our first major insight is this: because the build interface should match the install interface, SomeLib::SomeLib should respect BUILD_SHARED_LIBS the same way FetchContent or add_subdirectory would. However, overriding this (or any) variable for one find_package call is a bit complicated. The fully correct version—that preserves the existence and values of BUILD_SHARED_LIBS no matter whether it is a cache or normal variable—is this:

function(find_somelib)
    set(BUILD_SHARED_LIBS YES)
    find_package(SomeLib REQUIRED)
endfunction()

find_somelib()

When find_somelib() is called, it creates a new variable scope that is destroyed when it returns. Thus, the variable environment after the SomeLib package search succeeds is the same as it was before, so code that cares whether BUILD_SHARED_LIBS is a normal or cache variable (or defined at all) continues to work correctly. Saving and restoring the value of BUILD_SHARED_LIBS in the obvious way requires, first, a temporary variable and, second, a check before writing to BUILD_SHARED_LIBS whether it was defined to begin with.
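
For comparison, here is a sketch of that "obvious" save-and-restore approach. The saved_* variable names are made up for this example.

```cmake
# Remember whether BUILD_SHARED_LIBS was defined at all, and its value if so.
if (DEFINED BUILD_SHARED_LIBS)
    set(saved_BUILD_SHARED_LIBS_defined YES)
    set(saved_BUILD_SHARED_LIBS "${BUILD_SHARED_LIBS}")
else ()
    set(saved_BUILD_SHARED_LIBS_defined NO)
endif ()

set(BUILD_SHARED_LIBS YES)
find_package(SomeLib REQUIRED)

# Put things back: restore the old value, or remove the variable entirely.
if (saved_BUILD_SHARED_LIBS_defined)
    set(BUILD_SHARED_LIBS "${saved_BUILD_SHARED_LIBS}")
else ()
    unset(BUILD_SHARED_LIBS)
endif ()
unset(saved_BUILD_SHARED_LIBS)
unset(saved_BUILD_SHARED_LIBS_defined)
```

Even this is not quite faithful: if BUILD_SHARED_LIBS started out as a cache variable only, it ends up shadowed by a normal variable holding the same value. The function-scope version avoids that problem without any bookkeeping.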

On the other hand, the function also erases potentially useful variables set by the package. The targets are tied to the directory scope, so linking to SomeLib::SomeLib still works. If the package only provides targets, this is not an issue. If some variables are needed, one could set variables in the parent scope via set(... PARENT_SCOPE), but this is awful.
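
As a sketch of what that forwarding looks like, the function below re-exports two of the standard variables that find_package sets. Which variables a caller actually needs is an assumption here; every additional one requires its own line.

```cmake
function(find_somelib)
    set(BUILD_SHARED_LIBS YES)
    find_package(SomeLib REQUIRED)

    # Variables set by the package die with this function's scope unless
    # each one is copied to the caller by hand.
    set(SomeLib_FOUND "${SomeLib_FOUND}" PARENT_SCOPE)
    set(SomeLib_VERSION "${SomeLib_VERSION}" PARENT_SCOPE)
endfunction()

find_somelib()
message(STATUS "Found SomeLib ${SomeLib_VERSION}")
```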

Rather than forcing users to create bespoke functions to override a standard variable, the package will respect a new variable, SomeLib_SHARED_LIBS, that overrides BUILD_SHARED_LIBS. So now we can specify that we want shared libs from SomeLib at the command line with -DSomeLib_SHARED_LIBS=YES or we can enforce it in the CMakeLists.txt by simply setting it.

set(SomeLib_SHARED_LIBS YES)
find_package(SomeLib REQUIRED)

However, BUILD_SHARED_LIBS is supposed to be reserved for the user and not set by the build. It's no different for SomeLib_SHARED_LIBS; users should expect this variable to be respected as a configuration point. To enable a user to truly force SomeLib to be static or shared, we can use find_package's components mechanism:

find_package(SomeLib REQUIRED shared)  # or `static`

It is an error to request both static and shared components. If a single build needs both, it may separate its targets into two directories and call find_package with different components in each one. Since imported targets are not global by default, this works without any intervention on our part.
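
A build that really does need both might be laid out like the following sketch, shown as three files in one listing. The directory and target names are hypothetical.

```cmake
# ---- CMakeLists.txt (top level) ----
cmake_minimum_required(VERSION 3.19)
project(example LANGUAGES CXX)

add_subdirectory(with_static)
add_subdirectory(with_shared)

# ---- with_static/CMakeLists.txt ----
# SomeLib::SomeLib is imported into this directory's scope only.
find_package(SomeLib REQUIRED static)
add_executable(main_static main.cpp)
target_link_libraries(main_static PRIVATE SomeLib::SomeLib)

# ---- with_shared/CMakeLists.txt ----
# A sibling directory cannot see the target imported above, so the same
# name can refer to the shared library here.
find_package(SomeLib REQUIRED shared)
add_executable(main_shared main.cpp)
target_link_libraries(main_shared PRIVATE SomeLib::SomeLib)
```

Calling find_package at the top level instead would make the imported target visible in both subdirectories and defeat the separation.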

The Implementation

So now let's make this work! We're going to implement this around a very simple library that returns a random number. Here's the source file:

// src/random.cpp
#include "somelib/random.h"

namespace SomeLib {

// Thanks to XKCD 221 for this useful function!
int getRandomNumber() {
    return 42;  // chosen by fair dice roll.
                // guaranteed to be random.
}

}  // namespace SomeLib

Here's the corresponding header:

// include/somelib/random.h
#ifndef SOMELIB_RANDOM_H
#define SOMELIB_RANDOM_H

#include "somelib/export.h"

namespace SomeLib {

SOMELIB_EXPORT int getRandomNumber();

}

#endif  //SOMELIB_RANDOM_H

export.h is a generated export header that CMake will create for us. It provides the SOMELIB_EXPORT macro which tells the compiler which symbols to expose from the shared version of our library.

Build rules

Now the start of the build is mostly boilerplate.

cmake_minimum_required(VERSION 3.19)
project(SomeLib VERSION 1.0.0)

if (NOT DEFINED CMAKE_CXX_VISIBILITY_PRESET AND
    NOT DEFINED CMAKE_VISIBILITY_INLINES_HIDDEN)
  set(CMAKE_CXX_VISIBILITY_PRESET hidden)
  set(CMAKE_VISIBILITY_INLINES_HIDDEN YES)
endif ()

Since CMake doesn't warn you if you use a feature that is too new for the minimum version, you should always specify the minimum version that you actually test with.

The two visibility settings ensure that the shared library version doesn't export anything unintentionally. MSVC hides symbols by default, whereas GCC and Clang export everything. Exporting unintended symbols can cause conflicts and ODR violations as dependencies are added down the line, so libraries should always make their exports explicit (or at least use a linker script if retrofitting the code is too much). Still, if the user manually specifies a different setting, then we respect it.

Next, we'll implement the SomeLib_SHARED_LIBS override for the build interface that was discussed earlier.

if (DEFINED SomeLib_SHARED_LIBS)
    set(BUILD_SHARED_LIBS "${SomeLib_SHARED_LIBS}")
endif ()

Now we can create the library. To keep the build and install interfaces consistent, we also create an alias SomeLib::SomeLib. The version properties make sure that namelinks and solinks are created for the shared library.

add_library(SomeLib src/random.cpp)
add_library(SomeLib::SomeLib ALIAS SomeLib)
set_target_properties(SomeLib PROPERTIES
                      VERSION ${SomeLib_VERSION}
                      SOVERSION ${SomeLib_VERSION_MAJOR})
target_include_directories(
    SomeLib PUBLIC "$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>")
target_compile_features(SomeLib PUBLIC cxx_std_17)

This assumes that we are using semantic versioning for the joint package and library version. Next we'll create the export header we saw earlier and attach it to the target. The GenerateExportHeader module assumes it's acting on a shared library, so we have to manually add SOMELIB_STATIC_DEFINE to the static build to avoid linker errors arising from DLL-import directives on Windows.

include(GenerateExportHeader)
generate_export_header(SomeLib EXPORT_FILE_NAME include/somelib/export.h)
target_compile_definitions(
    SomeLib PUBLIC "$<$<NOT:$<BOOL:${BUILD_SHARED_LIBS}>>:SOMELIB_STATIC_DEFINE>")
target_include_directories(
    SomeLib PUBLIC "$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/include>")

It would be very nice if generate_export_header set up the definitions and include paths automatically. This is the kind of busy-work that gives CMake a bad rap.

Finally, we'll add some packaging logic, but include it by default only if we're the top-level project. That insulates FetchContent users from our install rules if they don't want them, but keeps them available in case they do:

string(COMPARE EQUAL "${CMAKE_SOURCE_DIR}" "${CMAKE_CURRENT_SOURCE_DIR}" is_top_level)
option(SomeLib_INCLUDE_PACKAGING "Include packaging rules for SomeLib" "${is_top_level}")
if (SomeLib_INCLUDE_PACKAGING)
    add_subdirectory(packaging)
endif ()
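
From the other side, a FetchContent user who does want the install rules could opt in along these lines. This is a sketch: the repository URL and tag are placeholders.

```cmake
cmake_minimum_required(VERSION 3.19)
project(consumer LANGUAGES CXX)

include(FetchContent)

# A normal variable overrides the option()'s default because SomeLib
# requires CMake 3.19, which enables the NEW behavior of policy CMP0077.
set(SomeLib_INCLUDE_PACKAGING YES)

FetchContent_Declare(
    somelib
    GIT_REPOSITORY https://example.com/somelib.git  # hypothetical URL
    GIT_TAG v1.0.0)                                 # hypothetical tag
FetchContent_MakeAvailable(somelib)
```

Users who leave the variable alone get the default, which is off whenever SomeLib is not the top-level project.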

Packaging

Now we'll take a look at what goes into the packaging/CMakeLists.txt file.

include(GNUInstallDirs)
include(CMakePackageConfigHelpers)

if (NOT DEFINED SomeLib_INSTALL_CMAKEDIR)
   set(SomeLib_INSTALL_CMAKEDIR "${CMAKE_INSTALL_LIBDIR}/cmake/SomeLib"
       CACHE STRING "Path to SomeLib CMake files")
endif ()

The GNUInstallDirs module defines a bunch of variables that control the default behavior of the install() commands and picks sane defaults for every supported platform, including Windows. The name is mostly historical and should probably be changed. We'll use CMakePackageConfigHelpers later to create a required version compatibility script.

Since various package management systems (like vcpkg, Nuget, APT, etc.) have different standards for where to place CMake package config scripts, we create a cache variable, SomeLib_INSTALL_CMAKEDIR, to allow our users to control where those scripts go. We pick a common, safe default.

Now we'll add the logic to install our libraries and headers:

install(TARGETS SomeLib EXPORT SomeLib_Targets
        RUNTIME COMPONENT SomeLib_Runtime
        LIBRARY COMPONENT SomeLib_Runtime
        NAMELINK_COMPONENT SomeLib_Development
        ARCHIVE COMPONENT SomeLib_Development
        INCLUDES DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}")

install(DIRECTORY "${SomeLib_SOURCE_DIR}/include/" "${SomeLib_BINARY_DIR}/include/"
        TYPE INCLUDE
        COMPONENT SomeLib_Development)

When we install SomeLib, we add it to an export set called SomeLib_Targets. To support users who wish to package our library in separate runtime and development components, we create prefixed component names (to avoid clashes with other projects). We won't dwell on componentized packages here, but if you've ever noticed that Ubuntu provides separate libfoo and libfoo-dev packages, that's what this is for. To learn more, watch Craig Scott's CppCon 2019 talk, "Deep CMake for Library Authors".

Now we'll export our targets to a file specific to the library type:

if (BUILD_SHARED_LIBS)
    set(type shared)
else ()
    set(type static)
endif ()

install(EXPORT SomeLib_Targets
        DESTINATION "${SomeLib_INSTALL_CMAKEDIR}"
        NAMESPACE SomeLib::
        FILE SomeLib-${type}-targets.cmake
        COMPONENT SomeLib_Development)

When the library is built as a shared library, we get SomeLib-shared-targets.cmake and when it's built as a static library, we get SomeLib-static-targets.cmake. To turn this into a bona-fide CMake package, we need two files: SomeLibConfig.cmake and SomeLibConfigVersion.cmake. The latter is easy to auto-generate since we're using semantic versioning:

write_basic_package_version_file(
    SomeLibConfigVersion.cmake
    COMPATIBILITY SameMajorVersion)

The purpose of this file is to support the version number argument to find_package. It prevents an incompatible package from being loaded when a version number is specified. The meat of the CMake package is defined in SomeLibConfig.cmake, but we'll discuss that in just a moment. The last rule places these two files in the CMake installation directory.

install(FILES
        "${CMAKE_CURRENT_SOURCE_DIR}/SomeLibConfig.cmake"
        "${CMAKE_CURRENT_BINARY_DIR}/SomeLibConfigVersion.cmake"
        DESTINATION "${SomeLib_INSTALL_CMAKEDIR}"
        COMPONENT SomeLib_Development)
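
To illustrate what the version file buys a consumer, here is a sketch of how requests would fare against an installed SomeLib 1.0.0 under SameMajorVersion compatibility.

```cmake
# Accepted: same major version, and the installed 1.0.0 is not older than
# the requested 1.0.
find_package(SomeLib 1.0 REQUIRED)

# Each of these would be rejected by the same installation:
#   find_package(SomeLib 2 REQUIRED)    # different major version
#   find_package(SomeLib 1.1 REQUIRED)  # installed 1.0.0 is older than 1.1
```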

Now we'll see the package config file SomeLibConfig.cmake in all its glory.

cmake_minimum_required(VERSION 3.19)

set(SomeLib_known_comps static shared)
set(SomeLib_comp_static NO)
set(SomeLib_comp_shared NO)
foreach (SomeLib_comp IN LISTS ${CMAKE_FIND_PACKAGE_NAME}_FIND_COMPONENTS)
    if (SomeLib_comp IN_LIST SomeLib_known_comps)
        set(SomeLib_comp_${SomeLib_comp} YES)
    else ()
        set(${CMAKE_FIND_PACKAGE_NAME}_NOT_FOUND_MESSAGE
            "SomeLib does not recognize component `${SomeLib_comp}`.")
        set(${CMAKE_FIND_PACKAGE_NAME}_FOUND FALSE)
        return()
    endif ()
endforeach ()

if (SomeLib_comp_static AND SomeLib_comp_shared)
    set(${CMAKE_FIND_PACKAGE_NAME}_NOT_FOUND_MESSAGE
        "SomeLib `static` and `shared` components are mutually exclusive.")
    set(${CMAKE_FIND_PACKAGE_NAME}_FOUND FALSE)
    return()
endif ()

set(SomeLib_static_targets "${CMAKE_CURRENT_LIST_DIR}/SomeLib-static-targets.cmake")
set(SomeLib_shared_targets "${CMAKE_CURRENT_LIST_DIR}/SomeLib-shared-targets.cmake")

macro(SomeLib_load_targets type)
    if (NOT EXISTS "${SomeLib_${type}_targets}")
        set(${CMAKE_FIND_PACKAGE_NAME}_NOT_FOUND_MESSAGE
            "SomeLib `${type}` libraries were requested but not found.")
        set(${CMAKE_FIND_PACKAGE_NAME}_FOUND FALSE)
        return()
    endif ()
    include("${SomeLib_${type}_targets}")
endmacro()

if (SomeLib_comp_static)
    SomeLib_load_targets(static)
elseif (SomeLib_comp_shared)
    SomeLib_load_targets(shared)
elseif (DEFINED SomeLib_SHARED_LIBS AND SomeLib_SHARED_LIBS)
    SomeLib_load_targets(shared)
elseif (DEFINED SomeLib_SHARED_LIBS AND NOT SomeLib_SHARED_LIBS)
    SomeLib_load_targets(static)
elseif (BUILD_SHARED_LIBS)
    if (EXISTS "${SomeLib_shared_targets}")
        SomeLib_load_targets(shared)
    else ()
        SomeLib_load_targets(static)
    endif ()
else ()
    if (EXISTS "${SomeLib_static_targets}")
        SomeLib_load_targets(static)
    else ()
        SomeLib_load_targets(shared)
    endif ()
endif ()

There are a few confusing things going on here. First, CMake's package search is case-insensitive, so we need to look at ${CMAKE_FIND_PACKAGE_NAME} to know the exact name the user requested and therefore what CMake named the input variables to the package file. If only this were normalized to upper-case, we could write SOMELIB_FIND_COMPONENTS instead of the ugly mess we have, but alas.

Still, what's actually happening is rather simple. The script checks the components to see if the user requested either static or shared. If both were, the package fails and sets an informative error message. If just one was, it tries to load the corresponding targets file. If the user supplies an invalid component, it fails, too. Otherwise, it checks SomeLib_SHARED_LIBS and BUILD_SHARED_LIBS in turn and defaults to static if nothing is set, which matches common practice.

The package components and SomeLib_SHARED_LIBS variable are considered binding if set, so the package will fail to be found if the installation does not contain the requested libraries. However, if only BUILD_SHARED_LIBS is set (or nothing is set) and only one of the static or shared configuration is installed, we still load the available library to match existing CMake practices. If BUILD_SHARED_LIBS is OFF (or not set) and only the shared libraries are available, then the shared libraries will be loaded.

Building the Project

Whew. After all that, you'll be happy to know that actually building this requires nothing special. Here you go (from the source directory):

$ cmake -G Ninja -S . -B build-shared -DBUILD_SHARED_LIBS=YES -DCMAKE_BUILD_TYPE=Release
$ cmake -G Ninja -S . -B build-static -DBUILD_SHARED_LIBS=NO  -DCMAKE_BUILD_TYPE=Release
$ cmake --build build-shared
$ cmake --build build-static
$ cmake --install build-shared --prefix _install
$ cmake --install build-static --prefix _install

None of this should be surprising. We build and install both library types in Release mode to a common prefix. On Windows, we need to be careful that the static library .lib file does not conflict with the shared library's .lib import library. We can work around this by adding -DCMAKE_RELEASE_POSTFIX=_static to the configure step for the static library. That way we'll get SomeLib_static.lib from the static build and the usual SomeLib.dll plus SomeLib.lib combination from the shared build.

Now we can write a little program that calls it. Here's the test:

// main.cpp
#include <iostream>
#include <somelib/random.h>

int main() {
    std::cout << "My very random number is: " << SomeLib::getRandomNumber() << "\n";
    return 0;
}

Here's the CMakeLists.txt:

cmake_minimum_required(VERSION 3.19)
project(example)

enable_testing()

find_package(SomeLib 1 REQUIRED)

add_executable(main main.cpp)
target_link_libraries(main PRIVATE SomeLib::SomeLib)

add_test(NAME random_is_42 COMMAND main)
set_tests_properties(random_is_42 PROPERTIES
                     PASS_REGULAR_EXPRESSION "is: 42"
                     ENVIRONMENT "PATH=$<TARGET_FILE_DIR:SomeLib::SomeLib>")

It also includes a little test to make sure that our very random number was indeed returned. We can build it several ways and verify with ldd (on Linux, at least) that it was linked correctly.

$ cmake -G Ninja -S . -B build -DCMAKE_PREFIX_PATH=/path/to/_install
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /path/to/build
$ cmake --build build
[1/2] /usr/bin/c++ -DSOMELIB_STATIC_DEFINE -isystem /path/to/_install/include  ↩
  -MD -MT CMakeFiles/main.dir/main.cpp.o -MF CMakeFiles/main.dir/main.cpp.o.d  ↩
  -o CMakeFiles/main.dir/main.cpp.o -c ../main.cpp
[2/2] : && /usr/bin/c++   CMakeFiles/main.dir/main.cpp.o -o main               ↩
  /path/to/_install/lib/libSomeLib.a && :
$ ./build/main
My very random number is: 42
$ ldd build/main | grep SomeLib
$ cmake -B build -DBUILD_SHARED_LIBS=YES
-- Configuring done
-- Generating done
-- Build files have been written to: /path/to/build
$ cmake --build build
[1/2] /usr/bin/c++  -isystem /path/to/_install/include  -MD -MT                ↩
  CMakeFiles/main.dir/main.cpp.o -MF CMakeFiles/main.dir/main.cpp.o.d -o       ↩
  CMakeFiles/main.dir/main.cpp.o -c ../main.cpp
[2/2] : && /usr/bin/c++   CMakeFiles/main.dir/main.cpp.o -o main               ↩
  -Wl,-rpath,/path/to/_install/lib  /path/to/_install/lib/libSomeLib.so.1.0.0  ↩
  && :
$ ./build/main
My very random number is: 42
$ ldd build/main | grep SomeLib
        libSomeLib.so.1 => /path/to/libSomeLib.so.1 (0x00007f41880ae000)

The associated GitHub repo has a simple GitHub Actions workflow to test the package.

Conclusion

There's a lot that's awkward about CMake, and it's definitely on display here. Even so, the actual solution itself is simple, even if the implementation has some warts. Most importantly, the complexity is all placed on the library author, not on the library user. A lot of this can be set up and forgotten, and the little pain now is well worth sparing all the downstream users, support staff, StackOverflow volunteers, and so on a far greater amount of pain.

Perceus: Garbage Free Reference Counting with Reuse

2021-01-25T12:00:00-08:00

PLDI 2021 Distinguished Paper

Link to paper

We introduce Perceus, an algorithm for precise reference counting with reuse and specialization. Starting from a functional core language with explicit control-flow, Perceus emits precise reference counting instructions such that programs are garbage-free, where only live references are retained. This enables further optimizations, like reuse analysis that allows for guaranteed in-place updates at runtime. This in turn enables a novel programming paradigm that we call functional but in-place (FBIP). Much like tail-call optimization enables writing loops with regular function calls, reuse analysis enables writing in-place mutating algorithms in a purely functional way. We give a novel formalization of reference counting in a linear resource calculus, and prove that Perceus is sound and garbage free. We show evidence that Perceus, as implemented in Koka, has good performance and is competitive with other state-of-the-art memory collectors.

Formal Semantics for the Halide Language

2020-05-01T12:00:00-07:00

Pre-print on arXiv

We present the first formalization and metatheory of language soundness for a user-schedulable language, the widely used array processing language Halide. User-schedulable languages strike a balance between abstraction and control in high-performance computing by separating the specification of what a program should compute from a schedule for how to compute it. In the process, they make a novel language soundness claim: the result of a program should always be the same, regardless of how it is scheduled. This soundness guarantee is tricky to provide in the presence of schedules that introduce redundant recomputation and computation on uninitialized data, rather than simply reordering statements. In addition, Halide ensures memory safety through a compile-time bounds inference engine that determines safe sizes for every buffer and loop in the generated code, presenting a novel challenge: formalizing and analyzing a language specification that depends on the results of unreliable program synthesis algorithms. Our formalization has revealed flaws and led to improvements in the practical Halide system, and we believe it provides a foundation for the design of new languages and tools that apply programmer-controlled scheduling to other domains.

A Type-Directed Approach to Program Repair

2015-07-16T02:48:26-07:00

Published at CAV 2015.

Link to paper

Developing enterprise software often requires composing several libraries together with a large body of in-house code. Large APIs introduce a steep learning curve for new developers as a result of their complex object-oriented underpinnings. While the written code in general reflects a programmer’s intent, due to evolutions in an API, code can often become ill-typed, yet still syntactically-correct. Such code fragments will no longer compile, and will need to be updated. We describe an algorithm that automatically repairs such errors, and discuss its application to common problems in software engineering.