<h1>I Finished my Ph.D.!</h1>
<p>Alex Reinking (<a href="https://alexreinking.com/">alexreinking.com</a>), 2022-12-22</p>
<p>It took six and a half years, but I'm happy to announce that I finally got my
Ph.D. in Computer Science. Hooray! <del>As I write this, I'm starting a
short-term post-doc at MIT to wrap up a few research projects, but I'm actively
applying to jobs. If you have a role that aligns with my skill set, please let
me know!</del></p>
<blockquote>
<p><strong>Update:</strong> as of Feb 21, 2023, I am now working at Qualcomm AI Research!</p>
</blockquote>
<p>My thesis is titled <em>The Design and Implementation of User-Schedulable
Languages</em>. A copy is available from <a href="https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-271.html">UC Berkeley's website</a>, but I
will update it here if the need for errata should ever arise.</p>
<p>I'm happy to have <em>substantially</em> more time than I did during my thesis crunch.
If you've been following my work at all, please reach out! My contact info is in
the sidebar. I'd love to talk about my research, potential collaborations,
software engineering, or whatever else.</p>
<p>I also plan to start blogging somewhat consistently; I would love to expand
the "practical considerations for DSL design in the C ecosystem" chapter of my
thesis, and this blog seems like a good place to do it. These plans include my
somewhat popular "CMake Without the Agonizing Pain" series.</p>
<p>Finally, I want to take a moment to thank the many people who helped me
throughout grad school: first of all, my advisor <a href="https://people.csail.mit.edu/jrk">Jonathan Ragan-Kelley</a>.
I'm honored to be your first graduating student! I'd also like to thank my
amazing colleagues and
collaborators: <a href="http://www.gilbertbernstein.com">Gilbert Bernstein</a>, <a href="http://people.csail.mit.edu/yuka">Yuka Ikarashi</a>,
<a href="https://hngenc.github.io">Hasan Genc</a>, <a href="https://www.microsoft.com/en-us/research/people/daan">Daan Leijen</a>, <a href="https://xnning.github.io">Ningning Xie</a>, <a href="https://www.microsoft.com/en-us/research/people/leonardo">Leonardo de
Moura</a>, <a href="https://dougalmaclaurin.com">Dougal Maclaurin</a>, <a href="https://github.com/apaszke">Adam Paszke</a>, <a href="https://www.linkedin.com/in/alexey-radul-715a581">Alexey Radul</a>,
<a href="https://ankushdesai.github.io">Ankush Desai</a>, <a href="https://www.linkedin.com/in/shaz-qadeer-88b3332">Shaz Qadeer</a>, and <a href="https://www.cs.yale.edu/homes/piskac/">Ruzica Piskac</a>.</p>
<p>I especially want to thank the <a href="https://halide-lang.org">Halide</a> team, most
notably <a href="https://github.com/steven-johnson">Steven Johnson</a>, <a href="https://andrew.adams.pub">Andrew Adams</a>, <a href="https://people.csail.mit.edu/skamil">Shoaib Kamil</a>,
and <a href="https://www.linkedin.com/in/zalman-stern-0065471">Zalman Stern</a>, for entrusting me with so much of the project.</p>
<p>And of course, a big thanks to all my friends and family in Berkeley,
Minneapolis, Boston, and beyond who kept me sane all these years. You know who
you are and that you mean the world to me 🙂.</p>
<p>Until next time, happy holidays and a happy new year!</p>
<h1>Exocompilation for Productive Programming of Hardware Accelerators</h1>
<p>Alex Reinking, 2022-06-09</p>
<p>Published at PLDI 2022.</p>
<p><a href="https://dl.acm.org/doi/abs/10.1145/3519939.3523446">Link to paper</a></p>
<p>High-performance kernel libraries are critical to exploiting accelerators and
specialized instructions in many applications. Because compilers are difficult
to extend to support diverse and rapidly-evolving hardware targets, and
automatic optimization is often insufficient to guarantee state-of-the-art
performance, these libraries are commonly still coded and optimized by hand, at
great expense, in low-level C and assembly. To better support development of
high-performance libraries for specialized hardware, we propose a new
programming language, Exo, based on the principle of <em>exocompilation</em>:
externalizing target-specific code generation support and optimization policies
to user-level code. Exo allows custom hardware instructions, specialized
memories, and accelerator configuration state to be defined in user libraries.
It builds on the idea of user scheduling to externalize hardware mapping and
optimization decisions. Schedules are defined as composable rewrites within the
language, and we develop a set of effect analyses which guarantee program
equivalence and memory safety through these transformations. We show that Exo
enables rapid development of state-of-the-art matrix-matrix multiply and
convolutional neural network kernels, for both an embedded neural accelerator
and x86 with AVX-512 extensions, in a few dozen lines of code each.</p>
<h1>How to Use CMake Without the Agonizing Pain - Part 2</h1>
<p>Alex Reinking, 2021-05-31</p>
<p>Welcome back to Part 2 of this series! I was very happy to see the warm
reception <a href="/blog/how-to-use-cmake-without-the-agonizing-pain-part-1.html">Part 1</a> got over on <a href="https://www.reddit.com/r/cpp/comments/nitvir/how_to_use_cmake_without_the_agonizing_pain_part_1/">/r/cpp</a>. Before we get started, I
thought I would take this opportunity to clarify a couple of points about this
series.</p>
<p>First, <strong>this series is not a tutorial</strong>, at least not in the traditional sense.
My hope with this project is to show you <em>how to reason</em> about CMake so that it
feels intuitive. I want readers to see the big picture and to develop a taste
for quality build code. Still, some space will be dedicated to exploring
specific effective practices and pointing out common mistakes, superseded
features, etc., but always with an eye towards understanding <em>why</em>.</p>
<p>Second, while I complained about the ocean of bad CMake resources, I forgot to
recognize the handful of good resources that have taught me well. <strong>I have added
a list of these resources</strong> to the end of <a href="/blog/how-to-use-cmake-without-the-agonizing-pain-part-1.html">Part 1</a>.</p>
<p>Today, I'd like to talk about <strong>what you should expect</strong> from a CMake build, and
some common pitfalls that violate these expectations. Not every CMake project
you encounter will meet these criteria. I would encourage you to begin a
friendly dialogue with the maintainers of non-conforming projects to see if they
can be fixed (and, in the spirit of open source, try opening a PR!).</p>
<h2>Expect vanilla builds to work</h2>
<p>I'm going to make a bold claim, here: it should be possible to build <em>any</em> CMake
project using <em>any</em> generator with the following sequence of commands, assuming
all its dependencies are installed to system locations:</p>
<div class="highlight"><pre><span></span><code><span class="gp"># </span>For<span class="w"> </span>a<span class="w"> </span>single-configuration<span class="w"> </span>generator:
<span class="gp">$ </span>cmake<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>build<span class="w"> </span>-DCMAKE_BUILD_TYPE<span class="o">=</span>Release
<span class="gp">$ </span>cmake<span class="w"> </span>--build<span class="w"> </span>build
<span class="gp">$ </span>cmake<span class="w"> </span>--install<span class="w"> </span>build<span class="w"> </span>--prefix<span class="w"> </span>/path/to/wherever
<span class="gp"># </span>For<span class="w"> </span>a<span class="w"> </span>multi-configuration<span class="w"> </span>generator:
<span class="gp">$ </span>cmake<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>build
<span class="gp">$ </span>cmake<span class="w"> </span>--build<span class="w"> </span>build<span class="w"> </span>--config<span class="w"> </span>Release
<span class="gp">$ </span>cmake<span class="w"> </span>--install<span class="w"> </span>build<span class="w"> </span>--config<span class="w"> </span>Release<span class="w"> </span>--prefix<span class="w"> </span>/path/to/wherever
</code></pre></div>
<p>Furthermore, if the code is standards-compliant and platform-independent, this
sequence should work with <em>any</em> compiler on <em>any</em> operating system.</p>
<h3>Pitfall: unnecessary flags and settings</h3>
<p>Obviously, if you're building a Linux-only tool that depends on GNU extensions,
then you will need GCC or Clang. Unfortunately, many CMake builds assume too
much about the environment or toolchain and inject optional, compiler-specific
flags into their builds. Often, they provide no way to disable them. Such
projects might needlessly fail on a different compiler or even a different
version of the same compiler used by the author.</p>
<p>The most common example is adding <code>-Werror</code> unconditionally. The meaning
of <code>-Wall</code> changes across compiler versions, so while this code might work for
you today, it is at high risk of bit-rotting:</p>
<div class="highlight"><pre><span></span><code><span class="c"># BAD: don't do this!</span>
<span class="nf">target_compile_options(</span><span class="nb">target</span><span class="w"> </span><span class="no">PRIVATE</span><span class="w"> </span><span class="p">-</span><span class="nb">Wall</span><span class="w"> </span><span class="p">-</span><span class="nb">Werror</span><span class="nf">)</span>
</code></pre></div>
<p>For a subtler example, both GCC and Clang provide warning flags for missing uses
of the C++11 <code>override</code> keyword. On GCC 5.1 and newer, it's <code>-Wsuggest-override</code>
and on Clang 10 and below the check is split between two
flags: <code>-Winconsistent-missing-destructor-override</code> and
<code>-Winconsistent-missing-override</code>. Providing a Clang-only flag to GCC will throw
an error, and providing the GCC-only flag to Clang will produce a warning that
may be upgraded to an error if <code>-Werror</code> is also specified. Thus, if you naively
write</p>
<div class="highlight"><pre><span></span><code><span class="c"># BAD: don't do this!</span>
<span class="nf">target_compile_options(</span><span class="nb">target</span><span class="w"> </span><span class="no">PRIVATE</span><span class="w"> </span><span class="p">-</span><span class="nb">Winconsistent-missing-override</span><span class="nf">)</span>
</code></pre></div>
<p>then your build will break with GCC! If you add <code>-Wsuggest-override</code> like this,
then your build will break with <code>-Werror</code> on Clang 10! Ask yourself: <strong>do you
really want to track warning flag compatibility across compiler vendors and
versions? Is that a good use of your time?</strong></p>
<p>I'm here to tell you that <strong>you don't want to</strong>, and that <strong>it's a waste of
time</strong>. You can save yourself a lot of hassle if you <strong>only include firm build
requirements in the CMakeLists.txt</strong>. Your code <em>will</em> build without any
warnings enabled, so they don't belong there. In the past, you would have needed
to create a toolchain file or at least guard these settings with the appropriate
checks and <code>option()</code> to disable them. However, since CMake 3.19, you can add
these to a <em><a href="https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html">preset</a></em>. Create a file named <code>CMakePresets.json</code> next to
your <code>CMakeLists.txt</code> with these contents:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"version"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"cmakeMinimumRequired"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"major"</span><span class="p">:</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"minor"</span><span class="p">:</span><span class="w"> </span><span class="mi">19</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"patch"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"configurePresets"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gcc"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"displayName"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GCC"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Default build options for GCC"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"generator"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Ninja"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"binaryDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"${sourceDir}/build"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"cacheVariables"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"CMAKE_CXX_FLAGS"</span><span class="p">:</span><span class="w"> </span><span class="s2">"-Wsuggest-override"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"clang"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"displayName"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Clang"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Default build options for Clang"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"generator"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Ninja"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"binaryDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">"${sourceDir}/build"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"cacheVariables"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"CMAKE_CXX_FLAGS"</span><span class="p">:</span><span class="w"> </span><span class="s2">"-Winconsistent-missing-override -Winconsistent-missing-destructor-override"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>Then someone (an end-user, CI, <em>you</em>) can use your preset like so:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>cmake<span class="w"> </span>--preset<span class="o">=</span>gcc<span class="w"> </span>-DCMAKE_BUILD_TYPE<span class="o">=</span>Release
<span class="gp">$ </span>cmake<span class="w"> </span>--build<span class="w"> </span>build
</code></pre></div>
<p>Presets will <em>fundamentally change</em> the way people work with CMake and share
their optional (but desired) build settings with users. They also significantly
reduce the risk of your build breaking with a different compiler or version.
<strong>Remember:</strong> it is much easier to write a correct build by keeping your
CMakeLists.txt <em>minimal</em> and writing an opt-in preset than by checking all the
relevant factors (compiler vendor, version, active language, etc.) before adding
a flag.</p>
<p>To really drive this point home, this code safely adds <code>-Wsuggest-override</code>. It
should burn your eyeballs:</p>
<div class="highlight"><pre><span></span><code><span class="c"># My eyes! The goggles do nothing!</span>
<span class="nf">option(</span><span class="nb">MyProj_ENABLE_WARNINGS</span><span class="w"> </span><span class="s">"Compile MyProj with warnings used by upstream"</span><span class="w"> </span><span class="no">OFF</span><span class="nf">)</span>
<span class="nf">if</span> <span class="nf">(</span><span class="nb">MyProj_ENABLE_WARNINGS</span><span class="nf">)</span>
<span class="w"> </span><span class="c"># keep line width low</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">is_clang</span><span class="w"> </span><span class="s">"$<COMPILE_LANG_AND_ID:CXX,Clang>"</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">is_gcc</span><span class="w"> </span><span class="s">"$<COMPILE_LANG_AND_ID:CXX,GNU>"</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">ver</span><span class="w"> </span><span class="s">"$<CXX_COMPILER_VERSION>"</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">target_compile_options(</span>
<span class="w"> </span><span class="nb">target</span>
<span class="w"> </span><span class="no">PRIVATE</span>
<span class="w"> </span><span class="s">"$<$<AND:${is_clang},$<VERSION_GREATER_EQUAL:${ver},11>>:-Wsuggest-override>"</span>
<span class="w"> </span><span class="s">"$<$<AND:${is_gcc},$<VERSION_GREATER_EQUAL:${ver},5>>:-Wsuggest-override>"</span>
<span class="w"> </span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>This sort of thing does not scale. If a preset doesn't work for an end-user,
they can override it piecemeal at the command line. On the other hand, incorrect
CMake code <em>inflicts an error</em> with no recourse but to patch your build.</p>
<p>Don't forget that other people besides your core development team will use your
build. Package maintainers, consumers of your library (if applicable), and power
users looking to be on the cutting edge will all want to build your package with
a slightly different set of flags, compilers, versions, and operating systems.
The path of least resistance (using presets) both makes your CMakeLists.txt easy
to maintain <em>for you</em> and easy to consume for <em>all</em> your users.</p>
<h3>Pitfall: bad dependency management</h3>
<p>If you use only well-behaved CMake packages with <code>find_package</code>, this will
largely take care of itself. Unfortunately, many CMake packages are not
well-behaved. To keep this article focused, strategies for wrangling bad CMake
(and non-CMake) dependencies will be covered in Part 3.</p>
<h2>Expect incremental builds to work</h2>
<p>Some particularly pathological projects require you to run CMake <em>twice</em> up
front in order to get a correct build. This should never be the case and is
covered by the one-configure build recipe above. It's also fairly uncommon.</p>
<p>However, disappointingly many projects require you to manually re-run CMake
before any incremental build. The whole point of CMake is to generate faithful
implementations of the abstract build model. One configure step ought to be all
you need. After the first run, the build tool (e.g. <code>make</code>) should know when it
needs to re-run CMake.</p>
<p>The technical term here is <em>"<a href="https://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning">idempotence</a>"</em>: running the CMake configure step
twice with the same inputs should be no different from running it once. Any
other behavior is unfriendly to developers and should be considered a bug with
the project. <em>(Note: Xcode has some architectural limitations that make this
impossible; see <a href="https://discourse.cmake.org/t/documented-criteria-for-build-correctness/3087/2">this Discourse discussion</a> for more details)</em></p>
<h3>Pitfall: terrifying cache behavior</h3>
<p>There are several ways you can unintentionally break idempotence. If you use
<code>set(CACHE)</code>, there's a good chance your build is broken. Here's an example from
a <a href="https://gitlab.kitware.com/cmake/cmake/-/issues/22038">bug report</a> I filed recently. If you were wondering what the "agonizing pain"
I've been talking about is, look no further. This is the kind of thing nobody
should ever need to know in the first place. Suppose you have the following:</p>
<div class="highlight"><pre><span></span><code><span class="nf">cmake_minimum_required(</span><span class="no">VERSION</span><span class="w"> </span><span class="m">3.20</span><span class="nf">)</span>
<span class="nf">project(</span><span class="nb">test</span><span class="w"> </span><span class="no">LANGUAGES</span><span class="w"> </span><span class="no">NONE</span><span class="nf">)</span>
<span class="nf">set(</span><span class="nb">var</span><span class="w"> </span><span class="m">1</span><span class="nf">)</span>
<span class="nf">set(</span><span class="nb">var</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="no">CACHE</span><span class="w"> </span><span class="no">STRING</span><span class="w"> </span><span class="s">""</span><span class="nf">)</span>
<span class="nf">message(</span><span class="no">STATUS</span><span class="w"> </span><span class="s">"var = ${var}"</span><span class="nf">)</span>
</code></pre></div>
<p>What does it print? Let's see:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>cmake<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>build
<span class="go">-- var = 2</span>
</code></pre></div>
<p>What happened here? Really take a minute to think about what the underlying rule
could be. Now let's try running the same command again, without changing
absolutely anything:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>cmake<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>build
<span class="go">-- var = 1</span>
</code></pre></div>
<p>Whatever you thought the rule was, I bet you did not expect this. Why is it <code>1</code>,
now?</p>
<p>I'll fill you in: when CMake runs, it loads the cache into a special, global
scope. When <code>set(CACHE)</code> runs, it checks to see if there is already an entry in
the cache. If not, then it creates one and deletes the normal variable binding
to expose the newly cached value. Otherwise, it won't do anything at all
(unless <code>FORCE</code> is specified). Don't ask me how it works if there are multiple
variables of the same name in nested directory or function scopes. I'm not sure
I even want to know.</p>
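<p>To watch both bindings at once, here is a minimal sketch that extends the
example above. It assumes the <code>$CACHE{var}</code> syntax (available since CMake
3.13), which reads the cache entry directly and ignores any normal variable. The
project name <code>cache_demo</code> is made up for illustration:</p>
<div class="highlight"><pre><span></span><code>cmake_minimum_required(VERSION 3.20)
project(cache_demo LANGUAGES NONE)

set(var 1)                      # normal variable in the directory scope
set(var 2 CACHE STRING "demo")  # cache entry; skipped if one already exists

message(STATUS "plain lookup: ${var}")       # the normal binding wins if present
message(STATUS "cache lookup: $CACHE{var}")  # always reads the cache entry
</code></pre></div>
<p>If the rule described above holds, a fresh build directory should report <code>2</code>
for both lookups, while a second configure should report <code>1</code> for the plain
lookup and still <code>2</code> for the cache lookup.</p>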
<p>Now let's try to set the cache variable at the command line:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>cmake<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>build<span class="w"> </span>-Dvar<span class="o">=</span><span class="m">3</span>
<span class="go">-- var = 3</span>
</code></pre></div>
<p>What happened here?! Neither value mattered! The normal variable won before, but
now <code>set(CACHE)</code> overwrote it? Why? Do command-line variables have their own,
special, innermost scope? Are they immutable?</p>
<p>Well, here's the answer: setting a cache variable at the command line with no
type <em>deletes</em> the type that was already established (what?), and
so <code>set(CACHE)</code> will add the type when it runs (ok...), and when this happens it
will also delete the normal binding as if the variable did not exist at all
(what?!), and that isn't even documented behavior (WHAT?!). If you
use <code>-Dvar:STRING=3</code> instead, then it will print <code>1</code>.</p>
<p>Here's what <a href="https://cmake.org/cmake/help/v3.20/command/set.html">the docs</a> <em>do</em> have to say about this:</p>
<blockquote>
<p><strong>If the cache entry does not exist prior to the call</strong> or <strong>the <code>FORCE</code>
option is given</strong> then the cache entry will be set to the given value.
Furthermore, any normal variable binding in the current scope will be removed
to expose the newly cached value to any immediately following evaluation.</p>
<p>It is possible for the cache entry to exist prior to the call but have no type
set if it was created on the cmake(1) command line by a user through the
<code>-D<var>=<value></code> option without specifying a type. In this case the set
command will add the type.</p>
</blockquote>
<p>Nowhere in that last sentence does it say it will delete the normal variable
binding when the type is not set. This whole behavior is downright byzantine.</p>
<p>Thankfully, the devs have implemented a policy fix that will ship in CMake 3.21!
With <a href="https://cmake.org/cmake/help/git-master/policy/CMP0126.html"><code>CMP0126</code></a> enabled, the <code>set</code> command will not touch normal
variables, meaning that they always "win". This is how <code>option()</code> works and is
yet another reason to use the newest CMake. Until then, I believe the best
practice is to only cache an existing normal variable, guarded by a check if it
already exists:</p>
<div class="highlight"><pre><span></span><code><span class="nf">if</span> <span class="nf">(</span><span class="no">NOT</span><span class="w"> </span><span class="no">DEFINED</span><span class="w"> </span><span class="nb">var</span><span class="nf">)</span>
<span class="w"> </span><span class="c"># compute default value for var</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">var</span><span class="w"> </span><span class="s">"${var}"</span><span class="w"> </span><span class="no">CACHE</span><span class="w"> </span><span class="nv"><TYPE></span><span class="w"> </span><span class="s">"doc string"</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>This ensures that the value of <code>var</code> is consistent no matter the state of the
cache. As of CMake 3.21 (with <code>CMP0126</code> enabled), you may safely set the
cache variable directly and to any default value.</p>
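<p>As a sketch of what opting in might look like, assuming <code>CMP0126</code> behaves
as its documentation describes, a project that still supports older CMake
versions can enable the policy only where it exists. The project name
<code>policy_demo</code> is hypothetical:</p>
<div class="highlight"><pre><span></span><code>cmake_minimum_required(VERSION 3.20)
project(policy_demo LANGUAGES NONE)

# Opt in to the new behavior only when this CMake knows the policy.
if (POLICY CMP0126)
  cmake_policy(SET CMP0126 NEW)
endif ()

set(var 1)
set(var 2 CACHE STRING "demo")  # under NEW, the normal variable is left alone
message(STATUS "var = ${var}")
</code></pre></div>
<p>With the policy set to <code>NEW</code>, this should print <code>var = 1</code> on every
configure. On a CMake that predates the policy, the <code>if</code> block is skipped and
the old behavior remains, so the guard shown earlier is still the safer choice
there.</p>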
<h3>Pitfall: configure-step dependencies</h3>
<p>Another common cause of build issues is to fail to declare a dependency for the
configure-step. If your project makes heavy use of <code>execute_process</code>
or otherwise reads and writes files during the configure step, those files
should be added to the <code>CMAKE_CONFIGURE_DEPENDS</code> directory property, like so:</p>
<div class="highlight"><pre><span></span><code><span class="c"># both `.` and `file` are relative to current source directory</span>
<span class="nf">set_property(</span><span class="no">DIRECTORY</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="no">APPEND</span><span class="w"> </span><span class="no">PROPERTY</span><span class="w"> </span><span class="no">CMAKE_CONFIGURE_DEPENDS</span><span class="w"> </span><span class="s">"file"</span><span class="nf">)</span>
</code></pre></div>
<p>This will cause the generated build to check those files and re-run CMake if
they have changed. Some commands, like <code>configure_file</code>, are smart enough to
update this property automatically. Others, like <code>file(COPY)</code>, are not;
prefer <code>configure_file</code> over other "equivalent" commands when you can. Check
the documentation (or better yet, write a test case) if you are ever unsure.</p>
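<p>For instance, here is a minimal sketch of a project that reads a version
string from a file at configure time. The file name <code>version.txt</code>, the
project name, and the variable names are all hypothetical, and to my knowledge
<code>file(READ)</code> does not register the dependency on its own:</p>
<div class="highlight"><pre><span></span><code>cmake_minimum_required(VERSION 3.20)
project(configure_deps_demo LANGUAGES NONE)

# Hypothetical input file that lives next to this CMakeLists.txt.
set(version_file "${CMAKE_CURRENT_SOURCE_DIR}/version.txt")

# Read the file during the configure step and trim the trailing newline.
file(READ "${version_file}" demo_version)
string(STRIP "${demo_version}" demo_version)

# Ask the generated build to re-run CMake whenever version.txt changes.
set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${version_file}")

message(STATUS "demo_version = ${demo_version}")
</code></pre></div>
<p>After the first configure, editing <code>version.txt</code> and running only
<code>cmake --build build</code> should be enough to pick up the new value.</p>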
<h3>Pitfall: file globbing</h3>
<p>This same problem also affects globbing for source files in CMake:</p>
<div class="highlight"><pre><span></span><code><span class="c"># WARNING: this code breaks idempotence</span>
<span class="nf">file(</span><span class="no">GLOB</span><span class="w"> </span><span class="nb">sources</span><span class="w"> </span><span class="s">"*.cpp"</span><span class="nf">)</span>
<span class="nf">add_executable(</span><span class="nb">my_app</span><span class="w"> </span><span class="o">${</span><span class="nt">sources</span><span class="o">}</span><span class="nf">)</span>
</code></pre></div>
<p>If you have this code, then adding a new <code>.cpp</code> file to the directory will not
trigger a re-configure in an incremental build. As we discussed above, this is
bad behavior because it forces a developer to re-run CMake as opposed to just
the build tool.</p>
<p>One solution is to use <code>CONFIGURE_DEPENDS</code>, which will cause the generated build
to re-evaluate the globs and re-configure if anything changes. This code
correctly sets dependencies.</p>
<div class="highlight"><pre><span></span><code><span class="c"># This code is fine, but with caveats.</span>
<span class="nf">file(</span><span class="no">GLOB</span><span class="w"> </span><span class="nb">sources</span><span class="w"> </span><span class="no">CONFIGURE_DEPENDS</span><span class="w"> </span><span class="s">"*.cpp"</span><span class="nf">)</span>
<span class="nf">add_executable(</span><span class="nb">my_app</span><span class="w"> </span><span class="o">${</span><span class="nt">sources</span><span class="o">}</span><span class="nf">)</span>
</code></pre></div>
<p>However, the developers do not promise that it will work on every generator.
Here's what the <a href="https://cmake.org/cmake/help/latest/command/file.html#filesystem">documentation</a> says:</p>
<blockquote>
<p><strong>Note:</strong> We do not recommend using GLOB to collect a list of source files
from your source tree. If no CMakeLists.txt file changes when a source is
added or removed then the generated build system cannot know when to ask CMake
to regenerate. The <code>CONFIGURE_DEPENDS</code> flag may not work reliably on all
generators, or if a new generator is added in the future that cannot support
it, projects using it will be stuck. Even if <code>CONFIGURE_DEPENDS</code> works
reliably, there is still a cost to perform the check on every rebuild.</p>
</blockquote>
<p>This is not a theoretical concern: the immensely popular Ninja generator has a
bug in versions up to 1.10.2 (which at the time of writing is the newest one).
<a href="https://github.com/ninja-build/ninja/issues/1724#issuecomment-677730694">Here is a link</a> to a GitHub issue about this.</p>
<p>I understand this is controversial, but given that the CMake maintainers are so
explicit about not globbing, I strongly believe the best thing to do is to
<strong>list source files explicitly</strong>. In general, it is a good idea to avoid doing
things that are explicitly unsupported because when you run into problems, the
maintainers will simply tell you to fix your code.</p>
<p>Besides, manually listing source files is typically only annoying at the start
of a new project, when the code structure is much more fluid. In the steady
state, file lists change only occasionally, and the pain of updating a file list
is not very great. You can (and should) typically split up your file lists using
<code>target_sources</code> and <code>add_subdirectory</code>. That way no one <code>CMakeLists.txt</code> gets
too long.</p>
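<p>As a rough sketch of that layout (the <code>src/net</code> directory and all file names
here are hypothetical, and this assumes CMake 3.13 or newer, where relative
paths passed to <code>target_sources</code> are resolved against the directory that
calls it):</p>
<div class="highlight"><pre><span></span><code># CMakeLists.txt (top level)
cmake_minimum_required(VERSION 3.20)
project(my_app LANGUAGES CXX)

# List sources explicitly instead of globbing.
add_executable(my_app main.cpp app.cpp)

# Each subdirectory appends its own files to the target.
add_subdirectory(src/net)

# src/net/CMakeLists.txt (a separate, hypothetical file)
target_sources(my_app PRIVATE socket.cpp http_client.cpp)
</code></pre></div>
<p>Adding a file then means editing the nearest <code>CMakeLists.txt</code>, which is
exactly the kind of change the generated build system knows how to react to.</p>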
<h4>Update: a note on performance</h4>
<p>An earlier version of this article repeated the old saw that globs are slow. In
response to the discussion on Reddit, I ran some tests myself and got a mixed
bag. Here's a table of my results:</p>
<table>
<thead>
<tr>
<th>Disk</th>
<th>Filesystem</th>
<th>OS</th>
<th>Generator</th>
<th>N</th>
<th>Time (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>ext4 (WSL)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>1000</td>
<td>0.0069</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>ext4</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>1000</td>
<td>0.0162</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>Ninja</td>
<td>1000</td>
<td>0.0364</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>ext4 (WSL)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>10000</td>
<td>0.0481</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>ext4</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>10000</td>
<td>0.0594</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>VS 2019</td>
<td>1000</td>
<td>0.0731</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>Ninja</td>
<td>1000</td>
<td>0.0832</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>VS 2019</td>
<td>1000</td>
<td>0.1012</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS (3g)</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>1000</td>
<td>0.1146</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS (3g)</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>1000</td>
<td>0.1170</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS (9p)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>100</td>
<td>0.2062</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS (9p)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>100</td>
<td>0.2268</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>Ninja</td>
<td>10000</td>
<td>0.2743</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>ext4 (WSL)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>100000</td>
<td>0.3712</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>ext4</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>100000</td>
<td>0.4383</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>VS 2019</td>
<td>10000</td>
<td>0.4710</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>Ninja</td>
<td>10000</td>
<td>0.5616</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS</td>
<td>Windows 10</td>
<td>VS 2019</td>
<td>10000</td>
<td>0.8158</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS (3g)</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>10000</td>
<td>1.1119</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS (3g)</td>
<td>Ubuntu 20.04</td>
<td>Ninja</td>
<td>10000</td>
<td>1.4825</td>
</tr>
<tr>
<td>SanDisk SDSSDHII</td>
<td>NTFS (9p)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>1000</td>
<td>1.9585</td>
</tr>
<tr>
<td>Samsung SSD 970 EVO</td>
<td>NTFS (9p)</td>
<td>Ubuntu 20.04 (WSL)</td>
<td>Ninja</td>
<td>1000</td>
<td>2.1879</td>
</tr>
</tbody>
</table>
<p>From my testing, it seems ext4 is a remarkably resilient filesystem. I think
there is no performance argument to be made against globbing on ext4. It's also
pretty clear that you should not use ntfs-3g, or especially the WSL2 NTFS 9p
FUSE drivers. Build on ext4 and copy the outputs to an NTFS volume if need be.
VS 2019 is slower than Ninja, but even at N=10000, it took under a second to
scan the sources, so this is likely not a problem in absolute terms.</p>
<p>For some strange reason, NTFS was slower on my NVMe drive than on my SATA drive.
I tested both drives with <code>winsat disk -drive X</code>, and it showed my NVMe drive is
significantly faster. Maybe there's some driver weirdness here since the fastest
result for N=1000 was (virtualized!) ext4 on that drive.</p>
<p>I have published the Python script I used for testing
this <a href="https://github.com/alexreinking/cmake-glob-performance/">here</a>. There's a
GitHub Actions workflow that runs the script on Windows, macOS, and Linux for
N=1000. I expected the virtualized disks on GitHub Actions to be slow, but they
were actually plenty fast, with results very similar to what I reported above.</p>
<p>I am curious to hear reports from readers and from the <a href="https://mesonbuild.com/FAQ.html#why-cant-i-specify-target-files-with-a-wildcard">Meson</a> and <a href="http://neugierig.org/software/blog/2020/05/ninja.html">Ninja</a>
developers to see if they have more data on why globs are too slow for their
systems.</p>
<h2>Expect standard CMake variables to be honored</h2>
<p>A great number of variables in CMake are designed to be set externally. Perhaps
the most famous of these is <code>CMAKE_CXX_FLAGS</code> and its configuration-specific
variants <code>CMAKE_CXX_FLAGS_DEBUG</code>, <code>CMAKE_CXX_FLAGS_RELEASE</code>, etc. <em>Do not touch
these variables!</em></p>
<p>As a baseline, do not touch <em>any</em> standard variables if they are already defined
when your build runs. Move your preferred defaults to presets or use the
techniques above to update the cache safely. On older CMake versions, they may
be set in a toolchain file as an alternative to presets. A full list of
variables may be found in <a href="https://cmake.org/cmake/help/latest/manual/cmake-variables.7.html">the documentation</a>, but most start
with <code>CMAKE_</code>. Notable exceptions include <code>BUILD_SHARED_LIBS</code>
and <code>&lt;PackageName&gt;_ROOT</code>.</p>
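<p>If a default really has to live in the <code>CMakeLists.txt</code>, one way to read the
"already defined" rule is the guard sketched below. The choice of
<code>CMAKE_CXX_VISIBILITY_PRESET</code> and the value <code>hidden</code> are only
illustrative assumptions; a preset is still the preferred home for such a
setting.</p>
<div class="highlight"><pre><span></span><code># Respect any value already provided on the command line, by a preset,
# by a toolchain file, or by a parent project.
# if (DEFINED ...) sees both normal and cache variables.
if (NOT DEFINED CMAKE_CXX_VISIBILITY_PRESET)
  # Fallback default; must run before the targets it should affect are created.
  set(CMAKE_CXX_VISIBILITY_PRESET hidden)
endif ()
</code></pre></div>
<p>Because the guard never overwrites an existing value, a user who passes
<code>-DCMAKE_CXX_VISIBILITY_PRESET=default</code> keeps full control.</p>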
<p>In many cases, there are better ways to set a build requirement than through
clobbering a reserved variable. For instance, if you want to set the C++ version
then you should use <em><a href="https://cmake.org/cmake/help/latest/prop_gbl/CMAKE_CXX_KNOWN_FEATURES.html">target features</a></em>, rather than
setting <code>CMAKE_CXX_STANDARD</code> or <em>(gasp!)</em> editing <code>CMAKE_CXX_FLAGS</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nf">target_compile_features(</span><span class="nb">my_exe</span><span class="w"> </span><span class="no">PRIVATE</span><span class="w"> </span><span class="nb">cxx_std_14</span><span class="nf">)</span>
<span class="nf">target_compile_features(</span><span class="nb">my_lib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="nb">cxx_std_17</span><span class="nf">)</span><span class="w"> </span><span class="c"># PUBLIC so that linkees use >= C++17</span>
</code></pre></div>
<p>Setting the standard requirement as a <code>PUBLIC</code> (really <code>INTERFACE</code>) property on
a library will propagate this to linkees even after exporting <code>my_lib</code> for use
in a <code>find_package</code> module. We'll talk more about packaging and being a good
dependency in a few weeks.</p>
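<p>Within a single build, the propagation looks roughly like this (the
<code>consumer</code> target and the source file names are hypothetical):</p>
<div class="highlight"><pre><span></span><code>add_library(my_lib my_lib.cpp)
# PUBLIC: applies to my_lib itself and to everything that links to it.
target_compile_features(my_lib PUBLIC cxx_std_17)

add_executable(consumer main.cpp)
# No standard is requested here; consumer inherits the C++17 minimum
# from my_lib through its interface compile features.
target_link_libraries(consumer PRIVATE my_lib)
</code></pre></div>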
<p>Some libraries (like <a href="https://abseil.io/docs/cpp/guides/options">abseil</a>) change their ABI depending on the
active standard version. If you have to do this, then you can encode the
requirement by checking <code>CMAKE_CXX_STANDARD</code> to pick the correct <code>cxx_std_N</code>
feature to act as a usage requirement:</p>
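<div class="highlight"><pre><span></span><code><span class="c"># C++14 or greater is required for my_lib. 98 is excluded explicitly</span>
<span class="c"># because it compares numerically greater than 14.</span>
<span class="nf">if</span> <span class="nf">(</span><span class="no">CMAKE_CXX_STANDARD</span><span class="w"> </span><span class="no">GREATER</span><span class="w"> </span><span class="m">14</span><span class="w"> </span><span class="no">AND</span><span class="w"> </span><span class="no">NOT</span><span class="w"> </span><span class="no">CMAKE_CXX_STANDARD</span><span class="w"> </span><span class="no">EQUAL</span><span class="w"> </span><span class="m">98</span><span class="nf">)</span>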
<span class="w"> </span><span class="nf">target_compile_features(</span><span class="nb">my_lib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="nb">cxx_std_</span><span class="o">${</span><span class="nt">CMAKE_CXX_STANDARD</span><span class="o">}</span><span class="nf">)</span>
<span class="nf">else</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">target_compile_features(</span><span class="nb">my_lib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="nb">cxx_std_14</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>Either way, your users can set a higher <code>CMAKE_CXX_STANDARD</code> value at the
command line. This empowers your users to ensure ABI compatibility when using
experimental support for draft C++ standards when building from source. If you
set <code>CMAKE_CXX_STANDARD</code> unconditionally, you take this control away from your
users.</p>
<h2>Conclusion</h2>
<p>This is what you should take away from this post:</p>
<ol>
<li>Your <code>CMakeLists.txt</code> file should be <em>minimal</em> and include only firm build
requirements; everything else should be opt-in (preferably in a preset).
Warning flags are not firm requirements.</li>
<li>The configure step of your build should never need to run twice in a row with
the same settings, and incremental builds should not require the user to
manually re-run CMake. This means using <code>CONFIGURE_DEPENDS</code> on globs or,
better yet, avoiding them.</li>
<li>Be careful when setting a cache variable, even without <code>FORCE</code>, as it might
remove a normal variable unpredictably. Before CMake 3.21 (unreleased), don't
<code>set(CACHE)</code> without confirming the variable does not exist.</li>
<li>Avoid touching standard CMake variables; prefer <a href="https://cmake.org/cmake/help/latest/manual/cmake-properties.7.html#properties-on-targets">target properties</a> or move
such settings to the presets (at least make your edits opt-in somehow). Stop
thinking in terms of <em>flags</em> and start thinking in terms of <em>goals</em>. It's
very common for novice (or even adept) CMake programmers to work themselves
into an <a href="https://en.wikipedia.org/wiki/XY_problem">XY problem</a> and try to shoehorn in a compiler-specific setting that
has already been abstracted.</li>
</ol>
<p>Next time, we'll talk about the target model and how to manage dependencies in
modern CMake. Until then, join the conversation <a href="https://www.reddit.com/r/cpp/comments/noyazt/how_to_use_cmake_without_the_agonizing_pain_part_2">here on Reddit</a>!</p>
<h1>How to Use CMake Without the Agonizing Pain - Part 1</h1>
<p><em>Alex Reinking, 2021-05-22</em></p>
<blockquote>
<p>When age fell upon the world, and wonder went out of the minds of men; when
grey cities reared to smoky skies tall towers grim and ugly, in whose shadow
none might dream of the sun or of spring's flowering meads; when learning
stripped earth of her mantle of beauty, and poets sang no more save of twisted
phantoms seen with bleared and inward-looking eyes; when these things had come
to pass, and childish hopes had gone away forever, there was a man who
travelled out of life on a quest into the spaces whither the world's dreams
had fled. <em>— H.P. Lovecraft</em></p>
</blockquote>
<p>I spent the better part of my off-hours last year rewriting <a href="https://github.com/halide/Halide">Halide</a>'s CMake
build.</p>
<p>I knew CMake had a polarizing reputation, but I needed to make Halide work
easily on Windows. The existing build didn't work right in CLion, it couldn't
find its dependencies (except on CI, somehow), and it didn't produce usable
packages. I figured I'd roll up my sleeves and get to work, and so I started
where anyone else would: by Googling "CMake tutorial".</p>
<p>I was nearly stricken blind.</p>
<p>There is <em>so</em> much bad information about CMake out there. It's pervasive. It's
high in the search results. Just about every StackOverflow answer is out of
date, wrong, or both. Heeding any of this advice will send you and your project
careening down a road to madness, paved into the earth by the sweat and tears of
those who have tried to port a project that hard-codes a library path.</p>
<p>If you don't want your builds to break, and your crops to die, you should learn
to use CMake properly. This is the first in a series of blog posts that will
attempt to teach you to use CMake effectively. My <a href="/blog/cmake-is-a-build-system.html">earlier post</a> about whether
CMake is a build system could be considered part 0 of this series.</p>
<p>So without further ado, let's talk about the most basic decision to make: what
version of CMake to use in the first place.</p>
<h2>Picking a CMake Version</h2>
<p>If you're writing an open source project, you most likely want to make your code
available to as many users as possible. So you might assume that you want to use
a very <em>old</em> CMake version to build your project. <strong>This is nonsense.</strong> Recent
versions of CMake are available absolutely everywhere. Your build's users are
technical: C++ developers, not laypeople. They can upgrade CMake if for some
reason they haven't yet. For every major platform, there are easy ways to get a
recent CMake version installed and kept up to date. Don't believe me? See the
table below.</p>
<table>
<thead>
<tr>
<th>OS</th>
<th>Arch</th>
<th>Source</th>
<th>Version</th>
<th>Update Process</th>
</tr>
</thead>
<tbody>
<tr>
<td>Windows 10</td>
<td>x86, amd64</td>
<td>Visual Studio 2019</td>
<td>3.19</td>
<td>Updated occasionally through VS installer</td>
</tr>
<tr>
<td>Windows 10</td>
<td>x86, amd64</td>
<td><a href="https://community.chocolatey.org/packages/cmake">Chocolatey</a></td>
<td>newest</td>
<td><code>choco upgrade</code></td>
</tr>
<tr>
<td>Windows 10</td>
<td>x86, amd64</td>
<td><a href="https://cmake.org/download/">Kitware</a> MSI</td>
<td>newest</td>
<td>Manual</td>
</tr>
<tr>
<td>Windows 10</td>
<td>x86, amd64</td>
<td><a href="https://cmake.org/download/">Kitware</a> ZIP</td>
<td>newest</td>
<td>Manual (no installer)</td>
</tr>
<tr>
<td>macOS 10.14+</td>
<td>universal</td>
<td><a href="https://formulae.brew.sh/formula/cmake#default">Homebrew</a></td>
<td>newest</td>
<td><code>brew upgrade</code></td>
</tr>
<tr>
<td>macOS 10.10+</td>
<td>universal</td>
<td><a href="https://cmake.org/download/">Kitware</a> DMG</td>
<td>newest</td>
<td>Manual</td>
</tr>
<tr>
<td>macOS 10.10+</td>
<td>universal</td>
<td><a href="https://cmake.org/download/">Kitware</a> TGZ</td>
<td>newest</td>
<td>Manual (no installer)</td>
</tr>
<tr>
<td>Ubuntu 16.04+, many other distros</td>
<td>x86, amd64, aarch64, armhf, ppc64el, s390x</td>
<td><a href="https://snapcraft.io/cmake">snap</a></td>
<td>newest</td>
<td>Fully automatic</td>
</tr>
<tr>
<td>Ubuntu 16.04+</td>
<td>x86, amd64</td>
<td><a href="https://apt.kitware.com">Kitware APT</a></td>
<td>newest</td>
<td><code>sudo apt upgrade</code></td>
</tr>
<tr>
<td>Ubuntu 20.04+</td>
<td>x86, amd64, aarch64, armhf</td>
<td><a href="https://apt.kitware.com">Kitware APT</a></td>
<td>newest</td>
<td><code>sudo apt upgrade</code></td>
</tr>
<tr>
<td>Ubuntu 20.04 LTS</td>
<td>x86, amd64, aarch64, armhf, ppc64el, s390x</td>
<td><a href="https://packages.ubuntu.com/focal/cmake">Ubuntu APT</a></td>
<td>3.16.3</td>
<td><code>sudo apt upgrade</code> (security only)</td>
</tr>
<tr>
<td>Linux (Generic)</td>
<td>amd64, aarch64</td>
<td><a href="https://cmake.org/download/">Kitware</a> TGZ</td>
<td>newest</td>
<td>Manual (no installer), only depends on libc6</td>
</tr>
<tr>
<td><strong>ALL</strong></td>
<td>x86, amd64, aarch64, armhf, ppc64el, s390x</td>
<td><a href="https://pypi.org/project/cmake/">pip</a></td>
<td>newest</td>
<td><code>pip install -U cmake</code></td>
</tr>
</tbody>
</table>
<p><strong>I can't stress this enough:</strong> Kitware's portable tarballs and shell script
installers <em>do not require administrator access</em>. CMake is perfectly happy to
run as the current user out of your downloads directory if that's where you want
to keep it. Even more impressive, the CMake binaries in the tarballs are
<em>statically linked and require only libc6 as a dependency</em>. Glibc has been
ABI-stable since 1997. <strong>It will work on your system.</strong></p>
<p>We on the Halide team use the CMake 3.20.2 tarballs from Kitware on a variety of
aging and new ARM hardware for our build infrastructure. We used to build CMake
from scratch, which was a little painful, but since upstream started providing
ARM binaries, it's been trivial.</p>
<p>There are good reasons for using modern CMake versions, too. Beyond broader
compiler and platform compatibility, newer CMake versions offer many more
features to help keep your builds simple and expressive. One of the best
examples is CMake's CUDA support. It has gone through several evolutions from a
find module to a full first-class language. Working with CUDA prior to CMake
3.17 is about as much fun as eating glass. The move away from package variables
to targets with transitive, propagating properties has turned ugly, error-prone
build scripts into simple, declarative build <em>descriptions</em>. We will touch on
many of these features in the next few parts.</p>
<p>So there is no problem with taking a minimum version of 3.20 (the latest at time
of writing). <em>Maybe</em> it's worth taking a minimum of 3.16 just because Ubuntu
20.04 LTS is such a hold-out, but anything earlier than that is plain masochism.</p>
<p>Another hard requirement is that you must <strong>never use a version of CMake older
than your compiler</strong>. Older versions of CMake won't somehow know how to work
with a compiler that was released later, and the command line defaults
for GCC, Clang, and other major compilers change frequently. The most basic
example of this is the default language version and set of supported language
versions. Other changes include the wording of errors and warnings that CMake
matches to detect compiler capabilities.</p>
<p>Thus, if you intend to use C++17 on Linux, you will need at least Clang 5
(released Sep 7, 2017) or GCC 7 (released May 2, 2017). You therefore cannot use
a minimum CMake version prior to 3.9.3 (released Sep 20, 2017). Versions prior
to 3.8 (released April 10, 2017) didn't even understand <code>17</code> as a possible
value of the <code>CXX_STANDARD</code> target property, so there was no correct way to
enable it. Rather than doing this tedious and ultimately pointless work of
determining the oldest potentially compatible versions, <em>just use the newest</em>.</p>
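<p>In practice, the top of such a project might look like the sketch below. The
project and file names are hypothetical; the <code>cxx_std_17</code> compile feature
is one of the things that CMake versions before 3.8 do not know about.</p>
<div class="highlight"><pre><span></span><code># Declare the version you actually develop and test with.
cmake_minimum_required(VERSION 3.20)
project(example LANGUAGES CXX)

add_executable(my_app main.cpp)

# Ask for "at least C++17" and let CMake work out the flags for the
# detected compiler. This feature name requires CMake 3.8 or newer.
target_compile_features(my_app PRIVATE cxx_std_17)
</code></pre></div>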
<h2>Validating Your Minimum Version</h2>
<p>No matter what minimum version you pick for whatever reason, it would be a major
mistake to simply set <code>cmake_minimum_required(VERSION 3.X)</code> and call it a day.
You <strong>must</strong> also test with the actual CMake 3.X release on your local
development machine and on CI.</p>
<p>Why? Simply because the policy mechanism ensures <em>backwards</em> compatibility,
not <em>forwards</em> compatibility. If you use a more recent CMake version, nothing
will stop you from using a feature that is too new for the declared minimum
version. This is very, very common, too. Here are three examples off the top of
my head that have bitten me:</p>
<ol>
<li>You might use a <a href="https://cmake.org/cmake/help/latest/manual/cmake-generator-expressions.7.html">generator expression</a> that was not in the old CMake version.
CMake will not even try to warn you about this, and many common and useful
generator expressions were introduced later than you think.</li>
<li>You might rely on newer features of commands unintentionally. In particular,
CMake 3.18+ searches <code>lib64</code> directories when using <code>HINTS</code> arguments
to <code>find_library</code>, but older ones don't. So code for old versions has to
check <code>CMAKE_SIZEOF_VOID_P</code> and add those paths to <code>HINTS</code> manually. I don't
think this is documented; I bisected to find that version number.</li>
<li>CMake's <a href="https://cmake.org/cmake/help/latest/manual/cmake-modules.7.html#find-modules">find modules</a> change behavior pretty frequently. Old versions might
not understand a newer library version's package layout.</li>
</ol>
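<p>For the second example, the manual workaround on older versions might look
roughly like this. It is only a sketch based on the behavior described above:
<code>MyDep_PREFIX</code>, <code>MyDep_LIBRARY</code>, and the library name <code>mydep</code>
are hypothetical, and the 3.18 cutoff is the bisected version from the list,
not a documented one.</p>
<div class="highlight"><pre><span></span><code># Must come after project(), since CMAKE_SIZEOF_VOID_P is only known
# once a language has been enabled.
set(mydep_hints "${MyDep_PREFIX}/lib")

if (CMAKE_VERSION VERSION_LESS 3.18 AND CMAKE_SIZEOF_VOID_P EQUAL 8)
  # Older CMake does not try lib64 for HINTS on its own, so add it by hand
  # on 64-bit targets.
  list(APPEND mydep_hints "${MyDep_PREFIX}/lib64")
endif ()

find_library(MyDep_LIBRARY NAMES mydep HINTS ${mydep_hints})
</code></pre></div>
<p>Nothing warns you when this block is missing, which is why the build has to
be run with the declared minimum version to notice.</p>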
<p>So another basic rule is to <strong>never declare a minimum version lower than the one
you actually test your build against.</strong> I have seen projects in the wild that
claim compatibility with ridiculously old versions of CMake, like 2.6. Not only
is it extremely unlikely that those builds actually work with 2.6, but newer
versions of CMake will also soon drop compatibility with versions before 2.8.12.
So this "increased" compatibility will in fact cost you users who are doing the
right thing by keeping up to date.</p>
<p>If you're setting up a CI pipeline, you should test your build with both the
absolute newest version of CMake, and the minimum required version. This will
allow you to very quickly catch backwards compatibility bugs and make upgrading
the version a breeze. I do this on GitHub Actions with the
<a href="https://github.com/marketplace/actions/actions-setup-cmake"><code>jwlawson/actions-setup-cmake</code></a> action. You can see such a
workflow <a href="https://github.com/leethomason/tinyxml2/blob/master/.github/workflows/test.yml">here</a> on <code>tinyxml2</code>, whose CMake build I recently helped
modernize.</p>
<h2>Conclusion</h2>
<p>These are the most important lessons from this post:</p>
<ol>
<li>Use the most recent CMake version. It is trivial to install and keep up to
date. If you <em>must</em> pick an older version, do it for a logical reason, not
because you're copying some ancient StackOverflow answer that set 3.5 as a
minimum.</li>
<li>Use a version of CMake at least as recent as your compiler version.</li>
<li>Always test your build with the actual CMake version you're taking as a
minimum.</li>
</ol>
<p>In <a href="/blog/how-to-use-cmake-without-the-agonizing-pain-part-2.html">part 2</a>, we'll talk about the contract between a CMake build and its many
consumers.</p>
<p>Until then, join the conversation <a href="https://www.reddit.com/r/cpp/comments/nitvir/how_to_use_cmake_without_the_agonizing_pain_part_1/">here on Reddit</a>!</p>
<hr>
<h2><em>Addendum</em></h2>
<h3>Distribution minimum versions</h3>
<p>Since publishing this article, I have heard from several readers that they
cannot upgrade their minimum versions because some particular Linux distribution
(e.g. Ubuntu 18.04 LTS, RHEL 7, etc.) packages an older version of CMake, and so
they must accept that version.</p>
<p>I stand by what I wrote. On the one hand, if the maintainers independently want
to include your package, then it's up to them to figure out how to use a newer
CMake version in their build process. If that means bootstrapping a newer CMake
version from source, so be it.</p>
<p>On the other hand, if you want to ask the maintainers to include your package,
and they won't let you use a newer version, you should instead ask yourself why
your package needs to be included in the <em>base distribution</em>. There are many
viable distribution methods on Linux these days. You could host your own APT or
RPM repository; you could release your package on pip or snap. Agreeing to
downgrade for the sake of <em>one distro</em> harms <em>all</em> of your users.</p>
<p>Lowest common denominator thinking is toxic to the progress of the C and C++
ecosystems. Distribution maintainers should periodically update CMake, even on
LTS releases. CMake is incredibly backwards compatible, but when there are
issues, there are also many recourses for a distribution maintainer: they can
package multiple CMake versions, they can patch a problematic package (and maybe
upstream the patch, which is better for everyone), or they can patch their
distribution of CMake. The vcpkg team occasionally has to <em>rewrite entire build
systems</em> for projects.</p>
<h3>Resources</h3>
<p>At the beginning of the article, I complained that there are no good learning
resources. Fortunately, this isn't quite true. So far as I know, the best places
to get high quality advice for writing CMake code are these:</p>
<ol>
<li>The <a href="https://discourse.cmake.org/">CMake Discourse</a> forum. The actual developers hang out and answer
questions here.</li>
<li>The <code>#cmake</code> channel on the <a href="https://cppalliance.org/slack/">CppLang Slack</a>. This is a very friendly
community for CMake users to think through build issues, ask beginner to
intermediate level questions, and share wisdom.</li>
<li>The book "<a href="https://crascit.com/professional-cmake/">Professional CMake</a>" by Craig Scott. Craig is a volunteer
maintainer of CMake, and he sells his book for $30 through his consulting
business. This is the most comprehensive and clearly written reference guide
for CMake. Even better, your purchase also includes updates to new editions
as the book is updated (and it is updated frequently). This is a must-have if
CMake is part of your job; you should convince your employer to purchase
copies for your team.</li>
</ol>
<p>If you don't want to buy Professional CMake or can't afford it, here are some
good <em>free</em> resources on the web:</p>
<ol>
<li>Craig gave a talk at CppCon 2019, "<a href="https://www.youtube.com/watch?v=m0DwB4OvDXk">Deep CMake for Library Authors</a>" that
covers issues including symbol visibility, library versioning, writing
install rules, and RPATH pitfalls.</li>
<li>Deniz Bahadir gave a pair of talks called "<a href="https://www.youtube.com/watch?v=y7ndUhdQuU8">More Modern CMake</a>"
and "<a href="https://www.youtube.com/watch?v=y9kSr5enrSk">Oh No! More Modern CMake</a>" at Meeting C++ 2018 and 2019, respectively.
These talks use CMake 3.12-3.14, so there are some things that are out of
date, but his explanation of the modern CMake targets system is very good.
We'll talk about dependencies soon, but I disagree with the approach here,
which sets <code>IMPORTED_GLOBAL</code> on targets created by <code>find_package</code> calls.
These talks are particularly valuable for showing the old, painful way of
doing things next to the new(er), less-painful way.</li>
<li>Robert Schumacher is the lead developer of <a href="https://github.com/microsoft/vcpkg">vcpkg</a> and has a lot of
experience dealing with every type of problematic build system. He's
also a great presenter and generally a smart guy, so I wholeheartedly
recommend his talks:<ol>
<li>"<a href="https://www.youtube.com/watch?v=sBP17HQAQjk">Don't Package Your Libraries, Write Packagable Libraries! Part 1</a>".</li>
<li>"<a href="https://www.youtube.com/watch?v=_5weX5mx8hc">Don't Package Your Libraries, Write Packagable Libraries! Part 2</a>".
<em>Note:</em> I disagree with his use of globbing in CMake, but his point about
projects being <em>globbable</em> is good.</li>
<li>"<a href="https://www.youtube.com/watch?v=Lb3hlLlHTrs">How to Herd 1,000 Libraries</a>"</li>
</ol>
</li>
</ol>
<h1>Building a Faster Triangular Solver than MKL</h1>
<p><em>Alex Reinking, 2021-03-20</em></p>
<style>
.highlight span.kt { color: #cc7832 !important; }
</style>
<p>A significant part of my research involves investigating algorithms with
interesting properties and then trying to optimize them to fully understand how
they work. One recent, and fairly successful, exploration was into triangular
substitution solvers. In this blog post, I'm going to explain the algorithm and
an unconventional recursive approach that broadly abstracts the design space for
possible optimizations.</p>
<p>The end result is a lower-triangular (forward substitution) linear equation
solver that beats MKL, at least on a (not too) simplified version of the
problem. If you just want the sources and none of the story, they
are <a href="https://github.com/alexreinking/strsv">available on GitHub</a>.</p>
<p>For the rest of this article, I'm going to assume you know entry level matrix
computations, i.e. how to multiply matrices and vectors, and how to do Gaussian
Elimination.</p>
<h2>BLAS and triangular solvers</h2>
<p><em>(If you are already familiar with BLAS or with <code>strsv</code> you can skip this
section.)</em></p>
<p>BLAS (Basic Linear Algebra Subprograms) is a specification of an interface of
common linear algebra operations, such as (most famously) matrix multiplication,
vector fused multiply-adds, and triangular substitution solvers. Most commonly,
BLAS-es are implemented in Fortran because of its superior aliasing semantics
(the <code>restrict</code> keyword is not necessary), but they are also available in C/C++
through the standard "cblas" interface.</p>
<p>The idea is that by specifying the interface and providing a reference
implementation, hardware vendors can produce optimized libraries for each of
their architectures. And boy, do they ever! There's a wealth of academic
literature and millions of dollars of commercial effort put into optimizing
these routines. Among the more notable implementations are <a href="https://www.openblas.net/">OpenBLAS</a> (which is
based on <a href="https://web.archive.org/web/20210224165545/https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2">GotoBLAS</a>), <a href="http://math-atlas.sourceforge.net/">ATLAS</a> (which tries to automatically tune itself to your
hardware), the AMD Optimizing CPU Libraries (aka <a href="https://developer.amd.com/amd-aocl/">AOCL</a>), NVIDIA's
GPU-based <a href="https://developer.nvidia.com/cublas">cuBLAS</a> and <a href="https://docs.nvidia.com/cuda/nvblas/">NVBLAS</a> (which are ridiculously fast), and most
famously, there's <a href="https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html">MKL</a>, which is widely regarded as the gold standard of x86
CPU BLAS implementations (well, at least <a href="https://www.agner.org/forum/viewtopic.php?f=1&t=6"><em>Intel's</em> x86</a>, though
this <a href="https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html">might be changing</a>). This list isn't exhaustive; supercomputing
vendors like <a href="https://docs.nersc.gov/development/libraries/libsci/">Cray</a> supply BLAS-es that are tuned to their hardware.</p>
<p>Now, as I explain in my <a href="https://www.youtube.com/watch?v=1ir_nEfKQ7A">CppCon 2020 talk</a>, BLAS and high-performance libraries
like it are fundamentally limited because they have to synchronize with main
memory in between each library call. There's no effective way to <em>fuse</em>
computations across stages. There are some C++ libraries, like <a href="https://eigen.tuxfamily.org/index.php?title=Main_Page">Eigen</a>
and <a href="http://arma.sourceforge.net/faq.html">Armadillo</a>, that use templates to get some amount of fusion. However, their
results are less consistent, and their optimizations are less dramatic (local
fusion is no match for global reorganization) than those of a full DSL designed for
the task, like the <a href="https://halide-lang.org">Halide</a> language I work on. More on Halide in a future post!</p>
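<p>To make the fusion point concrete, here's a minimal C++ sketch. It is not taken from any BLAS; the function names <code>scale</code>, <code>add</code>, <code>unfused</code>, and <code>fused</code> are made up for illustration. The unfused version mimics two back-to-back library calls, so the intermediate vector makes a full round trip through memory, while the fused version keeps each intermediate value in a register:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "library call" #1: returns a * x as a freshly written vector.
std::vector<float> scale(const std::vector<float>& x, float a) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i];
    return out;
}

// Hypothetical "library call" #2: returns x + y (assumes equal lengths).
std::vector<float> add(const std::vector<float>& x, const std::vector<float>& y) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + y[i];
    return out;
}

// Unfused: tmp is written out in full by scale() and read back in by add().
std::vector<float> unfused(float a, const std::vector<float>& x,
                           const std::vector<float>& y) {
    std::vector<float> tmp = scale(x, a);  // first full pass over memory
    return add(tmp, y);                    // second full pass over memory
}

// Fused: a single pass; a * x[i] never leaves a register.
std::vector<float> fused(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + y[i];
    return out;
}
```

<p>Both versions compute the same result; only the memory traffic differs. A library can offer a fused routine for one specific pattern like this one, but it cannot anticipate every chain of operations a caller might string together.</p>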
<p>Still, most BLAS-es do a very good job of optimizing their routines. Matrix
multiplication in particular is an excellent exercise for anyone interested in
understanding machine performance because there are <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>3</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^3)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> floating-point
operations (FLOPs) to schedule against only <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^2)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> data. This endows the
problem with a very rich design space. In fact, here at UC Berkeley, it is the
first assignment in our graduate parallel computing course. If you're
interested, the homework materials
are <a href="https://sites.google.com/lbl.gov/cs267-spr2020/hw-1?authuser=0">here</a>. <em>(By
the way, I'm proud to say that while writing this article I learned that my work
as a teaching assistant on the Spring 2020 edition of the course earned me an
"Outstanding GSI Award" from the EECS department.)</em></p>
<p>The API that we're discussing today is <code>strsv</code>. The problem it solves is the
matrix equation <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi><mi>x</mi><mo>=</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">Ax=b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span> where <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi></mrow><annotation encoding="application/x-tex">A</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span></span></span></span> is a square <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>×</mo><mi>n</mi></mrow><annotation encoding="application/x-tex">n \times n</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span></span></span></span> matrix, and <span 
class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi></mrow><annotation encoding="application/x-tex">b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span> is a
vector of length <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span></span></span></span>. <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi></mrow><annotation encoding="application/x-tex">A</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span></span></span></span> is assumed to be <em>triangular</em>, which allows fast
solving because simple, direct <em>substitution</em> may be used. Here's a quick
example; suppose we have the following equation:
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi><mi>x</mi><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>3</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>4</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>2</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>1</mn></msub></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>2</mn></msub></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>3</mn></msub></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mo>=</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">
Ax = \begin{pmatrix}
1 & 0 & 0 \\
3 & 1 & 0 \\
4 & 2 & 1
\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = b
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">3</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">4</span></span></span></span><span 
class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span 
style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span 
class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord 
mathdefault">b</span></span></span></span></span>
Because <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi></mrow><annotation encoding="application/x-tex">A</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span></span></span></span> is lower-triangular, we can immediately tell that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mn>1</mn></msub><mo>=</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">x_1 = 1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span>. We can
very quickly eliminate <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">x_1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> in the other rows, by just multiplying <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">x_1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> by the
first-column coefficient of each row and subtracting the product from the later entries
of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi></mrow><annotation encoding="application/x-tex">b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span>. So we'll subtract <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>3</mn></mrow><annotation encoding="application/x-tex">3</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">3</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>4</mn></mrow><annotation encoding="application/x-tex">4</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">4</span></span></span></span> from the second and third entries to get:
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>2</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>1</mn></msub></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>2</mn></msub></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>3</mn></msub></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mrow><mo>−</mo><mn>2</mn></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mrow><mo>−</mo><mn>3</mn></mrow></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow></mrow><annotation encoding="application/x-tex">
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 2 & 1
\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 1 \\ -2 \\ -3 \end{pmatrix}
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span 
class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em;"><span></span></span></span></span></span></span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span 
style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" 
style="height:3em;"></span><span class="mord"><span class="mord">−</span><span class="mord">2</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">−</span><span class="mord">3</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span></span></span></span></span></p>
<blockquote>
<p>For a quick proof sketch of why this works, notice that each row
operation is equivalent to a matrix multiplication. In this case, the
matrices <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>1</mn></msub><mo separator="true">,</mo><msub><mi>R</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">R_1, R_2</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.8777699999999999em;vertical-align:-0.19444em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> (below) applied to both sides of the equation (on the
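left, since matrix multiplication is not commutative) give us the equation
above. Each such matrix is invertible, since adding the same multiple of the
row back undoes the operation, so multiplying by it neither adds nor removes
solutions.</p>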
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>1</mn></msub><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mrow><mo>−</mo><mn>3</mn></mrow></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mtext> </mtext><msub><mi>R</mi><mn>2</mn></msub><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mrow><mo>−</mo><mn>4</mn></mrow></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow></mrow><annotation 
encoding="application/x-tex">R_1 = \begin{pmatrix}
1 & 0 & 0 \\
-3 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix} \; R_2 = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
-4 & 0 & 1
\end{pmatrix}</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span 
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">−</span><span class="mord">3</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span 
class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" 
style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">−</span><span class="mord">4</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" 
style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" 
style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span></span></span></span></span></p>
<p>That is, the equation <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi><mi>x</mi><mo>=</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">Ax = b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span> has the same solution as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><msub><mi>R</mi><mn>2</mn></msub><msub><mi>R</mi><mn>1</mn></msub><mo stretchy="false">)</mo><mi>A</mi><mi>x</mi><mo>=</mo><mo stretchy="false">(</mo><msub><mi>R</mi><mn>2</mn></msub><msub><mi>R</mi><mn>1</mn></msub><mo stretchy="false">)</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">(R_2 R_1) A x =
(R_2 R_1) b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mord mathdefault">A</span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span 
class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mord mathdefault">b</span></span></span></span>.</p>
</blockquote>
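<p>As a rough check of the claim in the note above, here is a minimal pure-Python
sketch that applies the two row operations as left-multiplications. The starting
values <code>A</code> and <code>b</code> are assumptions, reconstructed by undoing
the two row operations on the system shown above, and the helper
<code>matmul</code> and the names <code>R1_inv</code> and <code>R2_inv</code> are
hypothetical, introduced only for illustration.</p>

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

# Assumed starting system (reconstructed, not quoted from the text).
A = [[1, 0, 0],
     [3, 1, 0],
     [4, 2, 1]]
b = [[1], [1], [1]]

# Elementary matrices from the note: subtract 3x row 1 from row 2,
# and subtract 4x row 1 from row 3.
R1 = [[1, 0, 0], [-3, 1, 0], [0, 0, 1]]
R2 = [[1, 0, 0], [0, 1, 0], [-4, 0, 1]]

# Apply both operations on the left of each side of A x = b.
R = matmul(R2, R1)
A_step = matmul(R, A)
b_step = matmul(R, b)

# Each row operation is undone by adding the same multiple back,
# so the inverses just flip the sign of the off-diagonal entry.
R1_inv = [[1, 0, 0], [3, 1, 0], [0, 0, 1]]
R2_inv = [[1, 0, 0], [0, 1, 0], [4, 0, 1]]
restored = matmul(matmul(R1_inv, R2_inv), A_step)

print(A_step)
print(b_step)
print(restored == A)
```

<p>Because the product of the inverses recovers <code>A</code>, every solution of
the transformed system is also a solution of the original one, which is the
invertibility argument the sketch relies on.</p>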
<p>In the final step, we eliminate the second column:
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>1</mn></msub></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>2</mn></msub></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><msub><mi>x</mi><mn>3</mn></msub></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mrow><mo>−</mo><mn>2</mn></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow></mrow><annotation encoding="application/x-tex">
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span 
class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em;"><span></span></span></span></span></span></span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span 
style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" 
style="height:3em;"></span><span class="mord"><span class="mord">−</span><span class="mord">2</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span></span></span></span></span>
We can check this answer, too:
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>A</mi><mi>x</mi><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>3</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>0</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>4</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>2</mn></mstyle></mtd><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mrow><mo>−</mo><mn>2</mn></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mo>=</mo><mrow><mo fence="true">(</mo><mtable columnspacing="1em" rowspacing="0.15999999999999992em"><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr><mtr><mtd><mstyle displaystyle="false" scriptlevel="0"><mn>1</mn></mstyle></mtd></mtr></mtable><mo fence="true">)</mo></mrow><mo>=</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">
Ax = \begin{pmatrix}
1 & 0 & 0 \\
3 & 1 & 0 \\
4 & 2 & 1
\end{pmatrix} \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}
= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = b
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">A</span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">3</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">4</span></span></span></span><span 
class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">0</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span 
style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span 
class="mord"><span class="mord">−</span><span class="mord">2</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:3.60004em;vertical-align:-1.55002em;"></span><span class="minner"><span class="mopen"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎝</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span 
class="delimsizinginner delim-size4"><span>⎜</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎛</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.55002em;"><span></span></span></span></span></span></span><span class="mord"><span class="mtable"><span class="col-align-c"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05em;"><span style="top:-4.21em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-3.0099999999999993em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-1.8099999999999994em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.5500000000000007em;"><span></span></span></span></span></span></span></span><span class="mclose"><span class="delimsizing mult"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:2.05002em;"><span style="top:-2.2500000000000004em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎠</span></span></span><span style="top:-2.8100000000000005em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎟</span></span></span><span style="top:-4.05002em;"><span class="pstrut" style="height:3.1550000000000002em;"></span><span class="delimsizinginner delim-size4"><span>⎞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:1.55002em;"><span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span></span></p>
<p>Hooray! In the next section we'll go over the algorithm in the abstract and
write a naive implementation.</p>
<h2>Solver algorithm and interface</h2>
<p>So what does this look like as a formal algorithm? Well, what did we do on
paper? We swept across the columns from left to right, each time
using the newly solved value in <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span> to update the unsolved part. As a "plain"
English algorithm, it looks like this:</p>
<blockquote>
<ol>
<li><strong>Solving:</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi><mi>x</mi><mo>=</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">Lx = b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span></li>
<li><strong>Set</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>=</mo><mi>b</mi></mrow><annotation encoding="application/x-tex">x = b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span></li>
<li><strong>For each column</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi></mrow><annotation encoding="application/x-tex">j</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span></span></span></span> <strong>of</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span>:<ol>
<li><strong>Set</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mi>j</mi></msub><mo>←</mo><msub><mi>x</mi><mi>j</mi></msub><mi mathvariant="normal">/</mi><msub><mi>L</mi><mrow><mi>j</mi><mi>j</mi></mrow></msub></mrow><annotation encoding="application/x-tex">x_j \leftarrow x_j / L_{jj}</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.716668em;vertical-align:-0.286108em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1.036108em;vertical-align:-0.286108em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mord">/</span><span class="mord"><span class="mord mathdefault">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span></span></span></span>. </li>
<li><strong>For each row</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.65952em;vertical-align:0em;"></span><span class="mord mathdefault">i</span></span></span></span> <strong>in the column</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi></mrow><annotation encoding="application/x-tex">j</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span></span></span></span>, <strong>starting with</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">j+1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span>:<ol>
<li><strong>Update</strong> <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mi>i</mi></msub><mo>←</mo><msub><mi>x</mi><mi>i</mi></msub><mo>−</mo><msub><mi>x</mi><mi>j</mi></msub><mo>⋅</mo><msub><mi>L</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow><annotation encoding="application/x-tex">x_i \leftarrow x_i - x_j \cdot L_{ij}</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight">i</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">←</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.73333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight">i</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.730558em;vertical-align:-0.286108em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.969438em;vertical-align:-0.286108em;"></span><span class="mord"><span class="mord mathdefault">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathdefault mtight">i</span><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span></span></span></span></li>
</ol>
</li>
</ol>
</li>
</ol>
</blockquote>
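<p>As a sanity check, here is a trace of those steps on the example from the
previous section, taking the matrix in the algorithm to be the example matrix
and the right-hand side to be all ones. The indices are 1-based to match the
math above, and the divisions are all by 1 because the diagonal happens to be
all ones.</p>

```latex
\begin{aligned}
\text{start:}\quad & x \leftarrow b = (1,\ 1,\ 1)^T \\
j = 1:\quad & x_1 \leftarrow 1 / 1 = 1, \quad
              x_2 \leftarrow 1 - 1 \cdot 3 = -2, \quad
              x_3 \leftarrow 1 - 1 \cdot 4 = -3 \\
j = 2:\quad & x_2 \leftarrow -2 / 1 = -2, \quad
              x_3 \leftarrow -3 - (-2) \cdot 2 = 1 \\
j = 3:\quad & x_3 \leftarrow 1 / 1 = 1 \\
\text{result:}\quad & x = (1,\ -2,\ 1)^T
\end{aligned}
```

<p>This agrees with the answer we checked by hand.</p>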
<p>Now how do we turn this into code? For the sake of space (and my sanity writing
and optimizing this stuff), we'll make the following simplifying assumptions:</p>
<ol>
<li>The matrix <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span> is <em>lower</em> triangular.</li>
<li>The matrix <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span> has all <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1</mn></mrow><annotation encoding="application/x-tex">1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span>s on its diagonal. This lets us skip the division
in step (3.1) above.</li>
<li>The matrix <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span> is stored in <em>column-major</em> order.</li>
<li>The matrix <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span> is stored in a large, dense array in natural order; the upper
half might contain useful information (like an <em>upper</em> triangular matrix), so
we cannot overwrite it or assume it to be zero.</li>
<li>The vector <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi></mrow><annotation encoding="application/x-tex">b</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault">b</span></span></span></span> is stored in a normal array and may be overwritten with the
solution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span>.</li>
<li>We're running on a single CPU core.</li>
</ol>
<p>Under these assumptions, the naive translation into plain C is:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kt">void</span><span class="w"> </span><span class="nf">naive_solver</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">L</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">j</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span>
<span class="p">}</span>
</code></pre></div></td></tr></table></div>
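<p>As a quick sanity check, here's a small sketch that runs the same loop nest on
a made-up 3×3 system. The helper <code>solve_example</code> and the matrix values are
invented for illustration; the 99s in the upper triangle stand in for the
"useful information" that the solver is not allowed to read or overwrite.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Same loop nest as naive_solver above, repeated so this sketch compiles on its own. */
void naive_solver(int n, float* L, float* x) {
    for (int j = 0; j < n; ++j) {
        for (int i = j + 1; i < n; ++i) {
            x[i] -= x[j] * L[i + n * j];
        }
    }
}

/* Hypothetical helper (not from the post): solves a made-up 3x3 system in place.
 * The matrix is
 *     [1 0 0]
 *     [2 1 0]
 *     [3 4 1]
 * and the right-hand side is b = L * (1, 2, 3) = (1, 4, 14). */
void solve_example(float* x) {
    assert(x != NULL);
    /* Column-major storage: each row of this initializer is one COLUMN of L.
     * The 99s sit in the upper triangle and are never read by the solver. */
    float L[9] = {
        1.0f,  2.0f,  3.0f,   /* column 0 */
        99.0f, 1.0f,  4.0f,   /* column 1 */
        99.0f, 99.0f, 1.0f    /* column 2 */
    };
    x[0] = 1.0f;
    x[1] = 4.0f;
    x[2] = 14.0f;
    naive_solver(3, L, x);    /* x is overwritten with the solution */
}
```

<p>Since <code>b</code> was built from the vector (1, 2, 3), that is what <code>x</code> should
hold after the call. Note that the diagonal entries are never read either, which
is exactly what the "all 1s on the diagonal" assumption buys us.</p>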
<p>These assumptions are so common that the BLAS API for this operation takes
extra arguments that tell the implementation which of them hold. Here's the full
signature in C:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span>
<span class="normal">8</span>
<span class="normal">9</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_ORDER</span><span class="w"> </span><span class="p">{</span><span class="n">CblasRowMajor</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span><span class="w"> </span><span class="n">CblasColMajor</span><span class="o">=</span><span class="mi">102</span><span class="p">};</span>
<span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_TRANSPOSE</span><span class="w"> </span><span class="p">{</span><span class="n">CblasNoTrans</span><span class="o">=</span><span class="mi">111</span><span class="p">,</span><span class="w"> </span><span class="n">CblasTrans</span><span class="o">=</span><span class="mi">112</span><span class="p">,</span><span class="w"> </span><span class="n">CblasConjTrans</span><span class="o">=</span><span class="mi">113</span><span class="p">};</span>
<span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_UPLO</span><span class="w"> </span><span class="p">{</span><span class="n">CblasUpper</span><span class="o">=</span><span class="mi">121</span><span class="p">,</span><span class="w"> </span><span class="n">CblasLower</span><span class="o">=</span><span class="mi">122</span><span class="p">};</span>
<span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_DIAG</span><span class="w"> </span><span class="p">{</span><span class="n">CblasNonUnit</span><span class="o">=</span><span class="mi">131</span><span class="p">,</span><span class="w"> </span><span class="n">CblasUnit</span><span class="o">=</span><span class="mi">132</span><span class="p">};</span>
<span class="kt">void</span><span class="w"> </span><span class="nf">cblas_strsv</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_ORDER</span><span class="w"> </span><span class="n">order</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_UPLO</span><span class="w"> </span><span class="n">Uplo</span><span class="p">,</span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_TRANSPOSE</span><span class="w"> </span><span class="n">TransA</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="k">enum</span><span class="w"> </span><span class="n">CBLAS_DIAG</span><span class="w"> </span><span class="n">Diag</span><span class="p">,</span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">lda</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">X</span><span class="p">,</span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">incX</span><span class="p">);</span>
</code></pre></div></td></tr></table></div>
<p>The name <code>strsv</code> encodes a few facts about the function. The leading <code>s</code> stands
for "single-precision" and the trailing <code>v</code> stands for "vector". The base name
of the function is therefore <code>trs</code>, which is short for "<strong>tr</strong>iangular
<strong>s</strong>olve". Thus, the function solves a triangular matrix-vector equation in
single precision (i.e. <code>float</code>).</p>
<p>The <code>order</code> argument determines whether the input matrix will be treated as
row-major or column-major. To be column-major simply means that adding 1 to the
pointer into the matrix will move <em>down</em> one row (i.e. within the current column);
similarly, row-major means that adding 1 moves to the <em>right</em> one column. The
<code>Uplo</code> argument tells the implementation whether we're giving it a lower or
upper triangular matrix. The <code>TransA</code> argument allows the user to ask that BLAS
implicitly transpose (or conjugate transpose in the case of complex values)
the matrix while solving. Finally, the <code>Diag</code> argument tells <code>strsv</code> whether the main
diagonal is all 1s.</p>
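<p>To make the two storage orders concrete, here is a minimal sketch of the index
arithmetic. The helper names <code>col_major_index</code> and <code>row_major_index</code> are
hypothetical (they are not part of BLAS), and <code>ld</code> is assumed to be the number
of elements between the starts of consecutive columns (column-major) or rows
(row-major).</p>

```c
#include <assert.h>

/* Hypothetical helper: offset of element (i, j) when columns are contiguous.
 * Adding 1 to the offset moves down one row within the same column. */
int col_major_index(int i, int j, int ld) {
    assert(i >= 0 && j >= 0 && ld > 0);
    return i + ld * j;
}

/* Hypothetical helper: offset of element (i, j) when rows are contiguous.
 * Adding 1 to the offset moves right one column within the same row. */
int row_major_index(int i, int j, int ld) {
    assert(i >= 0 && j >= 0 && ld > 0);
    return j + ld * i;
}
```

<p>With <code>ld</code> equal to <code>n</code>, the column-major formula is the same
<code>i + n * j</code> expression that the naive solver uses.</p>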
<p>So we can implement a function with the same signature and contract as above
using the BLAS library like so:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kt">void</span><span class="w"> </span><span class="nf">blas_solver</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">L</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">cblas_strsv</span><span class="p">(</span><span class="n">CblasColMajor</span><span class="p">,</span><span class="w"> </span><span class="n">CblasLower</span><span class="p">,</span><span class="w"> </span><span class="n">CblasNoTrans</span><span class="p">,</span><span class="w"> </span><span class="n">CblasUnit</span><span class="p">,</span>
<span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">L</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></td></tr></table></div>
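<p>The two arguments not discussed above are <code>lda</code> and <code>incX</code>. In BLAS,
<code>lda</code> is the leading dimension of <code>A</code> (for column-major data, the distance
in elements between consecutive columns) and <code>incX</code> is the stride between
consecutive elements of <code>X</code>. The sketch below shows how they would enter the
naive loop. The function <code>ref_trsv_lower_unit</code> is a made-up name, it only
covers the column-major, lower, no-transpose, unit-diagonal case with a positive
stride, and it says nothing about how any real BLAS implements the routine.</p>

```c
#include <assert.h>

/* Hypothetical reference loop for the one case used in this post:
 * column-major, lower triangular, no transpose, unit diagonal.
 * Assumes incx > 0; negative strides are deliberately not handled here. */
void ref_trsv_lower_unit(int n, const float* A, int lda, float* x, int incx) {
    assert(n >= 0 && lda >= 1 && lda >= n && incx > 0);
    for (int j = 0; j < n; ++j) {
        for (int i = j + 1; i < n; ++i) {
            /* Element (i, j) of A lives at A[i + lda * j];
             * logical element k of x lives at x[k * incx]. */
            x[i * incx] -= x[j * incx] * A[i + lda * j];
        }
    }
}
```

<p>Passing <code>lda = n</code> and <code>incx = 1</code> reduces this to the naive solver, which
matches the <code>n, L, n, x, 1</code> arguments in the call above. A larger <code>lda</code>
lets the same loop solve against a triangular block that sits inside a bigger
matrix.</p>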
<p>Now is a good time to benchmark these two implementations to get some idea of
how far off we are.</p>
<h2>Benchmarking setup</h2>
<p>First things first: we need to understand how much work we're doing. It's pretty
clear that we're doing <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^2)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> operations, but it's easy enough to get an exact
count. If we look at the naive algorithm, we'll notice that the innermost update
consists of two floating-point operations: (1) the multiplication between <code>x[j]</code>
and <code>L[i + n * j]</code>, and (2) the subtraction of the resulting value from <code>x[i]</code>.
Then the inner loop runs between <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">j+1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span> (inclusive) and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span></span></span></span> (exclusive). That's
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>−</mo><mi>j</mi><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">n-j-1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span> iterations in total. 
The outer loop runs between <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi><mo>=</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">j=0</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">0</span></span></span></span> to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi><mo>=</mo><mi>n</mi><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">j=n-1</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span>. In
math terms, the total number of floating-point operations is:</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mo>⋅</mo><munderover><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>0</mn></mrow><mrow><mi>n</mi><mo>−</mo><mn>1</mn></mrow></munderover><mi>n</mi><mo>−</mo><mi>j</mi><mo>−</mo><mn>1</mn><mo>=</mo><mn>2</mn><mo>⋅</mo><munderover><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>0</mn></mrow><mrow><mi>n</mi><mo>−</mo><mn>1</mn></mrow></munderover><mi>j</mi><mo>=</mo><mn>2</mn><mo>⋅</mo><mfrac><mrow><mi>n</mi><mo>⋅</mo><mo stretchy="false">(</mo><mi>n</mi><mo>−</mo><mn>1</mn><mo stretchy="false">)</mo></mrow><mn>2</mn></mfrac><mo>=</mo><mi>n</mi><mo>⋅</mo><mo stretchy="false">(</mo><mi>n</mi><mo>−</mo><mn>1</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">
2 \cdot \sum_{j=0}^{n-1} n-j-1 = 2 \cdot \sum_{j=0}^{n-1} j
= 2 \cdot \frac{n \cdot (n-1)}{2} = n \cdot (n-1)
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">2</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:3.2148900000000005em;vertical-align:-1.4137769999999998em;"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.8011130000000006em;"><span style="top:-1.872331em;margin-left:0em;"><span class="pstrut" style="height:3.05em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span><span class="mrel mtight">=</span><span class="mord mtight">0</span></span></span></span><span style="top:-3.050005em;"><span class="pstrut" style="height:3.05em;"></span><span><span class="mop op-symbol large-op">∑</span></span></span><span style="top:-4.3000050000000005em;margin-left:0em;"><span class="pstrut" style="height:3.05em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathdefault mtight">n</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.4137769999999998em;"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord 
mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">2</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:3.2148900000000005em;vertical-align:-1.4137769999999998em;"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.8011130000000006em;"><span style="top:-1.872331em;margin-left:0em;"><span class="pstrut" style="height:3.05em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathdefault mtight" style="margin-right:0.05724em;">j</span><span class="mrel mtight">=</span><span class="mord mtight">0</span></span></span></span><span style="top:-3.050005em;"><span class="pstrut" style="height:3.05em;"></span><span><span class="mop op-symbol large-op">∑</span></span></span><span style="top:-4.3000050000000005em;margin-left:0em;"><span class="pstrut" style="height:3.05em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathdefault mtight">n</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:1.4137769999999998em;"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">2</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:2.113em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.427em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mopen">(</span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mord">1</span><span class="mclose">)</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.44445em;vertical-align:0em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">1</span><span class="mclose">)</span></span></span></span></span></p>
<p>So to solve an instance with an <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>×</mo><mi>n</mi></mrow><annotation encoding="application/x-tex">n\times n</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span></span></span></span> matrix, we must perform
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>⋅</mo><mo stretchy="false">(</mo><mi>n</mi><mo>−</mo><mn>1</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">n\cdot (n-1)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.44445em;vertical-align:0em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">1</span><span class="mclose">)</span></span></span></span> floating-point operations.</p>
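<p>If you'd rather not trust the algebra, the count is easy to check by brute
force. This sketch walks the same loop nest as the naive solver and adds two
operations per inner iteration, then compares against the closed form. Both
function names are made up for this example.</p>

```c
#include <assert.h>

/* Hypothetical helper: counts floating-point operations by walking the
 * same loop nest as the naive solver (one multiply and one subtract per
 * inner iteration). */
long long count_flops(int n) {
    assert(n >= 0);
    long long flops = 0;
    for (int j = 0; j < n; ++j) {
        for (int i = j + 1; i < n; ++i) {
            flops += 2;
        }
    }
    return flops;
}

/* Hypothetical helper: the closed form n * (n - 1) derived above. */
long long closed_form_flops(int n) {
    assert(n >= 0);
    return (long long)n * (n - 1);
}
```

<p>For <code>n = 4</code>, the inner loop runs 3 + 2 + 1 + 0 = 6 times, so both functions
return 12.</p>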
<p>We're going to use AVX2 to optimize this routine because it's still a bit more
widely available than AVX-512 (and because it doesn't have quite so extreme CPU
frequency offsets). I have benchmarking set up on GitHub Actions. At time of
writing, the cloud runners have <a href="https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%208171M.html">Xeon 8171M</a> CPUs clocked down to 2.3GHz. I also
tested locally on my <a href="https://www.cpu-world.com/CPUs/Core_i9/Intel-Core%20i9%20i9-7900X.html">i9-7900X</a> workstation. Both CPUs are Skylake, so I compile
with <code>-march=skylake</code> on GCC.</p>
<p>We're going to test against both <a href="https://www.openblas.net/">OpenBLAS</a> and <a href="https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html">MKL</a>. By default, both BLASes
dispatch the APIs to hardware-specific implementations by sniffing CPU flags.
Since the GitHub Actions runners support AVX-512, the libraries would likely
select AVX-512 kernels there, which would skew a comparison against our AVX2 code.
Fortunately, both BLASes offer ways to override this. When compiling OpenBLAS,
we may set <code>-DTARGET=HASWELL</code> on the CMake command line. For MKL, we can run
<code>export MKL_ENABLE_INSTRUCTIONS=AVX2</code>. To keep things on one core, we can
<code>export OPENBLAS_NUM_THREADS=1</code> and link to the sequential MKL library.</p>
<p>To get a full picture of performance, we'll test on a variety of matrix sizes so
that we can see how we perform when the data fits inside L1, L2, or L3 cache,
plus when it spills out into RAM. The L3 cache of the GitHub Actions chips is
35.75MB in size. Without getting too much into the math, there are 4 bytes per
<code>float</code> and less than <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>n</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">n^2</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.8141079999999999em;vertical-align:0em;"></span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span> data in our working set. So using matrices at least
as large as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>3000</mn><mo>×</mo><mn>3000</mn></mrow><annotation encoding="application/x-tex">3000\times 3000</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord">3</span><span class="mord">0</span><span class="mord">0</span><span class="mord">0</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">3</span><span class="mord">0</span><span class="mord">0</span><span class="mord">0</span></span></span></span> will exceed L3. To be safe, we'll use <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>=</mo><mn>4096</mn></mrow><annotation encoding="application/x-tex">n=4096</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">4</span><span class="mord">0</span><span class="mord">9</span><span class="mord">6</span></span></span></span> as
the upper bound.</p>
<p>Finally, I'll use <a href="https://github.com/google/benchmark">Google Benchmark</a> to compute performance numbers and use the
formula we derived above to scale raw time into FLOPS.</p>
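<p>The scaling step itself is just a division. Here's a minimal sketch, assuming
we already have the wall-clock time of a single solve in seconds; the function
name <code>gflops</code> is made up and is not part of Google Benchmark.</p>

```c
#include <assert.h>

/* Hypothetical helper: converts the time of one n-by-n solve into GFLOPS
 * using the n * (n - 1) operation count derived above. */
double gflops(int n, double seconds) {
    assert(n >= 0 && seconds > 0.0);
    double flops = (double)n * (double)(n - 1);
    return flops / seconds / 1e9;
}
```

<p>For instance, with a made-up timing of 2 ms for <code>n = 4096</code>,
<code>gflops(4096, 0.002)</code> comes out to roughly 8.4.</p>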
<p>So here's our baseline:</p>
<p><img src="/images/blog/strsv-gha-baseline.png"
alt="strsv performance baseline on GitHub Actions"
class="m-auto block"
width="640"
height="480"></p>
<p>Keeping in mind that GCC has already auto-vectorized the naive implementation,
there doesn't seem to be a lot of headroom here. Roughly speaking, it looks like
the naive solution runs at about 8 GFLOPS, while MKL runs at around 12 GFLOPS, or
50% faster. OpenBLAS is generally slower, but seems to do slightly better than
MKL when the size of the matrix is just about to escape the computer's L3 cache.
Naturally, once we hit RAM, the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^2)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> work just isn't enough to hide the
latency of the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^2)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> memory accesses. This is in stark contrast to matrix
multiplication, which has <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>3</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^3)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> work to do.</p>
<h2>A curious recursion</h2>
<p>While analyzing the algorithm, I made one key observation: <strong>at the start of the
inner loop on iteration <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi></mrow><annotation encoding="application/x-tex">j</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span></span></span></span> of the outer loop, all the values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>x</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">x_i</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight">i</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi><mo>≤</mo><mi>j</mi></mrow><annotation encoding="application/x-tex">i
\leq j</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.79549em;vertical-align:-0.13597em;"></span><span class="mord mathdefault">i</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span></span></span></span> are finalized.</strong> Thus, we can reformulate the problem into a
<em>recursive algorithm</em> that solves the <strong>top <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi><mo>×</mo><mi>k</mi></mrow><annotation encoding="application/x-tex">k \times k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.77777em;vertical-align:-0.08333em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span></span></span></span> triangle</strong> first, then
uses the first <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span></span></span></span> entries of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span> along with the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>−</mo><mi>k</mi><mo>×</mo><mi>k</mi></mrow><annotation encoding="application/x-tex">n-k \times k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.77777em;vertical-align:-0.08333em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" 
style="margin-right:0.03148em;">k</span></span></span></span> <em>rectangle</em>
below that triangle to update the remaining <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>−</mo><mi>k</mi></mrow><annotation encoding="application/x-tex">n-k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span></span></span></span> entries of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span>. Finally, we
can solve the <strong>right <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mo>−</mo><mi>k</mi><mo>×</mo><mi>n</mi><mo>−</mo><mi>k</mi></mrow><annotation encoding="application/x-tex">n-k \times n-k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.77777em;vertical-align:-0.08333em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.66666em;vertical-align:-0.08333em;"></span><span class="mord mathdefault">n</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span></span></span></span> triangle</strong> with the updated bottom part
of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span>.</p>
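<p>In block form, the same observation might be written as follows. The block names
(L<sub>11</sub> for the top triangle, L<sub>21</sub> for the rectangle, L<sub>22</sub> for the
remaining triangle) and the right-hand side b are my notation for this sketch, not symbols
used elsewhere in the post.</p>

```latex
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\quad\Longrightarrow\quad
L_{11}\, x_1 = b_1,
\qquad
L_{22}\, x_2 = b_2 - L_{21}\, x_1
```

<p>Here L<sub>11</sub> is k by k, L<sub>21</sub> is (n-k) by k, and L<sub>22</sub> is (n-k) by
(n-k), so the two triangular solves are smaller instances of the original problem.</p>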
<p>This is a sort of <em>divide and conquer</em> approach to this problem. When I came up
with it, I had never seen it before, but when I started poking around, I found
some recent work by Elmar Peise and Paolo Bientinesi: <a href="https://arxiv.org/pdf/1602.06763.pdf">"Recursive Algorithms for
Dense Linear Algebra: The ReLAPACK Collection"</a>. On the one hand, this
was disappointing because my idea wasn't actually novel (hence, a blog post
rather than a research paper); on the other hand, this was encouraging because
it meant I was on the right track. Such is life.</p>
<p>Anyway, the next insight is that the way you combine the lower rectangle with
the solved part of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span> is to compute a matrix-vector product between them and
subtract the result from the unsolved part of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span>. To see that, look at the
computation we're doing:</p>
<div class="highlight"><pre><span></span><code><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
</code></pre></div>
<p>Now <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>j</mi></mrow><annotation encoding="application/x-tex">j</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.85396em;vertical-align:-0.19444em;"></span><span class="mord mathdefault" style="margin-right:0.05724em;">j</span></span></span></span> ranges over <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">[</mo><mn>0</mn><mo separator="true">,</mo><mi>k</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">[0, k)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">[</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span><span class="mclose">)</span></span></span></span>, because we already handled the top triangle. We
also know that <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.65952em;vertical-align:0em;"></span><span class="mord mathdefault">i</span></span></span></span> ranges from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span></span></span></span> to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span></span></span></span>. This code then becomes the following,
in numpy-esque vector notation:</p>
<div class="highlight"><pre><span></span><code><span class="n">x</span><span class="p">[</span><span class="n">k</span><span class="p">:</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">k</span><span class="p">:</span><span class="n">n</span><span class="p">]</span> <span class="o">-</span> <span class="n">L</span><span class="p">[</span><span class="n">k</span><span class="p">:</span><span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">:</span><span class="n">k</span><span class="p">]</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">k</span><span class="p">]</span>
</code></pre></div>
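<p>Spelled out as plain loops over a column-major matrix, that update might look like the
sketch below. The helper name <code>rect_update</code> is hypothetical and only meant to show
which entries are read and written.</p>

```cpp
#include <cassert>
#include <vector>

// Computes x[k:n] -= L[k:n, 0:k] * x[0:k] for a column-major matrix L with
// leading dimension lda. Illustrative helper, not code from the post.
void rect_update(int n, int k, const float *L, int lda, float *x) {
    assert(0 <= k && k <= n && lda >= n);
    for (int j = 0; j < k; ++j) {        // columns of the rectangle
        const float xj = x[j];           // already-solved entry of x
        for (int i = k; i < n; ++i) {    // rows of the rectangle
            x[i] -= L[i + j * lda] * xj; // unit-stride walk down column j
        }
    }
}
```

<p>Only <code>x[k:n]</code> is written, and only the solved entries <code>x[0:k]</code> are read on
the right-hand side, which is what makes the update a single matrix-vector product.</p>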
<p>Very helpfully, the BLAS contains an operation, <code>sgemv</code>, that does exactly this.
So the lazy way to implement the recursive algorithm is to reduce it to <code>sgemv</code>
like this:</p>
<div class="highlight"><pre><span></span><code><span class="kt">void</span><span class="w"> </span><span class="nf">solve_dnc</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">L</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">lda</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w">    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">BASE_CASE_LIMIT</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1">// Naive algorithm.</span>
<span class="w">        </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1">// GCC happens to generate better code</span>
<span class="w">            </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1">// for this loop order. Don't know why.</span>
<span class="w">                </span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">lda</span><span class="p">];</span>
<span class="w">            </span><span class="p">}</span>
<span class="w">        </span><span class="p">}</span>
<span class="w">    </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span>
<span class="w">        </span><span class="kt">int</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="mi">2</span><span class="p">;</span><span class="w"> </span><span class="c1">// WOAH - this one line determines the algorithm</span>
<span class="w">        </span><span class="c1">// Upper triangle -- reads L(:k,:k), x(:k); writes x(:k)</span>
<span class="w">        </span><span class="n">solve_dnc</span><span class="p">(</span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">L</span><span class="p">,</span><span class="w"> </span><span class="n">lda</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">);</span>
<span class="w">        </span><span class="c1">// Rectangle -- reads L(k:,:k), x(:k); writes x(k:)</span>
<span class="w">        </span><span class="n">cblas_sgemv</span><span class="p">(</span><span class="n">CblasColMajor</span><span class="p">,</span><span class="w"> </span><span class="n">CblasNoTrans</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="mf">-1.f</span><span class="p">,</span><span class="w"> </span><span class="n">L</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">lda</span><span class="p">,</span>
<span class="w">                    </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mf">1.f</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span>
<span class="w">        </span><span class="c1">// Right triangle -- reads L(k:,k:), x(k:); writes x(k:)</span>
<span class="w">        </span><span class="n">solve_dnc</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">L</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">lda</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="p">);</span>
<span class="w">    </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
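<p>This version takes an extra parameter, <code>lda</code> (the leading dimension), which sets
the distance in memory between consecutive columns independently of the logical
dimension, so each recursive call can address its sub-block in place. The neat thing about this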
characterization is that it lets us explore the space of optimizations entirely
by varying the function that calculates <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span></span></span></span>. In this case, we chose a recursive
approach, but we could also set it to, say, <code>BASE_CASE_LIMIT</code> to proceed in
blocks of columns (spoiler alert!), or to <code>n - BASE_CASE_LIMIT</code> to proceed in
blocks of rows. Various hybrid approaches could be built on top of this, too,
all by varying that one line of code.</p>
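<p>To make the "one line" claim concrete, here is a BLAS-free sketch in which the split point
is passed in as a function. The <code>SplitFn</code> type, the three <code>split_*</code>
helpers, and the value chosen for <code>BASE_CASE_LIMIT</code> are assumptions for illustration,
and the rectangle update is written as plain loops standing in for <code>sgemv</code>.</p>

```cpp
#include <cassert>
#include <vector>

constexpr int BASE_CASE_LIMIT = 8;  // assumed value, for illustration only

using SplitFn = int (*)(int n);     // hypothetical: maps n to the split point k

int split_half(int n) { return n / 2; }                // divide and conquer
int split_cols(int) { return BASE_CASE_LIMIT; }        // blocks of columns
int split_rows(int n) { return n - BASE_CASE_LIMIT; }  // blocks of rows

// Unit lower-triangular solve, column-major, in place on x.
void solve_split(int n, const float *L, int lda, float *x, SplitFn split) {
    if (n <= BASE_CASE_LIMIT) {  // naive base case
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < i; ++j)
                x[i] -= x[j] * L[i + j * lda];
        return;
    }
    const int k = split(n);  // the one line that picks the algorithm
    assert(0 < k && k < n);
    solve_split(k, L, lda, x, split);  // top triangle
    for (int j = 0; j < k; ++j)        // rectangle (stand-in for sgemv)
        for (int i = k; i < n; ++i)
            x[i] -= L[i + j * lda] * x[j];
    solve_split(n - k, L + lda * k + k, lda, x + k, split);  // remaining triangle
}
```

<p>All three split functions visit the same entries of the matrix and differ only in the
order of the updates, so they should agree with the naive solver up to floating-point
rounding.</p>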
<p>There are some clear disadvantages here. This isn't tail recursive, so it will
take some extra stack space and cost some function call overhead. The compiler
also can't inline <code>sgemv</code> since it's squirreled away in a shared library and is
very, very proprietary (so no-go on LTO). Still, this exercise has clearly
exposed our best vectorization opportunity. It would be very difficult to
vectorize a small triangle, but maybe we can get away with only doing <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><mi>n</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord mathdefault">n</span><span class="mclose">)</span></span></span></span> work in
serial triangles and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^2)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> work in easier-to-vectorize rectangles.</p>
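<p>A quick count shows how lopsided that split can be. The sketch below assumes a fixed block
width <code>b</code> that divides <code>n</code> and a unit diagonal, and counts multiply-subtract
pairs; the <code>WorkSplit</code> struct and <code>count_work</code> function are hypothetical
names.</p>

```cpp
#include <cassert>
#include <cstdint>

struct WorkSplit {
    std::int64_t triangle;   // multiply-subtract pairs inside diagonal triangles
    std::int64_t rectangle;  // multiply-subtract pairs inside the rectangles
};

// Assumes n is a multiple of the block width b and a unit diagonal.
WorkSplit count_work(std::int64_t n, std::int64_t b) {
    assert(b > 0 && n % b == 0);
    const std::int64_t total = n * (n - 1) / 2;  // strictly-lower entries
    const std::int64_t triangle = (n / b) * (b * (b - 1) / 2);
    return {triangle, total - triangle};
}
```

<p>With an illustrative block width of 8 and n = 4096, this gives 14,336 pairs in the triangles
against 8,372,224 in the rectangles, so well under 1% of the work sits in the part that is
hard to vectorize.</p>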
<p>Since I know you're curious, this is how well the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi><mi mathvariant="normal">/</mi><mn>2</mn></mrow><annotation encoding="application/x-tex">n/2</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault">n</span><span class="mord">/</span><span class="mord">2</span></span></span></span> divide and conquer
approach works.</p>
<p><img src="/images/blog/strsv-gha-dnc.png"
alt="strsv performance of all solvers on GitHub Actions"
class="m-auto block"
width="640"
height="480"></p>
<p>It's surprisingly in the ballpark when using OpenBLAS's <code>sgemv</code>. What's
interesting is that for at least one matrix size, it ever-so-slightly edges out
MKL, despite being built from OpenBLAS. This could be a fluke, but I bet there's
an even better optimization than any in this article that we just haven't found
yet.</p>
<h2>Lower-level optimization</h2>
<p>I played around with the divide-and-conquer approach for a bit and settled on a
split "function" of simply <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi><mo>=</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">k=8</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">8</span></span></span></span>. That corresponds to looping over 8-wide block
columns of the matrix, solving the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>8</mn><mo>×</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">8\times 8</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord">8</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">8</span></span></span></span> triangle at the top and then the
whole rectangle beneath it. It seemed to perform best on my workstation, and so
I set out to "inline" everything and get it cleaned up. Here it is, chunk by
chunk.</p>
<p><em>Note, for simplicity, I'm specializing this code to multiple-of-8 matrix sizes.
Extending it to other matrix sizes only requires dealing with a small leftover
rectangle at the bottom of each block column. It's just another code path, and
the same basic strategies apply. It's a good exercise, but too much for a blog
post. Also, as you'll see, the resulting code is so much faster that MKL could
get a boost just by testing the matrix size and then dispatching to this solver
if it fits. That one branch up front would cost next to nothing.</em></p>
<p>First, we'll declare the function and start looping:</p>
<div class="highlight"><pre><span></span><code><span class="kt">void</span><span class="w"> </span><span class="nf">update_blocked</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">L</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">lda</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="nb">true</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
</code></pre></div>
<p>So why are we using an infinite loop here rather than a for loop over the block
columns? Well, remember that we're going to solve a triangle, then a rectangle,
then a triangle, and so on until we hit the rightmost triangle, which <em>has no
rectangle</em> underneath it. So we want to exit the loop right away without testing
the conditions for the would-be for loop or for the rectangle code again. Here's
the code for the triangle and early stopping:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c1">// Handle triangle at top of block column</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">j</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="c1">// Last iteration doesn't have a rectangle</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
</code></pre></div>
<p>At this point, we have solved the first 8 values of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span>. We subtract 8 from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>n</mi></mrow><annotation encoding="application/x-tex">n</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">n</span></span></span></span>
right away since the following code operates on the shorter rectangle. Now we're
going to take those 8 values we just computed and <em>broadcast</em> them into 8 vector
registers. We first create a typedef to use GCC's vector types feature,</p>
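<div class="highlight"><pre><span></span><code><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;immintrin.h&gt;</span><span class="w"> </span><span class="c1">// declares the _mm256_* intrinsics used below</span>
<span class="c1">// Vector of 8 single-precision floats</span>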
<span class="k">typedef</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">v8sf</span><span class="w"> </span><span class="nf">__attribute__</span><span class="p">((</span><span class="n">vector_size</span><span class="p">(</span><span class="mi">32</span><span class="p">),</span><span class="w"> </span><span class="n">aligned</span><span class="p">(</span><span class="mi">1</span><span class="p">)));</span>
</code></pre></div>
<p>and then create an array of these with the broadcast values:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="n">v8sf</span><span class="w"> </span><span class="n">x_solved</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">x_solved</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">_mm256_broadcast_ss</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="w"> </span><span class="p">}</span>
</code></pre></div>
<p>Because we're using GCC's vector types and its own intrinsics, GCC is smart
enough to compile this loop into exactly 8 instructions that load the broadcast
values into registers. So there's no overhead from the loop or from the array.
We load these values into registers now because they're involved in every
computation in the rectangle, and we don't want to reload them from memory each
time. We broadcast them so that each one can be multiplied lane-wise against an
8-element slice of a column in the block column. For example, we can take a
vector from the first column in the block, multiply it by <code>x_solved[0]</code>, and
then subtract it from the corresponding portion of <code>x</code>.</p>
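<p>Here is a minimal sketch of that single broadcast-multiply-subtract step in isolation. It is not the code from the solver: it assumes GCC or Clang, uses plain vector extensions and <code>memcpy</code> in place of the AVX intrinsics so it builds without <code>-mavx</code>, and the helper name <code>axpy8</code> is made up for illustration.</p>

```cpp
#include <cassert>
#include <cstring>

// Vector of 8 floats via GCC/Clang vector extensions (assumption: not MSVC).
typedef float v8sf __attribute__((vector_size(32)));

// Hypothetical helper: x[0..8) -= s * col[0..8), as one lane-wise operation.
static inline void axpy8(float s, const float *col, float *x) {
  assert(col != nullptr && x != nullptr);
  v8sf s_bcast = {s, s, s, s, s, s, s, s};  // broadcast s into all 8 lanes
  v8sf c, v;
  std::memcpy(&c, col, sizeof c);  // unaligned load of 8 column entries
  std::memcpy(&v, x, sizeof v);    // unaligned load of 8 entries of x
  v -= s_bcast * c;                // lane-wise multiply and subtract
  std::memcpy(x, &v, sizeof v);    // unaligned store back to x
}
```

<p>In the real solver, <code>s</code> plays the role of one already-solved entry of <code>x</code>, and <code>col</code> points at 8 consecutive rows of one column of the rectangle.</p>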
<p>To set this up, we'll advance <code>L</code> to point to the top of the rectangle and
advance <code>x</code> to point to the first unsolved portion and then enter the loop:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="n">L</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span>
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">n</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">8</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
</code></pre></div>
<p>The first order of business is to load a vector's worth of the unsolved chunk of
<code>x</code>. We have to do an unaligned load (<code>loadu</code>) because alignment wasn't in our
assumptions and because aligning it would take too long (remember, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>n</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(n^2)</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> on
both operations and memory).</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="n">v8sf</span><span class="w"> </span><span class="n">x_i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">_mm256_loadu_ps</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
</code></pre></div>
<p>Then we'll load an <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>8</mn><mo>×</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">8 \times 8</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord">8</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">8</span></span></span></span> patch of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span> into vectors using the same trick
as above.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="n">v8sf</span><span class="w"> </span><span class="n">L_patch</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">L_patch</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">_mm256_loadu_ps</span><span class="p">(</span><span class="o">&</span><span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">j</span><span class="p">]);</span>
<span class="w"> </span><span class="p">}</span>
</code></pre></div>
<p>Finally, we update the unsolved vector using that patch of values from the
matrix. We write the vector back to <code>x</code> and advance <code>L</code> to the tip of the next
triangle, ready to repeat the process.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">x_i</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">x_solved</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">L_patch</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="n">_mm256_storeu_ps</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">x_i</span><span class="p">);</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// for i</span>
<span class="w"> </span><span class="n">L</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">lda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1">// while true</span>
<span class="p">}</span>
</code></pre></div>
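<p>If you want to convince yourself that the blocked structure computes the same thing as the plain double loop, a self-check along these lines may help. It is a sketch under a few assumptions: the names <code>update_naive</code>, <code>update_blocked_portable</code>, and <code>max_abs_diff</code> are hypothetical, the matrix is column-major with no division by the diagonal (as in the triangle code above), and the vector code uses GCC/Clang vector extensions with <code>memcpy</code> rather than the intrinsics, so it won't produce the same assembly.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <vector>

typedef float v8sf __attribute__((vector_size(32)));

// Reference: the plain column-by-column update on an n-by-n lower triangle.
static void update_naive(int n, const float *L, int lda, float *x) {
  for (int j = 0; j < n; ++j)
    for (int i = j + 1; i < n; ++i)
      x[i] -= x[j] * L[i + lda * j];
}

// Same loop structure as the blocked solver; n must be a multiple of 8.
static void update_blocked_portable(int n, const float *L, int lda, float *x) {
  assert(n > 0 && n % 8 == 0);
  while (true) {
    update_naive(8, L, lda, x);  // 8x8 triangle at the top of the block column
    n -= 8;
    if (n <= 0) return;          // last block column has no rectangle
    v8sf x_solved[8];
    for (int j = 0; j < 8; ++j) {
      float s = x[j];
      v8sf b = {s, s, s, s, s, s, s, s};  // broadcast one solved value
      x_solved[j] = b;
    }
    L += 8;  // top of the rectangle
    x += 8;  // first unsolved entry
    for (int i = 0; i < n; i += 8) {
      v8sf x_i;
      std::memcpy(&x_i, &x[i], sizeof x_i);  // unaligned load
      for (int j = 0; j < 8; ++j) {
        v8sf col;
        std::memcpy(&col, &L[i + lda * j], sizeof col);
        x_i -= x_solved[j] * col;
      }
      std::memcpy(&x[i], &x_i, sizeof x_i);  // unaligned store
    }
    L += lda * 8;  // tip of the next triangle
  }
}

// Largest absolute difference between the two solvers on a made-up n-by-n input.
static float max_abs_diff(int n) {
  std::vector<float> L(n * n), a(n), b(n);
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
      L[i + n * j] = 0.01f * float((i * 7 + j * 3) % 5 - 2);
  for (int i = 0; i < n; ++i) a[i] = b[i] = 1.0f + 0.5f * float(i % 4);
  update_naive(n, L.data(), n, a.data());
  update_blocked_portable(n, L.data(), n, b.data());
  float worst = 0.0f;
  for (int i = 0; i < n; ++i) worst = std::fmax(worst, std::fabs(a[i] - b[i]));
  return worst;
}
```

<p>Both versions subtract the same products from each entry in the same order, so any difference should be at the level of floating-point rounding (for example, if the compiler fuses multiply-adds in one version but not the other).</p>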
<p>The assembly generated for the rectangle loop is as short as can be: just
thirteen instructions, almost all vectorized. You can see the full assembly on
Godbolt: <a href="https://godbolt.org/z/YGWfoz9fs">https://godbolt.org/z/YGWfoz9fs</a>.</p>
<div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><code><span class="nl">.L5:</span><span class="w"> </span><span class="nf">vmovups</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">r8</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd213ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm8</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">rsi</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm7</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">r15</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm6</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">r14</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm5</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">r13</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm4</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">r12</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm3</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">rbx</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm2</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">rdi</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vfnmadd231ps</span><span class="w"> </span><span class="no">ymm0</span><span class="p">,</span><span class="w"> </span><span class="no">ymm1</span><span class="p">,</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">rcx</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="nf">vmovups</span><span class="w"> </span><span class="no">YMMWORD</span><span class="w"> </span><span class="no">PTR</span><span class="w"> </span><span class="p">[</span><span class="no">rsi</span><span class="err">+</span><span class="mi">32</span><span class="err">+</span><span class="no">rax</span><span class="p">*</span><span class="mi">4</span><span class="p">],</span><span class="w"> </span><span class="no">ymm0</span>
<span class="w"> </span><span class="nf">add</span><span class="w"> </span><span class="no">rax</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span>
<span class="w"> </span><span class="nf">cmp</span><span class="w"> </span><span class="no">r9d</span><span class="p">,</span><span class="w"> </span><span class="no">eax</span>
<span class="w"> </span><span class="nf">jg</span><span class="w"> </span><span class="no">.L5</span>
</code></pre></div></td></tr></table></div>
<p>The beauty of this is how it minimizes memory traffic. We're streaming memory in
from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi></mrow><annotation encoding="application/x-tex">L</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault">L</span></span></span></span> exactly the one time we need it, as part of the instruction that needs
it. In the assembly above, <code>ymm0</code> stores the unsolved vector from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathdefault">x</span></span></span></span>, while
<code>ymm1-8</code> store the broadcast solved values.</p>
<p>The code for the triangle is messy and mostly scalar, but I stopped trying to
optimize once I saw this:</p>
<p><img src="/images/blog/strsv-gha-all.png"
alt="strsv performance of all solvers on GitHub Actions"
class="m-auto block"
width="640"
height="480"></p>
<p>At least on GitHub Actions, this blocked solver is <em>never</em> slower than MKL. At
peak, it's nearly twice as fast as the naive solver and roughly 50% faster than
MKL. This is why I said earlier that I didn't want to bother with
non-multiple-of-8 sizes: the cost of dispatching on the matrix size would be
totally lost in the gap.</p>
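<p>For concreteness, the kind of up-front dispatch I have in mind could look something like the sketch below. Everything here is hypothetical (the <code>solver_fn</code> type, <code>use_blocked</code>, and <code>solve_dispatch</code> are names I made up); it says nothing about how MKL is actually structured.</p>

```cpp
#include <cassert>

// Hypothetical signature shared by the fast path and the general fallback.
typedef void (*solver_fn)(int n, const float *L, int lda, float *x);

// The blocked solver only handles positive multiples of 8.
static bool use_blocked(int n) { return n > 0 && n % 8 == 0; }

// Hypothetical dispatcher: one cheap test up front, then hand off.
static void solve_dispatch(int n, const float *L, int lda, float *x,
                           solver_fn blocked, solver_fn general) {
  assert(n >= 0 && lda >= n);
  if (use_blocked(n)) {
    blocked(n, L, lda, x);  // fast path for multiple-of-8 sizes
  } else {
    general(n, L, lda, x);  // every other size
  }
}
```

<p>The test is a single integer remainder and a branch, which is negligible next to the quadratic amount of work in the solve itself.</p>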
<h2>Conclusion</h2>
<p>The triangular solver routine must not get a lot of love in BLAS
implementations. Judging by the performance of my divide-and-conquer solver, I
wouldn't be surprised if MKL and OpenBLAS were just using (an inlined version
of) their own <code>sgemv</code> routines without giving this one any special attention.
Still, the results-to-effort ratio here is pretty striking.</p>
<p>It would be an interesting exercise to build a full-strength solver that handles
all matrix sizes, row-major layouts, double precision, etc., but that's too much
for one blog post (and too much for my purposes of understanding the design
space of this algorithm better).</p>
<h1>CMake IS a Build System</h1>
<p><em>Alex Reinking, 2021-03-13</em></p>
<p>One of the most common things you'll hear when learning CMake is that "CMake is
not a build system". This is <em>technically</em> correct, depending on one's
definition of a "build system". However, this statement alone is meaningless on
a practical level as it doesn't communicate anything actionable regarding how to
approach CMake. It just invites semantics games. The slightly clickbait headline
aside, my goal in this article is to unpack what CMake really <em>is</em> in a way you
can hopefully use to understand CMake better.</p>
<style>
.embed-container {
position: relative;
padding-bottom: 56.25%;
height: 0;
overflow: hidden;
max-width: 100%;
}
.embed-container iframe, .embed-container object, .embed-container embed {
position: absolute;
top: 0;
left: 0;
width: 100%;
height: 100%;
}
</style>
<div style='max-width: 560px; max-height: 315px; margin: auto;'>
<div class='embed-container'>
<iframe src="https://www.youtube-nocookie.com/embed/hou0lU8WMgo" frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen>
</iframe>
</div>
</div>
<p>Still, I do understand why people like saying this: technically correct <em>is</em> the
best kind of correct, after all.</p>
<h2>What is a build system?</h2>
<p>To even have this discussion, we'll have to pin down a definition of what a
build system is. Let's ask Jeff Atwood, co-creator of the venerable
Stack Exchange:</p>
<blockquote>
<p>The value of a build script is manifold. Once you have a build script
together, you've <strong>created a form of living documentation</strong>: here's how you
build this crazy thing. And naturally this artifact is checked into source
control, right alongside the files necessary to build it (and even the
database necessary to run it, too). From there, you can begin to think about
having that script run on a neutral build server to avoid the "Works On My
Machine" syndrome. [...]</p>
</blockquote>
<p>This is from his blog post "<a href="https://blog.codinghorror.com/the-f5-key-is-not-a-build-process/">The F5 Key Is Not a Build Process</a>". This was
written a while ago, in 2007, somewhat before CMake became wildly popular. It
was also written in the context of C#, which is more tolerant of "just click
'build' in Visual Studio" workflows than C++, which isn't managed.</p>
<p>Still, it touches on a very important point, which is that a build system serves
as a source of truth for how to build your software. If that's the essence of
what a build system is, then CMake fits the bill.</p>
<p>Maybe you don't believe Jeff. After all, he says "build process" rather than
"build system", so maybe he's talking about something else. Let's ask academia.
The 2018 paper, <a href="https://www.microsoft.com/en-us/research/uploads/prod/2018/03/build-systems.pdf">"Build Systems à la Carte"</a> by Andrey Mokhov, Neil
Mitchell, and Simon Peyton Jones (of Haskell fame), gives a rigorous definition:</p>
<blockquote>
<p><strong>Keys, values, and the store.</strong> The goal of any build system is to bring up
to date a store that implements a mapping from keys to values. In software
build systems the store is the file system, the keys are file names, and the
values are file contents. [...]
<br>
<strong>Task description.</strong> Any build system requires the user to specify how to
compute the new value for one key, using the (up to date) values of its
dependencies. We call this specification the task description. For example,
[...] in Make the rules in the makefile are the task description.
<br>
<strong>Build system.</strong> A build system takes a task description, a <em>target</em> key, and
a store, and returns a new store in which the target key and all its
dependencies have an up to date value.</p>
</blockquote>
<p>According to this definition, <em>technically</em>, CMake is not a build system because
it isn't responsible for running your tasks, so it can't bring the store "up to
date". It does, however, have a full task description language that assumes
dependencies on files and their timestamps.</p>
<p>On the other hand, this <em>is</em> a build system:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><code><span class="ch">#!/bin/bash -e</span>
<span class="nb">cd</span><span class="w"> </span><span class="s2">"</span><span class="k">$(</span>dirname<span class="w"> </span><span class="s2">"</span><span class="si">${</span><span class="nv">BASH_SOURCE</span><span class="p">[0]</span><span class="si">}</span><span class="s2">"</span><span class="k">)</span><span class="s2">"</span>
cmake<span class="w"> </span>-G<span class="w"> </span>Ninja<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>_build<span class="w"> </span><span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
cmake<span class="w"> </span>--build<span class="w"> </span>_build
</code></pre></div></td></tr></table></div>
<p>The keys, values, and store are the same as they are for every conventional build
system: the file-system contents. The task description is now firmly the
CMakeLists.txt and the build system is this script. The fact that it calls Ninja
is an implementation detail. This is <em>also</em> technically correct.</p>
<p>Mokhov et al. is a fascinating paper, and you should absolutely read it (did
you know that Microsoft Excel is a build system?); but the purpose of their
research is to taxonomize the ways various build systems model tasks and
dependencies, and then carry out execution plans over those dependencies. It's
not about pragmatic questions concerning the software lifecycle, but about the
design space of certain tools that serve a particular purpose therein.</p>
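<p>To make the paper's three definitions a bit more tangible, here is a toy model of them. This is my own sketch, not code from the paper (which uses Haskell): the type names and the <code>build</code> function are assumptions, values are plain integers rather than file contents, and the build always recomputes everything instead of checking timestamps.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Store: a mapping from keys to values.
using Store = std::map<std::string, int>;

// Task description: for one key, its dependencies and how to compute its
// value from their up-to-date values. Keys without a task are inputs.
struct Task {
  std::vector<std::string> deps;
  std::function<int(const std::vector<int> &)> compute;
};
using Tasks = std::map<std::string, Task>;

// Build system: takes a task description, a target key, and a store, and
// returns a store in which the target and its dependencies are up to date.
// Assumes the dependency graph is acyclic and all inputs are in the store.
static Store build(const Tasks &tasks, const std::string &target, Store store) {
  auto it = tasks.find(target);
  if (it == tasks.end()) return store;  // input key: nothing to compute
  std::vector<int> args;
  for (const std::string &dep : it->second.deps) {
    store = build(tasks, dep, std::move(store));  // bring dependency up to date
    args.push_back(store.at(dep));
  }
  store[target] = it->second.compute(args);
  return store;
}
```

<p>With spreadsheet-style keys, a task for <code>B1</code> that adds <code>A1</code> and <code>A2</code> and a task for <code>B2</code> that doubles <code>B1</code> are enough to see the recursion at work: building <code>B2</code> first brings <code>B1</code> up to date.</p>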
<p>The descriptivist definition of "build system" would be much closer to what Jeff
has in mind. When most people think about build systems, they aren't narrowly
constraining themselves to the actual tool that invokes the compiler. For their
purposes, the meta/non-meta distinction doesn't affect how they interact with
CMake.</p>
<h2>A compiler's look at CMake</h2>
<p>So why do people bother to draw this distinction? What do people think is
actually meaningful about CMake being a "meta build system" or a "build system
generator" rather than a plain "build system"? There isn't similar controversy
about the <a href="https://www.gnu.org/software/automake/manual/html_node/GNU-Build-System.html">GNU Build System</a> (i.e. Autotools), and it <em>also</em> has separate
configuration and build invocation steps. Heck, it <em>popularized</em> that process.
Ever see this?</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>./configure<span class="w"> </span><span class="o">&&</span><span class="w"> </span>make<span class="w"> </span><span class="o">&&</span><span class="w"> </span>make<span class="w"> </span>install
</code></pre></div>
<p>Yes, the <em>configure script</em> isn't a build system on its own, but you <em>always</em>
run <code>make</code> afterwards. Autotools and CMake both call themselves build systems.
Are they wrong? Sure, but only <em>technically</em>.</p>
<p>In the most common case, both CMake and Autotools are the <em>single source of
truth</em> for building their respective projects. In order to build such software,
you have to go through CMake (resp. Autotools) first. You get your pick of
execution engine, but it's semantically irrelevant (ideally). There is a bug
either in your CMake code or in CMake itself if you get different results from
one backend versus another.</p>
<p>In 2018, David Chisnall wrote an article in ACM Queue
titled <a href="https://queue.acm.org/detail.cfm?id=3212479">"C Is Not a Low-level Language"</a>. The tagline, "Your computer is not a
fast PDP-11", distills the central point of the article: that thinking about C
programming in tandem with your target architecture is <em>incorrect</em>, because C
targets an <em>abstract machine</em> which has its own semantics that the compiler is
responsible for mapping to the target assembly language. There are some
fascinating pitfalls detailed in the article, like how undefined behavior and
pointer provenance can disable "obvious" optimizations (like loop unswitching),
and delete null checks.</p>
<p>By analogy, <strong>CMake is not Make</strong>. Nor is it Visual Studio, or Ninja, or any of
its many target backends. If the CMake generator is the architecture, then CMake
code is C, the abstract build model it creates is the abstract machine, and
targets with generator expressions are its IR. It is accurate to say that CMake
is a <em>domain-specific language</em> for metaprogramming an <em>abstract build model</em>,
which is <em>assembled</em> into input files for a <em>build execution engine</em>
(<strong>*ahem*</strong> build system) of your choice.</p>
<p>When you search for "CMake is not a build system", this reminder appears in a
few different contexts. Sometimes it's cited as an advantage, for example, when
<a href="https://blog.jetbrains.com/clion/2014/09/cmake-vs-the-others-round-1/">JetBrains says</a>:</p>
<blockquote>
<p>Yet another benefit, is that CMake is not a build system in the general
meaning and doesn't lock its users on one particular build system: users are
free to use make/Ninja/etc to actually build the products; and that's a huge
advantage since neither build tool is suitable in all situations.</p>
</blockquote>
<p>Other times, it shows up to explain why something doesn't work how you'd expect
in CMake. <a href="https://gist.github.com/mbinna/c61dbb39bca0e4fb7d1f73b0d66a4fd1#dont-use-fileglob-in-projects">Several</a> <a href="https://www.cleanqt.io/blog/cmake-it-modern-using-c%2B%2B-and-qt,-part-1#warning-about-file-glob-">other</a> <a href="https://mjmorse.com/blog/cmake-template/#source-and-include-files-srccmakeliststxt-and-includecmakeliststxt">blog</a> <a href="https://github.com/cmu-db/noisepage/blob/master/CMakeLists.txt#L14-L32">posts</a> make this
claim to explain why globbing for sources is discouraged in CMake.</p>
<p>In these cases, I think it's much more <em>useful</em> to be precise and say "CMake is
not Make" as shorthand for the full truth: CMake's abstract build model must
trade off between being a leaky abstraction and constraining itself to the least
common denominator among its targets. Just because you can glob for sources in
GNU Make doesn't mean that it's appropriate to do in CMake (I could write a
whole article on just this point; maybe I will). The reason for this isn't
that "CMake is not a build system"; it's that globbing happens during
metaprogramming and doesn't make it into the final program (with 3.12+ there's
<code>CONFIGURE_DEPENDS</code>, but it's <a href="https://github.com/ninja-build/ninja/pull/1834">unreliable</a>, and the <a href="https://cmake.org/cmake/help/latest/command/file.html#glob">devs still
discourage it</a>).</p>
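<p>As a rough illustration of that last point, here is a minimal sketch of the two
approaches. The target names <code>app_globbed</code> and <code>app_listed</code> and the source file
names are hypothetical.</p>
<div class="highlight"><pre><span></span><code># Discouraged: file(GLOB) runs while CMake configures the project. Only the
# resulting list of files reaches the generated build, so a source file added
# later goes unnoticed until CMake is re-run by hand.
file(GLOB globbed_sources "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp")
add_executable(app_globbed ${globbed_sources})

# Preferred: adding a file means editing CMakeLists.txt, which the generated
# build already tracks, so the re-configure happens automatically.
add_executable(app_listed src/main.cpp src/util.cpp)
</code></pre></div>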
<p>There are certainly deficiencies in CMake's abstract model and (especially) its
metaprogramming / scripting language. I think it's more productive to talk about
those things in clear language than it is to wave your hands and say "CMake is
not a build system".</p>Building a Dual Shared and Static Library with CMake2021-03-06T21:37:00-08:002021-03-06T21:37:00-08:00Alex Reinkingtag:alexreinking.com,2021-03-06:/blog/building-a-dual-shared-and-static-library-with-cmake.html<style>
.highlight code {
display: inline-block;
min-width: 100%;
}
.highlight code .go {
display: inline-block;
width: 100%;
}
</style>
<p>When packaging software libraries, it is a common requirement to deploy both a
static and a shared version. However, CMake library targets are always either
one <em>or</em> the other. How do we make it easy for our users to choose which one
they want to link to, and <em>why</em> is this difficult to begin with?</p>
<p>In this article we're going to design a CMake build and <code>find_package</code> script
that enables library users to easily choose and switch between the two library
types. This also serves as a basic project template for a modern CMake library
build. The main thing it's missing is handling dependencies.</p>
<p><strong>TLDR:</strong> See this <a href="https://github.com/alexreinking/SharedStaticStarter">GitHub repo</a> with the full code, complete with GitHub
Actions testing.</p>
<h2>Design Philosophy</h2>
<p>So why is it tricky to provide both a static and shared version of a library in
CMake? The core issue is that a CMake library target models the build and usage
requirements for a <em>single</em> library configuration. When you
import <code>SomeLib::SomeLib</code> from a package, the library type is already determined
by the time you link another target to it. On the build side, this means that a
single library <em>target</em> corresponds to a single physical library on the system.</p>
<p>Static and shared libraries are typically produced from the same set of sources,
too, so new CMake users sometimes expect that a single call to <code>add_library</code>
will provide whatever mix of types they want. However, this is fundamentally
incompatible with CMake's model of linking, which admits no properties on the
<em>link itself</em>. It would also make it harder to make independent decisions about
position-independent code. Although most desktop systems (especially Linux)
favor PIC for its security benefits (see: ASLR), many embedded systems with slow
CPUs and strict power budgets either don't want or can't afford the overhead and
prefer to link statically. This often means that static and shared libraries
cannot share object files.</p>
<p>There's also no good guidance inside CMake for solving this problem from
the <code>find_package</code> side. Some modules, like FindCUDAToolkit, use separate
targets for each type. Others, like FindHDF5 and FindOpenSSL, use variables with
no common convention: HDF5 uses <code>HDF5_USE_STATIC_LIBRARIES</code> while OpenSSL
uses <code>OPENSSL_USE_STATIC_LIBS</code>.</p>
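<p>To make the inconsistency concrete, here is a sketch of what a consumer has to
write for these three modules. The variable and target names are the ones each
find module documents; the <code>app</code> target is hypothetical.</p>
<div class="highlight"><pre><span></span><code># FindHDF5 and FindOpenSSL: set a module-specific variable before the call.
set(HDF5_USE_STATIC_LIBRARIES ON)
find_package(HDF5 REQUIRED)

set(OPENSSL_USE_STATIC_LIBS TRUE)
find_package(OpenSSL REQUIRED)

# FindCUDAToolkit: choose the library type by target name instead.
find_package(CUDAToolkit REQUIRED)
add_executable(app main.cpp)
target_link_libraries(app PRIVATE CUDA::cudart_static)  # or CUDA::cudart
</code></pre></div>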
<p>So instead of copying a convention that doesn't exist, we will follow a few
guiding principles while trying to establish a new convention:</p>
<ol>
<li><strong>The build interface should match the install interface.</strong> It is
increasingly common to directly integrate third-party builds with the primary
build using <code>add_subdirectory</code> or <code>FetchContent</code>. The end-user experience
should not change when switching between these options and <code>find_package</code>.</li>
<li><strong>Only strict build requirements belong in CMakeLists.txt.</strong> Anything that
isn't absolutely necessary inevitably becomes an imposition on the end user.
For a common example, if the end user compiles with <code>-Werror</code> and you compile
with <code>-Wall</code>, then their compiler might throw a warning your compiler didn't,
and that warning now breaks their build.
Such settings belong in <a href="https://cmake.org/cmake/help/latest/manual/cmake-toolchains.7.html">toolchain files</a> or <a href="https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html">presets files</a> (CMake 3.19+).</li>
<li><strong>A single project will not mix both shared and static</strong> versions of a
library. Certainly for a single target, it is totally illegal to link to both
at the same time. This means we don't need to support mixing both types in a
single directory.</li>
</ol>
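<p>As a sketch of the second principle, strict warning flags could live in a small
toolchain file rather than in the library's CMakeLists.txt. The file name
<code>warnings.cmake</code> and the GCC/Clang-style flags are assumptions made for
illustration.</p>
<div class="highlight"><pre><span></span><code># warnings.cmake (hypothetical): opt-in developer settings, kept out of
# CMakeLists.txt so that they are never imposed on downstream users.
# CMAKE_CXX_FLAGS_INIT seeds CMAKE_CXX_FLAGS the first time a build tree is
# configured.
set(CMAKE_CXX_FLAGS_INIT "-Wall -Wextra -Werror")
</code></pre></div>
<p>A developer would opt in by pointing <code>CMAKE_TOOLCHAIN_FILE</code> at this file when
configuring; everyone else gets a build with no extra flags.</p>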
<p>The bar for clean CMake code is <em>significantly</em> higher for a library than for an
application because the CMake code itself affects end users. For an application,
some ugliness is tolerable because it doesn't propagate through the dependency
tree (you don't typically link to executables). If an application does not
provide a CMake package or if the package it provides is broken, it is easy
enough to call <code>find_program</code> and have everything you need. On the other hand, a
bad CMake build might require <em>complete</em> replacement by a package maintainer.
This is a surprisingly common scenario in vcpkg and is the ultimate condemnation
of the upstream build. Don't write builds that have to be thrown out like this.</p>
<h2>A Common but Flawed Solution</h2>
<p>On the build side, a common solution is to create one target for each library
type and give them separate names, like so:</p>
<div class="highlight"><pre><span></span><code><span class="nf">set(</span><span class="nb">sources</span><span class="w"> </span><span class="p">...</span><span class="nf">)</span>
<span class="nf">add_library(</span><span class="nb">SomeLib_static</span><span class="w"> </span><span class="no">STATIC</span><span class="w"> </span><span class="o">${</span><span class="nt">sources</span><span class="o">}</span><span class="nf">)</span>
<span class="nf">add_library(</span><span class="nb">SomeLib_shared</span><span class="w"> </span><span class="no">SHARED</span><span class="w"> </span><span class="o">${</span><span class="nt">sources</span><span class="o">}</span><span class="nf">)</span>
</code></pre></div>
<p>Unfortunately, this fails to meet our design criteria.</p>
<p>Most users who invoke the build directly need only one of the two types, so this
approach doubles the compilation time for them. Using an object library doesn't
help since it would force position-independent code on the static library.
Although users who directly include the build may use <code>EXCLUDE_FROM_ALL</code> to
build only what is needed, this is a relatively obscure feature and requires
extra code in the <code>FetchContent</code> case.</p>
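<p>For reference, the object-library variant mentioned above might look like the
following sketch. The target name <code>SomeLib_objects</code> is hypothetical, a single
source file stands in for the real list, and linking an object library with
<code>target_link_libraries</code> assumes CMake 3.12 or newer.</p>
<div class="highlight"><pre><span></span><code># Compile the sources once into an object library.
add_library(SomeLib_objects OBJECT src/random.cpp)

# The shared library needs position-independent objects on most platforms, so
# the flag has to be set here, and the static archive inherits it as well.
set_target_properties(SomeLib_objects PROPERTIES POSITION_INDEPENDENT_CODE ON)

# Both libraries reuse the same object files.
add_library(SomeLib_static STATIC)
add_library(SomeLib_shared SHARED)
target_link_libraries(SomeLib_static PRIVATE SomeLib_objects)
target_link_libraries(SomeLib_shared PRIVATE SomeLib_objects)
</code></pre></div>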
<p>If your package exports just these targets, it forces the user to make an
up-front decision about whether to link statically or dynamically and then
propagates that decision transitively. Often, the decision whether to use static
or dynamic libraries belongs to the package maintainer. For instance, Linux
distributions generally require their packages to not have statically linked
dependencies and prefer libraries to dynamically link to system packages. It has
to be possible to create and install <em>only one</em> of these libraries, without
patching your build or your users' builds.</p>
<p>Robert Schumacher, a lead developer for vcpkg, cautions against this exact
practice in both his <a href="https://youtu.be/sBP17HQAQjk?t=488">CppCon 2018</a>
and <a href="https://youtu.be/_5weX5mx8hc?t=243">CppCon 2019</a> talks.
In <a href="https://youtu.be/Lb3hlLlHTrs?t=1310">another talk</a>, he explains that vcpkg
is sometimes forced to inject code that redirects the static target to the
shared one (or vice versa) when only one was built and installed.</p>
<h2>The Ideal User Experience</h2>
<p>So what should we do? Let's start by examining the <em>ideal</em> user experience for
using our library.</p>
<div class="highlight"><pre><span></span><code><span class="nf">cmake_minimum_required(</span><span class="no">VERSION</span><span class="w"> </span><span class="m">3.19</span><span class="nf">)</span>
<span class="nf">project(</span><span class="nb">example</span><span class="nf">)</span>
<span class="nf">find_package(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">REQUIRED</span><span class="nf">)</span>
<span class="nf">add_executable(</span><span class="nb">main</span><span class="w"> </span><span class="nb">main.cpp</span><span class="nf">)</span>
<span class="nf">target_link_libraries(</span><span class="nb">main</span><span class="w"> </span><span class="no">PRIVATE</span><span class="w"> </span><span class="nb">SomeLib</span><span class="o">::</span><span class="nb">SomeLib</span><span class="nf">)</span>
</code></pre></div>
<p>This looks great, but... there's nothing in there that says
whether <code>SomeLib::SomeLib</code> should be shared or static! How does this solve
anything?</p>
<p>Normally, the user sets <code>SomeLib_ROOT</code> or <code>CMAKE_PREFIX_PATH</code> to a path that
contains exactly one version of <code>SomeLib</code> at configure time. We need to
keep supporting that pattern in our solution, but we <em>also</em> need to support a
distribution that contains both versions. </p>
<p>Our first major insight is this: because the build interface should match the
install interface, <code>SomeLib::SomeLib</code> should respect <code>BUILD_SHARED_LIBS</code>
the same way <code>FetchContent</code> or <code>add_subdirectory</code> would. However, overriding
this (or any) variable for one <code>find_package</code> call is a bit complicated. The
fully correct version—that preserves the existence and values
of <code>BUILD_SHARED_LIBS</code> no matter whether it is a cache or normal
variable—is this:</p>
<div class="highlight"><pre><span></span><code><span class="nf">function(</span><span class="nb">find_somelib</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="no">BUILD_SHARED_LIBS</span><span class="w"> </span><span class="no">YES</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">find_package(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">REQUIRED</span><span class="nf">)</span>
<span class="nf">endfunction()</span>
<span class="nf">find_somelib()</span>
</code></pre></div>
<p>When <code>find_somelib()</code> is called, it creates a new variable scope that is
destroyed when it returns. Thus, the variable environment after the <code>SomeLib</code>
package search succeeds is the same as it was before, so code that cares whether
<code>BUILD_SHARED_LIBS</code> is a normal or cache variable (or defined at all) continues
to work correctly. Saving and restoring the value of <code>BUILD_SHARED_LIBS</code> in the
obvious way requires, first, a temporary variable and, second, a check <em>before</em>
writing to <code>BUILD_SHARED_LIBS</code> whether it was defined to begin with.</p>
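<p>For comparison, the "obvious" save-and-restore approach might look like the
sketch below. The <code>_somelib_*</code> helper variable names are hypothetical. Note that
it still does not distinguish a cache variable from a normal one, which is part
of why the function version is preferable.</p>
<div class="highlight"><pre><span></span><code># Remember whether BUILD_SHARED_LIBS was defined, and its value if so.
if (DEFINED BUILD_SHARED_LIBS)
  set(_somelib_had_bsl YES)
  set(_somelib_saved_bsl "${BUILD_SHARED_LIBS}")
else ()
  set(_somelib_had_bsl NO)
endif ()

set(BUILD_SHARED_LIBS YES)
find_package(SomeLib REQUIRED)

# Restore the previous state. When the variable was defined, this creates a
# normal variable even if the original was only a cache entry.
if (_somelib_had_bsl)
  set(BUILD_SHARED_LIBS "${_somelib_saved_bsl}")
else ()
  unset(BUILD_SHARED_LIBS)
endif ()
unset(_somelib_had_bsl)
unset(_somelib_saved_bsl)
</code></pre></div>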
<p>On the other hand, the function also erases potentially useful variables set by
the package. The <em>targets</em> are tied to the directory scope, so linking
to <code>SomeLib::SomeLib</code> still works. If the package only provides targets, this
is not an issue. If some variables are needed, one could set variables in the
parent scope via <code>set(... PARENT_SCOPE)</code>, but this is awful. </p>
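<p>A sketch of that workaround, assuming the caller needs the package version
(<code>find_package</code> sets <code>SomeLib_VERSION</code> when the package provides a version
file), shows how every such variable has to be forwarded by hand:</p>
<div class="highlight"><pre><span></span><code>function(find_somelib)
  set(BUILD_SHARED_LIBS YES)
  find_package(SomeLib REQUIRED)
  # Each variable the caller needs must be re-exported explicitly.
  set(SomeLib_VERSION "${SomeLib_VERSION}" PARENT_SCOPE)
endfunction()

find_somelib()
message(STATUS "Found SomeLib ${SomeLib_VERSION}")
</code></pre></div>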
<p>Rather than forcing users to create bespoke functions to override a standard
variable, the package will respect a new variable, <code>SomeLib_SHARED_LIBS</code>, that
overrides <code>BUILD_SHARED_LIBS</code>. So now we can specify that we want shared libs
from <code>SomeLib</code> at the command line with <code>-DSomeLib_SHARED_LIBS=YES</code> or we
can enforce it in the <code>CMakeLists.txt</code> by simply setting it.</p>
<div class="highlight"><pre><span></span><code><span class="nf">set(</span><span class="nb">SomeLib_SHARED_LIBS</span><span class="w"> </span><span class="no">YES</span><span class="nf">)</span>
<span class="nf">find_package(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">REQUIRED</span><span class="nf">)</span>
</code></pre></div>
<p>However, <code>BUILD_SHARED_LIBS</code> is supposed to be reserved for the user and not set
by the build. It's no different for <code>SomeLib_SHARED_LIBS</code>; users should expect
this variable to be respected as a configuration point. To enable a user to
truly <em>force</em> <code>SomeLib</code> to be static or shared, we can use <code>find_package</code>'s
components mechanism:</p>
<div class="highlight"><pre><span></span><code><span class="nf">find_package(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">REQUIRED</span><span class="w"> </span><span class="nb">shared</span><span class="nf">)</span><span class="w"> </span><span class="c"># or `static`</span>
</code></pre></div>
<p>It is an error to request <em>both</em> <code>static</code> and <code>shared</code> components. If a single
build needs both, it may separate its targets into two directories and call
<code>find_package</code> with different components in each one. Since imported targets are
not global by default, this works without any intervention on our part.</p>
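<p>A sketch of that layout follows; the directory and target names are
hypothetical. The two directories are siblings, so neither one sees the other's
imported <code>SomeLib::SomeLib</code> target.</p>
<div class="highlight"><pre><span></span><code># static_tools/CMakeLists.txt (hypothetical directory)
find_package(SomeLib REQUIRED static)
add_executable(tool_static main.cpp)
target_link_libraries(tool_static PRIVATE SomeLib::SomeLib)

# shared_tools/CMakeLists.txt (hypothetical directory)
find_package(SomeLib REQUIRED shared)
add_executable(tool_shared main.cpp)
target_link_libraries(tool_shared PRIVATE SomeLib::SomeLib)
</code></pre></div>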
<h2>The Implementation</h2>
<p>So now let's make this work! We're going to implement this around a very simple
library that returns a <a href="https://xkcd.com/221/">random number</a>. Here's the source file:</p>
<div class="highlight"><pre><span></span><code><span class="c1">// src/random.cpp</span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf">"somelib/random.h"</span>
<span class="k">namespace</span><span class="w"> </span><span class="nn">SomeLib</span><span class="w"> </span><span class="p">{</span>
<span class="c1">// Thanks to XKCD 221 for this useful function!</span>
<span class="kt">int</span><span class="w"> </span><span class="nf">getRandomNumber</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">4</span><span class="p">;</span><span class="w"> </span><span class="c1">// chosen by fair dice roll.</span>
<span class="w"> </span><span class="c1">// guaranteed to be random.</span>
<span class="p">}</span>
<span class="p">}</span><span class="w"> </span><span class="c1">// namespace SomeLib</span>
</code></pre></div>
<p>Here's the corresponding header:</p>
<div class="highlight"><pre><span></span><code><span class="c1">// include/somelib/random.h</span>
<span class="cp">#ifndef SOMELIB_RANDOM_H</span>
<span class="cp">#define SOMELIB_RANDOM_H</span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf">"somelib/export.h"</span>
<span class="k">namespace</span><span class="w"> </span><span class="nn">SomeLib</span><span class="w"> </span><span class="p">{</span>
<span class="n">SOMELIB_EXPORT</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">getRandomNumber</span><span class="p">();</span>
<span class="p">}</span>
<span class="cp">#endif </span><span class="c1">//SOMELIB_RANDOM_H</span>
</code></pre></div>
<p><code>export.h</code> is a generated export header that CMake will create for us. It
provides the <code>SOMELIB_EXPORT</code> macro which tells the compiler which symbols to
expose from the shared version of our library.</p>
<h3>Build rules</h3>
<p>Now the start of the build is mostly boilerplate.</p>
<div class="highlight"><pre><span></span><code><span class="nf">cmake_minimum_required(</span><span class="no">VERSION</span><span class="w"> </span><span class="m">3.19</span><span class="nf">)</span>
<span class="nf">project(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">VERSION</span><span class="w"> </span><span class="m">1.0.0</span><span class="nf">)</span>
<span class="nf">if</span> <span class="nf">(</span><span class="no">NOT</span><span class="w"> </span><span class="no">DEFINED</span><span class="w"> </span><span class="no">CMAKE_CXX_VISIBILITY_PRESET</span><span class="w"> </span><span class="no">AND</span>
<span class="w"> </span><span class="no">NOT</span><span class="w"> </span><span class="no">DEFINED</span><span class="w"> </span><span class="no">CMAKE_VISIBILITY_INLINES_HIDDEN</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="no">CMAKE_CXX_VISIBILITY_PRESET</span><span class="w"> </span><span class="nb">hidden</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="no">CMAKE_VISIBILITY_INLINES_HIDDEN</span><span class="w"> </span><span class="no">YES</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>Since CMake doesn't warn you if you use a feature that is too new for the
minimum version, you should always specify the minimum version that you actually
test with.</p>
<p>The next two lines ensure that the shared library version doesn't export
anything unintentionally. MSVC hides symbols by default, whereas GCC and Clang
export everything. Exporting unintended symbols can cause conflicts and ODR
violations as dependencies are added down the line, so libraries should always
make their exports explicit (or at <em>least</em> use a linker script if retrofitting
the code is too much). Still, if the user manually specifies a different
setting, then we respect it.</p>
<p>Next, we'll implement the <code>SomeLib_SHARED_LIBS</code> override for the build interface
that was discussed earlier.</p>
<div class="highlight"><pre><span></span><code><span class="nf">if</span> <span class="nf">(</span><span class="no">DEFINED</span><span class="w"> </span><span class="nb">SomeLib_SHARED_LIBS</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="no">BUILD_SHARED_LIBS</span><span class="w"> </span><span class="s">"${SomeLib_SHARED_LIBS}"</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>Now we can create the library. To keep the build and install interfaces
consistent, we also create an alias <code>SomeLib::SomeLib</code>. The version properties
make sure that namelinks and solinks are created for the shared library. </p>
<div class="highlight"><pre><span></span><code><span class="nf">add_library(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="na">src/random.cpp</span><span class="nf">)</span>
<span class="nf">add_library(</span><span class="nb">SomeLib</span><span class="o">::</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">ALIAS</span><span class="w"> </span><span class="nb">SomeLib</span><span class="nf">)</span>
<span class="nf">set_target_properties(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">PROPERTIES</span>
<span class="w"> </span><span class="no">VERSION</span><span class="w"> </span><span class="o">${</span><span class="nt">SomeLib_VERSION</span><span class="o">}</span>
<span class="w"> </span><span class="no">SOVERSION</span><span class="w"> </span><span class="o">${</span><span class="nt">SomeLib_VERSION_MAJOR</span><span class="o">}</span><span class="nf">)</span>
<span class="nf">target_include_directories(</span>
<span class="w"> </span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="s">"$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>"</span><span class="nf">)</span>
<span class="nf">target_compile_features(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="nb">cxx_std_17</span><span class="nf">)</span>
</code></pre></div>
<p>This assumes that we are using <a href="https://semver.org">semantic versioning</a> for the joint package and
library version. Next we'll create the export header we saw earlier and attach
it to the target. The <code>GenerateExportHeader</code> module assumes it's acting on a
shared library, so we have to manually add <code>SOMELIB_STATIC_DEFINE</code> to the static
build to avoid linker errors arising from DLL-import directives on Windows.</p>
<div class="highlight"><pre><span></span><code><span class="nf">include(</span><span class="nb">GenerateExportHeader</span><span class="nf">)</span>
<span class="nf">generate_export_header(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">EXPORT_FILE_NAME</span><span class="w"> </span><span class="na">include/somelib/export.h</span><span class="nf">)</span>
<span class="nf">target_compile_definitions(</span>
<span class="w"> </span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="s">"$<$<NOT:$<BOOL:${BUILD_SHARED_LIBS}>>:SOMELIB_STATIC_DEFINE>"</span><span class="nf">)</span>
<span class="nf">target_include_directories(</span>
<span class="w"> </span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">PUBLIC</span><span class="w"> </span><span class="s">"$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/include>"</span><span class="nf">)</span>
</code></pre></div>
<p>It would be very nice if <code>generate_export_header</code> set up the definitions and
include paths automatically. This is the kind of busy-work that gives CMake a
bad rap.</p>
<p>Finally, we'll add some packaging logic, but include it by default <em>only if</em>
we're the top-level project. That insulates <code>FetchContent</code> users from our
install rules if they don't want them, but keeps them available in case they do:</p>
<div class="highlight"><pre><span></span><code><span class="nf">string(</span><span class="no">COMPARE</span><span class="w"> </span><span class="no">EQUAL</span><span class="w"> </span><span class="s">"${CMAKE_SOURCE_DIR}"</span><span class="w"> </span><span class="s">"${CMAKE_CURRENT_SOURCE_DIR}"</span><span class="w"> </span><span class="nb">is_top_level</span><span class="nf">)</span>
<span class="nf">option(</span><span class="nb">SomeLib_INCLUDE_PACKAGING</span><span class="w"> </span><span class="s">"Include packaging rules for SomeLib"</span><span class="w"> </span><span class="s">"${is_top_level}"</span><span class="nf">)</span>
<span class="nf">if</span> <span class="nf">(</span><span class="nb">SomeLib_INCLUDE_PACKAGING</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">add_subdirectory(</span><span class="nb">packaging</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
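<p>From the consuming side, this might look like the sketch below. It assumes the
repository linked above builds a project named <code>SomeLib</code> from its root, and the
<code>GIT_TAG</code> value is a placeholder branch name rather than a verified one.</p>
<div class="highlight"><pre><span></span><code>include(FetchContent)
FetchContent_Declare(
  SomeLib
  GIT_REPOSITORY https://github.com/alexreinking/SharedStaticStarter.git
  GIT_TAG main)  # assumed branch name; pin a commit hash in practice

# Packaging rules are skipped by default because SomeLib is not the top-level
# project here. Uncomment to opt back in (option() honors normal variables
# under policy CMP0077, which is NEW for a 3.19 minimum version).
# set(SomeLib_INCLUDE_PACKAGING YES)

FetchContent_MakeAvailable(SomeLib)

add_executable(main main.cpp)
target_link_libraries(main PRIVATE SomeLib::SomeLib)
</code></pre></div>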
<h3>Packaging</h3>
<p>Now we'll take a look at what goes into the <code>packaging/CMakeLists.txt</code> file.</p>
<div class="highlight"><pre><span></span><code><span class="nf">include(</span><span class="nb">GNUInstallDirs</span><span class="nf">)</span>
<span class="nf">include(</span><span class="nb">CMakePackageConfigHelpers</span><span class="nf">)</span>
<span class="nf">if</span> <span class="nf">(</span><span class="no">NOT</span><span class="w"> </span><span class="no">DEFINED</span><span class="w"> </span><span class="nb">SomeLib_INSTALL_CMAKEDIR</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">SomeLib_INSTALL_CMAKEDIR</span><span class="w"> </span><span class="s">"${CMAKE_INSTALL_LIBDIR}/cmake/SomeLib"</span>
<span class="w"> </span><span class="no">CACHE</span><span class="w"> </span><span class="no">STRING</span><span class="w"> </span><span class="s">"Path to SomeLib CMake files"</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>The <code>GNUInstallDirs</code> module defines a bunch of variables that control the
default behavior of the <code>install()</code> commands and picks sane defaults for every
supported platform, including Windows. The name is mostly historical and should
probably be changed. We'll use <code>CMakePackageConfigHelpers</code> later to create a
required version compatibility script.</p>
<p>Since various package management systems (like vcpkg, NuGet, APT, etc.) have
different standards for where to place CMake package config scripts, we create a
cache variable, <code>SomeLib_INSTALL_CMAKEDIR</code>, to allow our users to control where
those scripts go. We pick a common, safe default.</p>
<p>Now we'll add the logic to install our libraries and headers:</p>
<div class="highlight"><pre><span></span><code><span class="nf">install(</span><span class="no">TARGETS</span><span class="w"> </span><span class="nb">SomeLib</span><span class="w"> </span><span class="no">EXPORT</span><span class="w"> </span><span class="nb">SomeLib_Targets</span>
<span class="w"> </span><span class="no">RUNTIME</span><span class="w"> </span><span class="no">COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Runtime</span>
<span class="w"> </span><span class="no">LIBRARY</span><span class="w"> </span><span class="no">COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Runtime</span>
<span class="w"> </span><span class="no">NAMELINK_COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Development</span>
<span class="w"> </span><span class="no">ARCHIVE</span><span class="w"> </span><span class="no">COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Development</span>
<span class="w"> </span><span class="no">INCLUDES</span><span class="w"> </span><span class="no">DESTINATION</span><span class="w"> </span><span class="s">"${CMAKE_INSTALL_INCLUDEDIR}"</span><span class="nf">)</span>
<span class="nf">install(</span><span class="no">DIRECTORY</span><span class="w"> </span><span class="s">"${SomeLib_SOURCE_DIR}/include/"</span><span class="w"> </span><span class="s">"${SomeLib_BINARY_DIR}/include/"</span>
<span class="w"> </span><span class="no">TYPE</span><span class="w"> </span><span class="no">INCLUDE</span>
<span class="w"> </span><span class="no">COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Development</span><span class="nf">)</span>
</code></pre></div>
<p>When we install <code>SomeLib</code>, we add it to an export set called <code>SomeLib_Targets</code>.
To support users who wish to package our library in separate runtime and
development components, we create prefixed component names (to avoid clashes
with other projects). We won't dwell on componentized packages here, but if
you've ever noticed that Ubuntu provides separate <code>libfoo</code> and <code>libfoo-dev</code>
packages, that's what this is for. To learn more, watch Craig
Scott's <a href="https://youtu.be/m0DwB4OvDXk">CppCon 2019 talk</a>, "Deep CMake for
Library Authors".</p>
<p>Now we'll export our targets to a file specific to the library type:</p>
<div class="highlight"><pre><span></span><code><span class="nf">if</span> <span class="nf">(</span><span class="no">BUILD_SHARED_LIBS</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">type</span><span class="w"> </span><span class="nb">shared</span><span class="nf">)</span>
<span class="nf">else</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">type</span><span class="w"> </span><span class="nb">static</span><span class="nf">)</span>
<span class="nf">endif</span> <span class="nf">()</span>
<span class="nf">install(</span><span class="no">EXPORT</span><span class="w"> </span><span class="nb">SomeLib_Targets</span>
<span class="w"> </span><span class="no">DESTINATION</span><span class="w"> </span><span class="s">"${SomeLib_INSTALL_CMAKEDIR}"</span>
<span class="w"> </span><span class="no">NAMESPACE</span><span class="w"> </span><span class="nb">SomeLib</span><span class="o">::</span>
<span class="w"> </span><span class="no">FILE</span><span class="w"> </span><span class="nb">SomeLib-</span><span class="o">${</span><span class="nt">type</span><span class="o">}</span><span class="p">-</span><span class="nb">targets.cmake</span>
<span class="w"> </span><span class="no">COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Development</span><span class="nf">)</span>
</code></pre></div>
<p>When the library is built as a shared library, we get
<code>SomeLib-shared-targets.cmake</code> and when it's built as a static library, we get
<code>SomeLib-static-targets.cmake</code>. To turn this into a bona-fide CMake package, we
need two files: <code>SomeLibConfig.cmake</code> and <code>SomeLibConfigVersion.cmake</code>. The
latter is easy to auto-generate since we're using semantic versioning:</p>
<div class="highlight"><pre><span></span><code><span class="nf">write_basic_package_version_file(</span>
<span class="w"> </span><span class="nb">SomeLibConfigVersion.cmake</span>
<span class="w"> </span><span class="no">COMPATIBILITY</span><span class="w"> </span><span class="nb">SameMajorVersion</span><span class="nf">)</span>
</code></pre></div>
<p>The purpose of this file is to support the version number argument
to <code>find_package</code>. It prevents an incompatible package from being loaded when a
version number is specified. The meat of the CMake package is defined in
<code>SomeLibConfig.cmake</code>, but we'll discuss that in just a moment. The last rule
places these two files in the CMake installation directory.</p>
<div class="highlight"><pre><span></span><code><span class="nf">install(</span><span class="no">FILES</span>
<span class="w"> </span><span class="s">"${CMAKE_CURRENT_SOURCE_DIR}/SomeLibConfig.cmake"</span>
<span class="w"> </span><span class="s">"${CMAKE_CURRENT_BINARY_DIR}/SomeLibConfigVersion.cmake"</span>
<span class="w"> </span><span class="no">DESTINATION</span><span class="w"> </span><span class="s">"${SomeLib_INSTALL_CMAKEDIR}"</span>
<span class="w"> </span><span class="no">COMPONENT</span><span class="w"> </span><span class="nb">SomeLib_Development</span><span class="nf">)</span>
</code></pre></div>
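<p>To make the role of the version file a bit more concrete, here is a sketch of
how <code>SameMajorVersion</code> should treat a few requests, assuming the installed
package reports version 1.0.0. Each line is an alternative a consumer might
write, not a script to run top to bottom.</p>
<div class="highlight"><pre><span></span><code># Assumes the installed SomeLib is version 1.0.0.
find_package(SomeLib 1 REQUIRED)    # accepted: same major version, installed is not older
find_package(SomeLib 1.5 REQUIRED)  # rejected: installed 1.0.0 is older than the requested 1.5
find_package(SomeLib 2 REQUIRED)    # rejected: major version differs
</code></pre></div>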
<p>Now we'll see the package config file <code>SomeLibConfig.cmake</code> in all its glory.</p>
<div class="highlight"><pre><span></span><code><span class="nf">cmake_minimum_required(</span><span class="no">VERSION</span><span class="w"> </span><span class="m">3.19</span><span class="nf">)</span>
<span class="nf">set(</span><span class="nb">SomeLib_known_comps</span><span class="w"> </span><span class="nb">static</span><span class="w"> </span><span class="nb">shared</span><span class="nf">)</span>
<span class="nf">set(</span><span class="nb">SomeLib_comp_static</span><span class="w"> </span><span class="no">NO</span><span class="nf">)</span>
<span class="nf">set(</span><span class="nb">SomeLib_comp_shared</span><span class="w"> </span><span class="no">NO</span><span class="nf">)</span>
<span class="nf">foreach</span> <span class="nf">(</span><span class="nb">SomeLib_comp</span><span class="w"> </span><span class="no">IN</span><span class="w"> </span><span class="no">LISTS</span><span class="w"> </span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_FIND_COMPONENTS</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">if</span> <span class="nf">(</span><span class="nb">SomeLib_comp</span><span class="w"> </span><span class="no">IN_LIST</span><span class="w"> </span><span class="nb">SomeLib_known_comps</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="nb">SomeLib_comp_</span><span class="o">${</span><span class="nt">SomeLib_comp</span><span class="o">}</span><span class="w"> </span><span class="no">YES</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">else</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">set(</span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_NOT_FOUND_MESSAGE</span>
<span class="w"> </span><span class="s">"SomeLib does not recognize component `${SomeLib_comp}`."</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_FOUND</span><span class="w"> </span><span class="no">FALSE</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">return()</span>
<span class="w"> </span><span class="nf">endif</span> <span class="nf">()</span>
<span class="nf">endforeach</span> <span class="nf">()</span>
<span class="nf">if</span> <span class="nf">(</span><span class="nb">SomeLib_comp_static</span><span class="w"> </span><span class="no">AND</span><span class="w"> </span><span class="nb">SomeLib_comp_shared</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_NOT_FOUND_MESSAGE</span>
<span class="w"> </span><span class="s">"SomeLib `static` and `shared` components are mutually exclusive."</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_FOUND</span><span class="w"> </span><span class="no">FALSE</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">return()</span>
<span class="nf">endif</span> <span class="nf">()</span>
<span class="nf">set(</span><span class="nb">SomeLib_static_targets</span><span class="w"> </span><span class="s">"${CMAKE_CURRENT_LIST_DIR}/SomeLib-static-targets.cmake"</span><span class="nf">)</span>
<span class="nf">set(</span><span class="nb">SomeLib_shared_targets</span><span class="w"> </span><span class="s">"${CMAKE_CURRENT_LIST_DIR}/SomeLib-shared-targets.cmake"</span><span class="nf">)</span>
<span class="nf">macro(</span><span class="nb">SomeLib_load_targets</span><span class="w"> </span><span class="nb">type</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">if</span> <span class="nf">(</span><span class="no">NOT</span><span class="w"> </span><span class="no">EXISTS</span><span class="w"> </span><span class="s">"${SomeLib_${type}_targets}"</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_NOT_FOUND_MESSAGE</span>
<span class="w"> </span><span class="s">"SomeLib `${type}` libraries were requested but not found."</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">set(</span><span class="o">${</span><span class="nt">CMAKE_FIND_PACKAGE_NAME</span><span class="o">}</span><span class="nb">_FOUND</span><span class="w"> </span><span class="no">FALSE</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">return()</span>
<span class="w"> </span><span class="nf">endif</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">include(</span><span class="s">"${SomeLib_${type}_targets}"</span><span class="nf">)</span>
<span class="nf">endmacro()</span>
<span class="nf">if</span> <span class="nf">(</span><span class="nb">SomeLib_comp_static</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">static</span><span class="nf">)</span>
<span class="nf">elseif</span> <span class="nf">(</span><span class="nb">SomeLib_comp_shared</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">shared</span><span class="nf">)</span>
<span class="nf">elseif</span> <span class="nf">(</span><span class="no">DEFINED</span><span class="w"> </span><span class="nb">SomeLib_SHARED_LIBS</span><span class="w"> </span><span class="no">AND</span><span class="w"> </span><span class="nb">SomeLib_SHARED_LIBS</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">shared</span><span class="nf">)</span>
<span class="nf">elseif</span> <span class="nf">(</span><span class="no">DEFINED</span><span class="w"> </span><span class="nb">SomeLib_SHARED_LIBS</span><span class="w"> </span><span class="no">AND</span><span class="w"> </span><span class="no">NOT</span><span class="w"> </span><span class="nb">SomeLib_SHARED_LIBS</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">static</span><span class="nf">)</span>
<span class="nf">elseif</span> <span class="nf">(</span><span class="no">BUILD_SHARED_LIBS</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">if</span> <span class="nf">(</span><span class="no">EXISTS</span><span class="w"> </span><span class="s">"${SomeLib_shared_targets}"</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">shared</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">else</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">static</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">endif</span> <span class="nf">()</span>
<span class="nf">else</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">if</span> <span class="nf">(</span><span class="no">EXISTS</span><span class="w"> </span><span class="s">"${SomeLib_static_targets}"</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">static</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">else</span> <span class="nf">()</span>
<span class="w"> </span><span class="nf">SomeLib_load_targets(</span><span class="nb">shared</span><span class="nf">)</span>
<span class="w"> </span><span class="nf">endif</span> <span class="nf">()</span>
<span class="nf">endif</span> <span class="nf">()</span>
</code></pre></div>
<p>There are a few confusing things going on here. First, CMake's package search is
case-insensitive, so we need to look at <code>${CMAKE_FIND_PACKAGE_NAME}</code> to know
the <em>exact</em> name the user requested and therefore what CMake named the input
variables to the package file. If only this were normalized to upper-case, we
could write <code>SOMELIB_FIND_COMPONENTS</code> instead of the ugly mess we have, but
alas.</p>
<p>Still, what's actually happening is rather simple. The config file checks the
requested components to see whether the user asked for <code>static</code> or
<code>shared</code>. If both were requested, the package fails and sets an informative
error message. If just one was, it tries to load the corresponding targets file.
If the user supplies an unrecognized component, it fails, too. Otherwise, it
checks <code>SomeLib_SHARED_LIBS</code> and then <code>BUILD_SHARED_LIBS</code>, and defaults to
static if neither is set, which matches common practice.</p>
<p>The package components and <code>SomeLib_SHARED_LIBS</code> variable are considered binding
if set, so the package will fail to be found if the installation does not
contain the requested libraries. However, if only <code>BUILD_SHARED_LIBS</code> is set (or
nothing is set) and only <em>one</em> of the static or shared configurations is
installed, we still load the available library to match existing CMake
practices. For example, if <code>BUILD_SHARED_LIBS</code> is <code>OFF</code> (or not set) and only
the shared libraries are available, then the shared libraries will be loaded.</p>
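<p>From the consumer's side, the rules above might be exercised roughly as
follows. This is only a sketch: the project name <code>consumer</code> and the target
<code>app</code> are placeholders, and the two options are alternatives.</p>
<div class="highlight"><pre><span></span><code>cmake_minimum_required(VERSION 3.19)
project(consumer)  # placeholder project name

# Option 1: request a library type through the package components.
# This is binding: the package is not found if the shared targets file is missing.
find_package(SomeLib 1 REQUIRED COMPONENTS shared)

# Option 2 (instead of option 1): set the package-specific variable first.
# This is binding as well.
# set(SomeLib_SHARED_LIBS YES)
# find_package(SomeLib 1 REQUIRED)

add_executable(app main.cpp)  # placeholder target name
target_link_libraries(app PRIVATE SomeLib::SomeLib)
</code></pre></div>
<p>Either way, the consumer links to the same <code>SomeLib::SomeLib</code> target; only the
targets file that defines it changes.</p>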
<h2>Building the Project</h2>
<p><em><strong>Whew.</strong></em> After all that, you'll be happy to know that actually building this
requires nothing special. Here you go (from the source directory):</p>
<div class="highlight"><pre><span></span><code><span class="err">$</span><span class="w"> </span><span class="nb">cmake</span><span class="w"> </span><span class="p">-</span><span class="no">G</span><span class="w"> </span><span class="nb">Ninja</span><span class="w"> </span><span class="p">-</span><span class="no">S</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">-</span><span class="no">B</span><span class="w"> </span><span class="nb">build-shared</span><span class="w"> </span><span class="p">-</span><span class="no">DBUILD_SHARED_LIBS</span><span class="p">=</span><span class="no">YES</span><span class="w"> </span><span class="p">-</span><span class="no">DCMAKE_BUILD_TYPE</span><span class="p">=</span><span class="nb">Release</span>
<span class="err">$</span><span class="w"> </span><span class="nb">cmake</span><span class="w"> </span><span class="p">-</span><span class="no">G</span><span class="w"> </span><span class="nb">Ninja</span><span class="w"> </span><span class="p">-</span><span class="no">S</span><span class="w"> </span><span class="p">.</span><span class="w"> </span><span class="p">-</span><span class="no">B</span><span class="w"> </span><span class="nb">build-static</span><span class="w"> </span><span class="p">-</span><span class="no">DBUILD_SHARED_LIBS</span><span class="p">=</span><span class="no">NO</span><span class="w"> </span><span class="p">-</span><span class="no">DCMAKE_BUILD_TYPE</span><span class="p">=</span><span class="nb">Release</span>
<span class="err">$</span><span class="w"> </span><span class="nb">cmake</span><span class="w"> </span><span class="p">--</span><span class="nb">build</span><span class="w"> </span><span class="nb">build-shared</span>
<span class="err">$</span><span class="w"> </span><span class="nb">cmake</span><span class="w"> </span><span class="p">--</span><span class="nb">build</span><span class="w"> </span><span class="nb">build-static</span>
<span class="err">$</span><span class="w"> </span><span class="nb">cmake</span><span class="w"> </span><span class="p">--</span><span class="nb">install</span><span class="w"> </span><span class="nb">build-shared</span><span class="w"> </span><span class="p">--</span><span class="nb">prefix</span><span class="w"> </span><span class="nb">_install</span>
<span class="err">$</span><span class="w"> </span><span class="nb">cmake</span><span class="w"> </span><span class="p">--</span><span class="nb">install</span><span class="w"> </span><span class="nb">build-static</span><span class="w"> </span><span class="p">--</span><span class="nb">prefix</span><span class="w"> </span><span class="nb">_install</span>
</code></pre></div>
<p>None of this should be surprising. We build and install both library types in
Release mode to a common prefix. On Windows, we need to be careful that the
static library <code>.lib</code> file does not conflict with the shared library's <code>.lib</code>
import library. We can work around this by
adding <code>-DCMAKE_RELEASE_POSTFIX=_static</code> to the configure step for the static
library. That way we'll get <code>SomeLib_static.lib</code> from the static build and the
usual <code>SomeLib.dll</code> plus <code>SomeLib.lib</code> combination from the shared build.</p>
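<p>Concretely, the static configure step from above would become something like
the following, with the shared configure step left unchanged:</p>
<div class="highlight"><pre><span></span><code>$ cmake -G Ninja -S . -B build-static -DBUILD_SHARED_LIBS=NO -DCMAKE_BUILD_TYPE=Release -DCMAKE_RELEASE_POSTFIX=_static
</code></pre></div>
<p>Since the exported targets file records where the library was actually
installed, consumers should not need to know about the postfix.</p>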
<p>Now we can write a little program that calls it. Here's the test:</p>
<div class="highlight"><pre><span></span><code><span class="c1">// main.cpp</span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;iostream&gt;</span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;somelib/random.h&gt;</span>
<span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">"My very random number is: "</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="n">SomeLib</span><span class="o">::</span><span class="n">getRandomNumber</span><span class="p">()</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>Here's the <code>CMakeLists.txt</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nf">cmake_minimum_required(</span><span class="no">VERSION</span><span class="w"> </span><span class="m">3.19</span><span class="nf">)</span>
<span class="nf">project(</span><span class="nb">example</span><span class="nf">)</span>
<span class="nf">enable_testing()</span>
<span class="nf">find_package(</span><span class="nb">SomeLib</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="no">REQUIRED</span><span class="nf">)</span>
<span class="nf">add_executable(</span><span class="nb">main</span><span class="w"> </span><span class="nb">main.cpp</span><span class="nf">)</span>
<span class="nf">target_link_libraries(</span><span class="nb">main</span><span class="w"> </span><span class="no">PRIVATE</span><span class="w"> </span><span class="nb">SomeLib</span><span class="o">::</span><span class="nb">SomeLib</span><span class="nf">)</span>
<span class="nf">add_test(</span><span class="no">NAME</span><span class="w"> </span><span class="nb">random_is_42</span><span class="w"> </span><span class="no">COMMAND</span><span class="w"> </span><span class="nb">main</span><span class="nf">)</span>
<span class="nf">set_tests_properties(</span><span class="nb">random_is_42</span><span class="w"> </span><span class="no">PROPERTIES</span>
<span class="w"> </span><span class="no">PASS_REGULAR_EXPRESSION</span><span class="w"> </span><span class="s">"is: 42"</span>
<span class="w"> </span><span class="no">ENVIRONMENT</span><span class="w"> </span><span class="s">"PATH=$&lt;TARGET_FILE_DIR:SomeLib::SomeLib&gt;"</span><span class="nf">)</span>
</code></pre></div>
<p>It also includes a little test to make sure that our <em>very</em> random number was
indeed returned. We can build it several ways and verify with <code>ldd</code> (on Linux,
at least) that it was linked correctly.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>cmake<span class="w"> </span>-G<span class="w"> </span>Ninja<span class="w"> </span>-S<span class="w"> </span>.<span class="w"> </span>-B<span class="w"> </span>build<span class="w"> </span>-DCMAKE_PREFIX_PATH<span class="o">=</span>/path/to/_install
<span class="go">-- The C compiler identification is GNU 9.3.0</span>
<span class="go">-- The CXX compiler identification is GNU 9.3.0</span>
<span class="go">-- Detecting C compiler ABI info</span>
<span class="go">-- Detecting C compiler ABI info - done</span>
<span class="go">-- Check for working C compiler: /usr/bin/cc - skipped</span>
<span class="go">-- Detecting C compile features</span>
<span class="go">-- Detecting C compile features - done</span>
<span class="go">-- Detecting CXX compiler ABI info</span>
<span class="go">-- Detecting CXX compiler ABI info - done</span>
<span class="go">-- Check for working CXX compiler: /usr/bin/c++ - skipped</span>
<span class="go">-- Detecting CXX compile features</span>
<span class="go">-- Detecting CXX compile features - done</span>
<span class="go">-- Configuring done</span>
<span class="go">-- Generating done</span>
<span class="go">-- Build files have been written to: /path/to/build</span>
<span class="gp">$ </span>cmake<span class="w"> </span>--build<span class="w"> </span>build
<span class="go">[1/2] /usr/bin/c++ -DSOMELIB_STATIC_DEFINE -isystem /path/to/_install/include ↩</span>
<span class="go"> -MD -MT CMakeFiles/main.dir/main.cpp.o -MF CMakeFiles/main.dir/main.cpp.o.d ↩</span>
<span class="go"> -o CMakeFiles/main.dir/main.cpp.o -c ../main.cpp</span>
<span class="go">[2/2] : && /usr/bin/c++ CMakeFiles/main.dir/main.cpp.o -o main ↩</span>
<span class="go"> /path/to/_install/lib/libSomeLib.a && :</span>
<span class="gp">$ </span>./build/main
<span class="go">My very random number is: 42</span>
<span class="gp">$ </span>ldd<span class="w"> </span>build/main<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>SomeLib
<span class="gp">$ </span>cmake<span class="w"> </span>-B<span class="w"> </span>build<span class="w"> </span>-DBUILD_SHARED_LIBS<span class="o">=</span>YES
<span class="go">-- Configuring done</span>
<span class="go">-- Generating done</span>
<span class="go">-- Build files have been written to: /path/to/build</span>
<span class="gp">$ </span>cmake<span class="w"> </span>--build<span class="w"> </span>build
<span class="go">[1/2] /usr/bin/c++ -isystem /path/to/_install/include -MD -MT ↩</span>
<span class="go"> CMakeFiles/main.dir/main.cpp.o -MF CMakeFiles/main.dir/main.cpp.o.d -o ↩</span>
<span class="go"> CMakeFiles/main.dir/main.cpp.o -c ../main.cpp</span>
<span class="go">[2/2] : && /usr/bin/c++ CMakeFiles/main.dir/main.cpp.o -o main ↩</span>
<span class="go"> -Wl,-rpath,/path/to/_install/lib /path/to/_install/lib/libSomeLib.so.1.0.0 ↩</span>
<span class="go"> && :</span>
<span class="gp">$ </span>./build/main
<span class="go">My very random number is: 42</span>
<span class="gp">$ </span>ldd<span class="w"> </span>build/main<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>SomeLib
<span class="go"> libSomeLib.so.1 => /path/to/libSomeLib.so.1 (0x00007f41880ae000)</span>
</code></pre></div>
<p>The associated <a href="https://github.com/alexreinking/SharedStaticStarter">GitHub repo</a> has a simple GitHub Actions workflow to test the
package.</p>
<h2>Conclusion</h2>
<p>There's a lot that's awkward about CMake, and it's definitely on display here. Even so,
the solution itself is simple, though the implementation has some warts.
Most importantly, the complexity is all placed on the library <em>author</em>, not on
the library <em>user</em>. A lot of this can be set up and forgotten, and the little
pain now is well worth sparing all the downstream users, support staff,
StackOverflow volunteers, and so on a far greater amount of pain.</p>Perceus: Garbage Free Reference Counting with Reuse2021-01-25T12:00:00-08:002021-01-25T12:00:00-08:00Alex Reinkingtag:alexreinking.com,2021-01-25:/papers/perceus-garbage-free-reference-counting-with-reuse.html<p><strong>PLDI 2021 Distinguished Paper</strong></p>
<p><a href="https://www.microsoft.com/en-us/research/uploads/prod/2020/11/perceus-tr-v4.pdf">Link to paper</a></p>
<p>We introduce Perceus, an algorithm for precise reference counting with reuse and
specialization. Starting from a functional core language with explicit
control-flow, Perceus emits precise reference counting instructions such that
programs are garbage-free, where only live references are retained. This enables
further optimizations, like reuse analysis that allows for guaranteed in-place
updates at runtime. This in turn enables a novel programming paradigm that we
call functional but in-place (FBIP). Much like tail-call optimization enables
writing loops with regular function calls, reuse analysis enables writing
in-place mutating algorithms in a purely functional way. We give a novel
formalization of reference counting in a linear resource calculus, and prove
that Perceus is sound and garbage free. We show evidence that Perceus, as
implemented in Koka, has good performance and is competitive with other
state-of-the-art memory collectors.</p>Formal Semantics for the Halide Language2020-05-01T12:00:00-07:002020-05-01T12:00:00-07:00Alex Reinkingtag:alexreinking.com,2020-05-01:/papers/formal-semantics-for-the-halide-language.html<p><a href="https://arxiv.org/abs/2210.15740">Pre-print on arXiv</a></p>
<p>We present the first formalization and metatheory of language soundness for a
user-schedulable language, the widely used array processing language Halide.
User-schedulable languages strike a balance between abstraction and control in
high-performance computing by separating the specification of what a program
should compute from a schedule for how to compute it. In the process, they make
a novel language soundness claim: the result of a program should always be the
same, regardless of how it is scheduled. This soundness guarantee is tricky to
provide in the presence of schedules that introduce redundant recomputation and
computation on uninitialized data, rather than simply reordering statements. In
addition, Halide ensures memory safety through a compile-time bounds inference
engine that determines safe sizes for every buffer and loop in the generated
code, presenting a novel challenge: formalizing and analyzing a language
specification that depends on the results of unreliable program synthesis
algorithms. Our formalization has revealed flaws and led to improvements in the
practical Halide system, and we believe it provides a foundation for the design
of new languages and tools that apply programmer-controlled scheduling to other
domains.</p>A Type-Directed Approach to Program Repair2015-07-16T02:48:26-07:002015-07-16T02:48:26-07:00Alex Reinkingtag:alexreinking.com,2015-07-16:/papers/a-type-directed-approach-to-program-repair.html<p>Published at CAV 2015.</p>
<p><a href="https://link.springer.com/chapter/10.1007/978-3-319-21690-4_35">Link to paper</a></p>
<p>Developing enterprise software often requires composing several libraries
together with a large body of in-house code. Large APIs introduce a steep
learning curve for new developers as a result of their complex object-oriented
underpinnings. While the written code in general reflects a programmer’s intent,
due to evolutions in an API, code can often become ill-typed, yet still
syntactically-correct. Such code fragments will no longer compile, and will need
to be updated. We describe an algorithm that automatically repairs such errors,
and discuss its application to common problems in software engineering.</p>