In this post I define the concept of "hardware completeness" and show its relevance to the design of programming languages and compiler IRs.
Let's say a programming language or compiler intermediate representation (IR) is "hardware-complete" on hardware platform X if you can use it to generate code that's equivalent to any valid machine code program for platform X. For our purposes, "equivalent to" measures both functionality and performance.
Some examples of hardware-complete languages are:
assembly itself (more or less by definition),
C (with inline asm as an escape hatch), and
LLVM IR.
Of course, you can't always get 100% of the performance of handwritten machine code out of, say, a C program—at least, not without replacing your whole program with a giant block of inline asm (although I'd argue the fact that you can do that is part of what makes C hardware-complete). But philosophically, hardware-complete languages try to make it possible to get close to "100% performance", and if you can't, it's probably a bug.
In the world of machine learning, we have many languages and compiler IRs which are not hardware-complete. Some examples of hardware-incomplete languages for ML are:
XLA (HLO),
ONNX,
TensorRT, and
Triton IR.
These are incomplete because there are many valid CPU and GPU programs that you cannot write using these languages. Even if the language is Turing-complete so you can always get functional equivalence, it would often come at the cost of a huge slowdown compared to handwritten assembly (or C).
Using the word "incomplete" might make it sound like a bad thing, but hardware-incomplete IRs have many benefits, including:
Incomplete languages usually target a specific domain (e.g. ML), so (hopefully!) it's more pleasant to write domain-specific code in one of these languages than in a complete, general-purpose language.
Incomplete languages usually have restrictions that make it easier to do compiler optimizations.
It's easier to make incomplete languages portable to different hardware, because you're not aiming to generate any valid program on the target hardware. (For example, LLVM IR is not actually portable; you can't take LLVM IR generated for x86 and expect to run it on a GPU.)
But one negative property of incomplete IRs is that it's hard to translate from one to another. That's because in order to translate from language A to language B, every machine code program that can be generated by A must also be generatable by B, with roughly equivalent performance. This works by definition if B is complete, and usually doesn't work if B is incomplete. And even if it works at any one point in time, incomplete languages are always changing (because you are always finding new programs you want to generate), and it's vanishingly unlikely that they will change in the same way and at the same pace.
For example, although XLA, ONNX, and TensorRT all try to cover roughly the same subset of valid programs on Nvidia GPUs, in practice they do not overlap perfectly, so it's not possible to translate an arbitrary XLA program to ONNX or TensorRT.
Another fact is, because XLA, ONNX, and TensorRT all try to cover roughly the same subset of GPU programs, they look very similar. So similar you might be able to convince yourself that you can translate between them. And you'll be able to get 90% of the way there before you understand the differences well enough to realize how doomed you are.
Of course, if a programming language is designed around the limitations of an incomplete IR, that can work. For example, JAX (originally, "Just Another XLA") is designed around the limitations of XLA, so it can lower to XLA without issue. But it would be hard to lower JAX to a different incomplete IR with slightly-different limitations. Indeed many incomplete IRs have their own dedicated language. For example, the Triton language is tightly coupled to Triton's IR.
The above was just definitions and facts. What can we conclude from it?
Upshot 1: As an ML compilers person, I frequently have the following conversation with hardware people.
Them: We have developed a new IR and compiler for machine learning. We will take your IR (XLA, Triton IR, whatever) and lower it to our new IR. Then all of your programs will get great performance on our hardware.
Me: Thanks but no thanks. I won't be able to lower all Triton programs to your IR.
Them: I don't understand, you lower XLA/Triton programs to LLVM IR today, right? What's the difference?
The fundamental problem is that the hardware person is asking me to lower my incomplete IR to another incomplete IR. For the reasons above, this is unappealing and probably won't work.
Upshot 2: If you're designing an incomplete IR, I'd say you probably should design a programming language to go with it. This way you can enforce your IR's limitations at the level of user code, instead of having performance and functionality cliffs that are unpredictable to your users and that depend on which compiler backend they're using. I think this tight coupling between language and IR may be one of the reasons JAX and Triton have been so successful.
A corollary of this is that your incomplete IR probably won't be relevant unless its associated programming language is successful. At which point, you should probably think about the programming language first and the IR second.
Upshot 3: This also explains my skepticism of "ML compiler hourglass diagrams". Behold.
In all of these cases, the narrow part of the hourglass is an incomplete language, so it has all the problems I've described above.
Thanks to Kyle Huey and Mehdi Amini for feedback on drafts of this post.
It took me years before I understood exactly what a convolution is in the machine learning context. I understood the simple "filter moving over an image" convolution, but I couldn't figure out how to translate that into the operation that's performed by machine learning.
I read posts that used complex 4D diagrams (3D plus animation!) or 6-deep nested for loops. Nothing against that if it makes sense to you, but it never did to me.
It turns out there's a simple transformation from a basic convolution to an ML convolution. I haven't seen it explained this way anywhere else, so this is what I want to show in this post. Maybe it will help you like it helped me.
As a bonus, I'll show how this perspective makes it straightforward to observe the fact that 1x1 ML convolutions are just matrix multiplications.
To review, a "basic convolution" is a function which takes as input two 2D arrays of real numbers:
We expect the filter to be smaller than the image.
The basic convolution drags the filter over the image. At each point of overlap, we produce one output element by multiplying the values in the filter by the values in the image that are "beneath" them and adding up these products.
When the filter is in the top left of the input image, this process produces the top-leftmost output point. If the filter moves (say) one element rightward, we now generate the output point one element to the right from the top-left.
We thus produce an output image of size \( (H_o, W_o) \) (output height/width). The output is a little smaller than the input image because the kernel can't exceed the bounds of the input. (You can fix this by padding the input with 0s around the edges.)
To make up some notation describing the dimensions of the input/output arrays, we might write
\[ \operatorname{ConvBasic}( (H_i, W_i),\ (H_f, W_f) ) \rightarrow (H_o, W_o). \]
If you're not familiar with all this, check out this explainer, up to but not including the section entitled "the multi-channel version".
The basic convolution is simple enough. But an "ML convolution" is a different beast.
An ML convolution is an operation on 4D arrays of real numbers.
We can write this in our notation from above as
\[ \operatorname{MLConv}( (N, H_i, W_i, C_i),\ (H_f, W_f, C_i, C_o) ) \rightarrow (N, H_o, W_o, C_o). \]
Yikes, that's a lot of dimensions. What do they all mean, and how do they interact?
That's what I want to break down in this post. We'll take \( \operatorname{ConvBasic} \) and add one dimension at a time, until we have the full \( \operatorname{MLConv} \) function.
First, let's add input channels to the image.
Consider the elements of the image. Instead of them being real numbers, suppose they're vectors in \( \mathbb{R}^n \). For example, maybe your image is RGB; in this case, \( n=3 \). Or maybe your image is the result of an intermediate layer in a convolutional net; in this case, maybe \( n=128 \).
How do we "multiply" the elements of the image by the filter? One option is to let the elements of the filter also be vectors in \( \mathbb{R}^n \). Now when we multiply an element of the input by an element of the filter, we take their dot product, giving us a real number. Then like before, we add up all these dot products to get the output element, which is also a real number.
As you might have guessed, \( n \) is called the number of "input channels". It corresponds to dim \( C_i \).
So now we have a function which takes as input two 2D arrays, of size \( (H_i, W_i) \) and \( (H_f, W_f) \), where each element is a vector in \( \mathbb{R}^{C_i} \). Or equivalently, we take two 3D arrays of real numbers, of dimensions \( (H_i, W_i, C_i) \) and \( (H_f, W_f, C_i) \).
This operation still outputs a 2D array of real numbers of size \( (H_o, W_o) \), where the output height/width are a bit smaller than the input.
In our notation, this is
\[ \operatorname{ConvInputChannels}( (H_i, W_i, C_i),\ (H_f, W_f, C_i) ) \rightarrow (H_o, W_o). \]
Above, our output image is only 2D. Let's add the \( C_o \) ("output channels") dimension to make it 3D.
Suppose we create \( C_o \) independent filters and repeat the process above once for each filter. This results in \( C_o \) 2D output images.
Now the output is a 3D array of dimension \( (H_o, W_o, C_o) \). The filter also gains an additional dimension, \( (H_f, W_f, C_i, C_o) \).
Notice that we take the dot product of each element of the input image, a vector in \( \mathbb{R}^{C_i} \), with \( C_o \) different vectors from the filter. You can think of this as taking a repeated dot product, or equivalently as a matrix-vector multiplication.
In our notation, we now have
\[ \operatorname{ConvInOutChannels}( (H_i, W_i, C_i),\ (H_f, W_f, C_i, C_o) ) \rightarrow (H_o, W_o, C_o). \]
The only remaining dimension to handle is the "batch" dimension \( N \).
To add this, we have \( N \) independent input images which generate \( N \) independent output images. This is the batch dimension.
Now each output element is a \( (C_o, N) \) matrix, the sum of matrices formed by multiplying a \( (C_o, C_i) \) filter element by a \( (C_i, N) \) matrix.
As before, we repeat this process for each of the \( H_o, W_o \) output elements.
We now have all four dimensions, so we can properly call this \( \operatorname{MLConv} \):
\[ \operatorname{MLConv}( (N, H_i, W_i, C_i),\ (H_f, W_f, C_i, C_o) ) \rightarrow (N, H_o, W_o, C_o). \]
Notice that the only difference between these four functions is in how we "multiply" elements.
That's all an ML convolution is!
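If the loop version does work for you, here's a minimal reference sketch (mine, not taken from any framework) of a naive MLConv on flat, row-major arrays. Each loop is labeled with the dimension it picked up in the walkthrough above, and setting \( N = C_i = C_o = 1 \) recovers \( \operatorname{ConvBasic} \).
#include <cstddef>
#include <vector>
// input:  N  x Hi x Wi x Ci
// filter: Hf x Wf x Ci x Co
// output: N  x Ho x Wo x Co, with Ho = Hi - Hf + 1 and Wo = Wi - Wf + 1 (no padding, stride 1).
std::vector<float> MLConv(const std::vector<float>& in, const std::vector<float>& filt,
                          int N, int Hi, int Wi, int Ci, int Hf, int Wf, int Co) {
  const int Ho = Hi - Hf + 1, Wo = Wi - Wf + 1;
  std::vector<float> out(static_cast<std::size_t>(N) * Ho * Wo * Co, 0.0f);
  for (int n = 0; n < N; ++n)                  // batch dimension
    for (int ho = 0; ho < Ho; ++ho)            // output height: filter position
      for (int wo = 0; wo < Wo; ++wo)          // output width: filter position
        for (int co = 0; co < Co; ++co)        // output channels: one independent filter each
          for (int hf = 0; hf < Hf; ++hf)      // filter height
            for (int wf = 0; wf < Wf; ++wf)    // filter width
              for (int ci = 0; ci < Ci; ++ci)  // input channels: the dot product
                out[((n * Ho + ho) * Wo + wo) * Co + co] +=
                    in[((n * Hi + ho + hf) * Wi + wo + wf) * Ci + ci] *
                    filt[((hf * Wf + wf) * Ci + ci) * Co + co];
  return out;
}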
Now for our bonus fact. Suppose the filter's height and width are 1 — i.e. \( H_f = W_f = 1 \). I claim that this convolution is just a matrix multiplication.
First, observe that the batch, image height, and image width dimensions are interchangeable in a convolution with a 1x1 filter. That is, because there's no "mixing" along the height/width, we can reshape a \( (N, H_i, W_i, C_i) \) input into \( ( N \times H_i \times W_i, 1, 1, C_i ) \), do the conv, and then reshape back to the desired output shape.
So it's sufficient to show that a conv with 1x1 filter and input height/width 1 is a matmul.
But look at the diagram for "Batch Dimension" above. If the input height and width are 1, then there's only one output element. And if the filter size is 1x1, then there's only one term on the RHS, so the one output element is computed as a matmul. Thus, the whole operation is equivalent to a single matrix multiply! □
Thanks to George Karpenkov, Marek Kolodziej, Blake Hechtman, Kyle Huey, and Alexander Zinoviev for their feedback on drafts of this post.
rvalue references are one of the most powerful new features in C++11. But they're also, in my experience from teaching classes at work and reviewing code, one of the most difficult to grok.
I've come to think this isn't so much because rvalue references are particularly complex (by C++'s standards, anyway). Rather, much of the difficulty seems to stem from the fact that so much of how they work has no syntax attached. If you don't know the implicit rules, rvalue references just look like magic.
In this post, I'm going to try to motivate and explain rvalue references using practical examples. My hope is that this will demystify both how they work, and why they were designed this way.
Note to language lawyers: I'm intentionally playing fast and loose with some parts of the standard here. "Practical" is the key word in the title. :)
Imagine you're writing a simple stack class, based on a singly-linked list.
template<typename T>
class Stack {
private:
struct Node {
T elem;
Node* next;
};
Node* first;
public:
void push_front(const T& t) {
Node* n = new Node();
n->elem = t;
n->next = first;
first = n;
}
// pop_front, constructors omitted.
};
Now suppose you use your class as follows.
Stack<string> s;
s.push_front(get_string());
Let's break down what this does:
get_string() creates and returns a string. This involves malloc'ing a buffer on the heap and writing some data into it.
push_front() copies the given string into a newly-allocated Node (n->elem = t;). This involves malloc'ing a buffer for the new string's data and memcpy'ing the original string's data into the new buffer.
push_front() returns, and we destroy the original string, free'ing its buffer.
This is frustratingly inefficient. You and I know that the string returned by get_string() is going to go away right after the call to push_front. Why do we need to malloc a new buffer and free the old buffer, when we could simply steal the old buffer for our own purposes? This would let push_front avoid calling malloc, free, and memcpy entirely, for a big potential speedup.
Let's first look at how we'd solve this inefficiency without C++11.
We want to let users of our class get this faster behavior, so before C++11, we might write a new member function.
void push_front_destructive(T& t) {
Node* n = new Node();
using std::swap;
swap(n->elem, t);
n->next = first;
first = n;
}
(This business with using std::swap
is an ADL trick; it lets us call a
specialized version of swap
if one exists for the type T
, otherwise it
calls std::swap
.)
This is great, it runs in O(1) time, and steals the pointer just like we want.
Now we try to use our stack like we did originally.
Stack<string> s;
s.push_front_destructive(get_string()); // ERROR
But this is a compile error! The problem is that C++ does not let you bind a
temporary value (get_string()
) to a non-const reference (t
). One of the
creators of C++ explains that this decision was made to prevent
certain kinds of bugs. We might disagree, but whatever, it's the rule, and it
gets in our way here.
We can still make this work, but it's a bit ugly:
Stack<string> s;
string str = get_string();
s.push_front_destructive(str);
This is OK because now we're asking the non-const reference arg in
push_front_destructive
to bind to the non-temporary str
.
(You may have noticed that I haven't defined what a "temporary" actually is.
The exact definition is...complicated. For our purposes, it's a
sufficient model to say that something is a temporary if you can only use it
once. So for example, get_string()
is a temporary, because you can do
get_string().c_str()
or get_string().length()
or whatever, but after you do
that one thing, you can't go back to the original object and do a second thing.
In contrast, if I have a string x
, that's not a temporary because I can
touch x
multiple times.)
This is sort of the best we can do without C++11. It's not awful, but there
are a few things that could be improved. For one thing, having to assign
get_string
into a named variable is ugly. In addition, we have to remember
to call push_front_destructive
rather than the regular push_front
.
But from the perspective of C++'s design philosophy, the worst part may be that
we still do unnecessary work. It's not as much as before, but we have to
default-construct a T
before swapping into it, and we have to do both sides
of our swap, even though we only care about the final value of n->elem
. Both
of these steps can be expensive on arbitrary objects.
So here's where C++11 comes in to solve our problems. C++11 introduces a new reference type, called the rvalue reference. It is exactly like a non-const reference, except it's only allowed to refer to temporaries. (Recall that a regular non-const reference can refer only to non-temporaries.)
Again, to emphasize: An rvalue reference is basically the same as a regular (non-const) reference. The only difference is in the referent: A regular non-const reference always points to a non-temporary, while an rvalue reference always points to a temporary.
Using just rvalue references and nothing else new, we can add an overload of
push_front
that accepts only temporary objects. This overload can
destructively modify its inputs at will, because (the assumption is) they're
going away anyways.
void push_front(T&& t) {
Node* n = new Node();
using std::swap;
swap(n->elem, t);
n->next = first;
first = n;
}
(An rvalue reference is spelled &&
. The double-ampersand is a single unit;
don't try to read it as "reference-to-a-reference".)
The body of our new function is exactly the same as push_front_destructive
,
because rvalue references behave exactly like regular references.
With this new overload, we can now write code that uses our stack like the following.
Stack<string> s;
s.push_front(get_string()); // Swaps
string str = ...;
s.push_front(str); // Copies
We now have two overloads of push_front
: A version which takes a const
reference (which can bind to a temporary or a non-temporary) and doesn't modify
its input, and a version which takes an rvalue reference (which can only bind to
a temporary) and swap
s its input. The first call to push_front
passes a
temporary, so it can bind to either overload. The compiler picks the rvalue
ref one, because it's more specific. The second call to push_front
passes a
non-temporary, so it can only bind to our original function, which copies.
Consider again the most recent use of our stack class above. Suppose we wanted
to change it so we invoke the new rvalue ref overload for the second call to
push_front
, because we don't care about the value of str
after push_front
completes. C++11 adds a mechanism for us to do this: std::move
. It looks
like this.
string str = ...;
s.push_front(std::move(str)); // Swaps
All std::move
does is imbue its argument with temporaryness. I know,
it's confusing, because "move" is a verb; it sounds like it should do
something. But in fact it's just a cast, just like const_cast
or
reinterpret_cast
. Like those casts, std::move
just lets us use an object
in a different way; on its own, it doesn't do anything.
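To drive the point home, here's roughly what an implementation of std::move boils down to. (This is a sketch of the idea, using my own name my_move; it's not the exact standard-library source.) It does no work at all; it just hands back the same object with a "temporary" (rvalue reference) type.
#include <type_traits>

template <typename T>
typename std::remove_reference<T>::type&& my_move(T&& t) noexcept {
  // No copying, no moving, no side effects: just a cast to an rvalue reference.
  return static_cast<typename std::remove_reference<T>::type&&>(t);
}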
Now that we have rvalue references and std::move
, we've solved some of the
problems with our original push_front_destructive
. We no longer have to name
our temporaries and remember to call push_front_destructive
in order to get
the faster behavior. This is great! But our new overload of push_front
still
isn't as efficient as it could be. We're still default-constructing a T
, and
we're still swap
'ing, even though we only care about one side.
The solution? We'll apply this same technique of adding rvalue-reference overloads alongside const-reference ones, but this time to constructors.
Consider std::string
. It's always had a copy constructor, which takes its
argument by const reference. Suppose we added a new constructor, which takes
its argument by rvalue reference. It might look like this.
class string {
public:
string(const string& other); // Copy constructor, exists pre C++11
string(string&& other) { // Move constructor, new in C++11
length = other.length;
capacity = other.capacity;
data = other.data;
other.data = nullptr;
}
private:
size_t length;
size_t capacity;
const char* data;
};
This new constructor is called a move constructor. Our implementation
takes a temporary string, steals its pointers, and copies its non-pointer
members. Note that we have to set other.data = nullptr
, otherwise when
other
is destroyed, it will free
its data
pointer, and our data
pointer
will then be stale. (We might also have to modify string
's destructor to
handle a null data
pointer.)
Now when we create a string, we can play the same overloading game we did with
push_front
. If we pass a non-temporary, we call string
's copy constructor
(which malloc
's a new buffer and so on), but if we pass a temporary, we call
the move constructor.
string a(get_string()); // move constructor
string b(a); // copy constructor
string c(std::move(b)); // move constructor
Note that if string
didn't have a move constructor, the line string
c(std::move(b))
would just call the copy constructor, because its const
string&
arg happily binds to temporaries. But everything in the C++ standard
library has been updated to have move constructors where appropriate.
The addition of move constructors is a particularly cool change to the standard
library. The line string a(get_string());
could have been written pre-C++11,
and it would have resulted in an O(n) copy. But now, just by compiling our
code with -std=c++11
, it's an O(1) move instead! Getting asymptotic speedups
in your code is impressive for a backwards-compatible language change.
In what you're probably noticing is a pattern, this same trick can be applied
to the assignment operator, operator=
. Here's what that looks like:
class string {
public:
string& operator=(const string& other); // Copy assn operator, pre C++11
string& operator=(string&& other) { // Move assn operator, new in C++11
length = other.length;
capacity = other.capacity;
delete data; // OK even if data is null
data = other.data;
other.data = nullptr;
return *this;
}
};
string a, b;
a = get_string(); // Move assignment
a = b; // Copy assignment
a = std::move(b); // Move assignment
Just like the pre-C++11 compiler would try to auto-generate a copy constructor
and copy assignment operator for you, the C++11 compiler will try to
auto-generate a move constructor and move assignment operator for you. The
rules for when you get one of these are complicated, but the
main one is that if you define a custom copy constructor, the compiler won't
auto-generate a move constructor or move assignment operator for you. You can
use =default
if you want to override this behavior, and you can use =delete
if you want to explicitly prevent the compiler from auto-generating one of these
functions.
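For example, a sketch like the following (the class name is mine) asks the compiler for a move constructor even though we wrote a custom copy constructor, and explicitly forbids move assignment.
class Widget {
 public:
  Widget(const Widget& other);                 // Custom copy constructor: suppresses the auto-generated moves.
  Widget(Widget&& other) = default;            // Ask the compiler to generate the move constructor anyway.
  Widget& operator=(Widget&& other) = delete;  // Explicitly forbid move assignment.
  // Other members omitted.
};
One subtlety: a deleted move assignment operator still participates in overload resolution, so w = std::move(other) becomes a compile error rather than quietly falling back to copy assignment.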
In most general-purpose code, you're likely going to interact with rvalue
references pretty much exclusively via move constructors and move operators.
Non-constructor functions like push_front
that take rvalue references are
relatively rare, and mainly used in collections, wrapper classes, and for
perfect forwarding in functions that wrap constructors, such as
std::make_shared
.
After all that explanation, I can finally show you how move constructors solve the remaining problems in our push_front implementation. To remind you, we currently have
void push_front(T&& t) {
Node* n = new Node();
using std::swap;
swap(n->elem, t);
n->next = first;
first = n;
}
Our goals are to
(1) get rid of the swap (we only care about one side of it), and
(2) avoid default-constructing a T inside the Node constructor (since, why bother).
(1) is easy, we can use the move-assignment operator, as follows.
void push_front(T&& t) {
Node* n = new Node();
n->elem = std::move(t);
n->next = first;
first = n;
}
The thing to note here is that we do need the std::move
, if we want to call
the move assignment operator, even though t
is an rvalue reference. Recall
that rvalue references are just like regular references; the only thing that's
different is that we know the referent is a temporary. In particular, t
itself is not a temporary!
A surprising consequence of this can be seen in the following code.
void foo(const string&) {}
void foo(string&& s) { foo(s); }
foo("bar");
This is not infinite recursion, because s
is not a temporary! An rvalue
reference merely points to a temporary; it is not a temporary itself. This is
weird, I know, but it would be weirder if it were the other way around.
(Exactly why is left as an exercise to the reader.)
Okay, we have (1). What about (2)? To avoid default-initializing Node::elem
,
we need to define and call a constructor on Node
that moves t
. It might
look like this.
struct Node {
Node(T t, Node* next) : elem(std::move(t)), next(next) {}
T elem;
Node* next;
};
void push_front(T&& t) {
first = new Node(std::move(t), first);
}
This still isn't perfect, as we actually move-construct a T
twice: Once inside
push_front
when we call the Node
constructor, and again within the Node
constructor. In general this isn't a big deal, and the compiler can likely
optimize it away. But if you really cared, you could modify Node's constructor
to take a T&&
. You would still need the std::move
!
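For completeness, here's a sketch of that variant. Both std::moves are still required, because a named rvalue reference is not itself a temporary.
struct Node {
  Node(T&& t, Node* next) : elem(std::move(t)), next(next) {}  // move-construct elem directly
  T elem;
  Node* next;
};

void push_front(T&& t) {
  first = new Node(std::move(t), first);
}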
(You couldn't avoid writing this constructor on Node
by using
aggregate initialization, as that always copies its elements. I have no
idea why that choice was made.)
And that's it, we're done! Easy, right? :)
Well, we're not quite done. No discussion of rvalue references would be
complete without talking about std::unique_ptr
. unique_ptr
is a move-only
class: You can't copy one (because then the pointer it wraps would not be
"unique"). We say unique_ptr
"owns" a pointer, because when the unique_ptr
goes away, it delete
s its pointer.
An idiom you'll probably encounter with unique_ptr
(and other classes, even
ones that are not move-only) is passing unique_ptr
by value to indicate a
transfer of ownership.
void foo(unique_ptr<Bar> bar);
unique_ptr<Bar> my_bar;
foo(std::move(my_bar));
Here we give foo
our copy of my_bar
. foo
now has the only owning copy,
and can do whatever it wants with it (e.g. store it in a global variable
somewhere).
Your first question here might be, what's the value of my_bar
after we call
foo
? The answer depends on what unique_ptr
's move constructor does!
Hypothetically, it could do basically anything. It could leave my_bar
in a
state such that you get undefined behavior if you call any functions on it.
The only requirement is that we can safely run my_bar
's destructor.
Some classes are more constrained in their behavior after being moved from.
According to the spec, a unique_ptr
is null after it's moved from.
std::string
is in a "valid, but unspecified state", meaning, you can safely
call functions on the string, but who knows what you'll get.
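Here's a quick illustration of the unique_ptr guarantee (a toy snippet of mine, not part of our Stack example).
#include <cassert>
#include <memory>
#include <utility>

int main() {
  std::unique_ptr<int> a(new int(42));
  std::unique_ptr<int> b = std::move(a);  // invokes unique_ptr's move constructor
  assert(a == nullptr);                   // guaranteed: a moved-from unique_ptr is null
  assert(*b == 42);                       // b now owns the int
}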
Another common question I get is, if std::move
isn't where we move my_bar
,
where the heck does the move happen? My model is as follows. Before we start
running foo
, we execute a function prelude, which sets up foo
's arguments.
In particular, for each parameter P in foo
's declaration, we construct a new P
inside the scope of our call to foo
, using the argument passed to the call.
So in the case above, where foo
takes a unique_ptr<Bar> bar
parameter by
value and we do foo(std::move(my_bar))
, it's as though foo
executes
unique_ptr<Bar> bar(std::move(my_bar))
in its prelude. This is where the
move occurs. (Note that if foo
took its parameter by, say, rvalue reference,
then the same call to foo
would result in us doing unique_ptr<Bar>&&
bar(std::move(my_bar))
in the prelude, which is just binding a reference; it's
not a copy or a move.)
Something similar happens when you return a value: In the function epilogue, we
construct a new value in the calling scope using the value you're returning.
There's a wrinkle with C++11, however, which is that when you return foo
, if
foo
is a local variable or function argument, it is implicitly a temporary.
You don't need to return std::move(foo)
to get efficient move construction of
the return value.
In fact, you don't want to return std::move(foo)
, because (hilariously) that
prevents RVO. If your type is efficiently movable, this probably isn't a
big deal. But if your type is not movable, then you end up with a full-blown
copy! (RVO would have elided this copy, making it free.) clang 3.7's
-Wpessimizing-move
will warn you about this footgun. You do still need
std::move
if you're doing something like return std::move(foo.bar)
–
only arguments and local variables, not their members, are implicitly treated
as temporaries in return statements.
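Here's a sketch of the footgun (my example, not from the stack code above). The first function gets RVO or, at worst, a cheap move; the second defeats RVO and forces a move (or a full copy, if the type isn't movable); the third is the case where std::move really is needed.
#include <string>
#include <utility>

std::string good() {
  std::string s = "hello";
  return s;                    // RVO, or at worst a move; no std::move needed
}

std::string pessimized() {
  std::string s = "hello";
  return std::move(s);         // triggers -Wpessimizing-move: defeats RVO
}

struct Wrapper { std::string inner; };

std::string from_member(Wrapper w) {
  return std::move(w.inner);   // needed: members aren't implicitly treated as temporaries
}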
unique_ptr
is awesome because it lets you avoid a lot of bugs, and it lets you
make any class efficiently movable. And in fact since most of your code isn't
performance-sensitive, but presumably all of it is sensitive to memory leaks and
use-after-free bugs, I'd say that unique_ptr
is actually more important than
all this avoiding-copies stuff we've been talking about. But unique_ptr
is
so convenient, I sometimes see people using it everywhere, even when a plain
value type would do. I call this the "I can write Java in any programming
language" antipattern.
For example, instead of writing
void foo(unique_ptr<vector<unique_ptr<Bar>>> v);
why not just write
void foo(vector<unique_ptr<Bar>> v);
? vector
is already efficiently movable, and in fact vector<unique_ptr<T>>
is not copyable (because a vector is not copyable if its elements are not
copyable). So the unique_ptr
buys you nothing other than some extra typing
and maybe a cache miss.
Similarly, instead of writing
void foo(const unique_ptr<Bar>& bar);
why not write
void foo(const Bar* bar);
The former is annoying because someone can only call your function if the data
is actually stored in a unique_ptr
somewhere. But maybe it's stored on the
stack, or maybe it's in a raw pointer for some reason. The former form is kind
of mean to your callers.
One last antipattern: Instead of writing
void foo(unique_ptr<T>&& ptr);
consider simply
void foo(unique_ptr<T> ptr);
Because unique_ptr is move-only, callers have to std::move the argument to the second foo, and you know that the value afterwards is null. Whereas with the first one, the value after foo completes could be anything. I prefer the second form because its behavior is more predictable.
You can extend this argument to types which are both movable and copyable; e.g.
if you're writing a function which needs to take ownership of a string, prefer
foo(string s)
to foo(string&& s)
. Both let you avoid a copy if you're
passing a temporary, but the former is a more flexible API, as it will happily
make a copy of a non-temporary. Of course, if you really want to prevent
copying, maybe taking the arg by rvalue ref is for you.
If you remember nothing else, remember the following two rules. They are enough to recover most of the content of this post.
An rvalue reference is a reference
to a temporary.
std::move is a cast
that lets you treat its argument as a temporary.
The linebreaks here are significant: You can stop reading each of the rules
after the first line. Rvalue references are just like normal non-const
references (except they always point to temporaries, which regular non-const
references never do). std::move
is a cast, so does nothing by itself; it just
lets you treat something as a temporary.
Once you know about move constructors, it can be easy to tie yourself into knots while trying to avoid all unnecessary copies. Some people even try to avoid unnecessary moves. I'd say, hakuna matata, try not to worry about it. Premature optimization is the root of all evil; just write clean code and profile it if you care. It's been my experience that the rvalue rules often make the code you'd naturally write pretty fast, anyway.
If you want a more rigorous take on this material, I recommend Thomas Becker's article. In particular, he covers perfect forwarding and reference collapsing, which are useful for library classes. Becker also has a good article on auto and decltype, if you're interested.
http://cppreference.com is also your friend. It's dense, but it has all the rules, and I've found learning to read that stuff to be well worth the effort.
Good luck, and happy coding!
Thanks to Andrew Hunter and Kyle Huey for many improvements to this post.
Apple's page describing Mac OS 10.9 contains a comparison of the performance of Safari, Chrome, and Firefox. I've reproduced it below since I expect Apple will take the page down soon.
The data is bunk, of course. SunSpider is a terrible benchmark, and without workloads specified for the memory usage and power consumption measurements, the results mean nothing.
(In case you're wondering, the footnotes referenced under the graph specify the hardware and browser versions used. They also say that "Performance will vary based on system configuration, network connection, and other factors," which reminds me of the first frame of this xkcd.)
But to focus on the crappy benchmarking here would miss the point: Did you notice that Safari's memory usage gets equal billing with its JS speed? That's amazing! Just a few years ago, nobody talked about memory usage. I'd like to think we (and in particular Nick) can take some credit for that.
Nicholas Nethercote is on vacation, so this MemShrink report is a guest post written by Andrew McCreight.
May has been a quiet month for MemShrink, as vacation season begins, but one very important fix landed, as well as the fix of a new leak.
Brian Hackett took advantage of the switch to the Baseline JIT compiler to delay the analysis of JS scripts until after they have been compiled by the Baseline compiler. This reduces memory usage for scripts that are only run a few times, which is fairly common. On desktop, this reduced startup resident memory by 14mb, which is about 10%. On Fennec, the reduction was 2mb. That 14mb reduction entirely eliminates the difference in memory between startup and 30 seconds later. This was a MemShrink P1.
Andrea Marchesini fixed a leak involving DOMError and IndexedDB found by Jesse Ruderman. Fuzzing (automatically generating and running test cases) is an important approach to finding leaks in newly landed code, in situations Firefox developers did not think to test.
Nicholas Nethercote is on vacation, so this fortnight's MemShrink report is a guest post written by Andrew McCreight.
Joe Drew made it so that, on B2G, images being displayed on the active page aren't locked, allowing them to be discarded when there is memory pressure. This is important for displaying image-heavy pages like Pinterest on B2G without OOM crashes.
Andreas Gal added a preference to control the size of the canvas image cache. The use of canvas with large images was causing OOM crashes for the B2G Gallery app. The limit is set to 10mb for B2G, and remains unlimited on desktop.
Scoobidiver noticed a large increase in empty crashes on Firefox 23. Previously, these have been identified as being caused by leaks of virtual address space. Scoobidiver and Benjamin Smedberg investigated, found a possible culprit, and saw that the number of empty crashes went down to its previous level, or even lower, after some followup work for that bug was landed.
Timothy Nikkel fixed a leak caused by some recent changes to reflow-on-zoom.
Justin Lebar investigated a leak involving long-running automated testing on B2G and determined that it was the same layers leak previously found and fixed, but not backported to B2G18, by new contributor Christophe Mourand.
Randell Jesup fixed a WebRTC leak.
Brian Hackett reduced the memory usage of IonMonkey compilation, which should help on things like JS games that intensively run a lot of JS.
Nicholas Nethercote wrote a patch to change how decommitted GC arenas are shown in about:memory. Decommitted arenas don't actually use physical memory, so they need to be accounted for differently.
Boris Zbarsky reduced the memory usage of CSS in certain cases.
In B2G, we try to dynamically change processes' CPU priorities depending on how important we think the process is. For example, processes in the foreground get a higher CPU priority.
I spent the past few days tracking down a bug where our mechanism for changing the CPU priorities didn't seem to be working properly. Thanks to Gabriele Svelto, we figured out today what was going on.
The issue is that the setpriority
syscall on Linux does not actually set a
process's priority, as you might be led to believe from its man page. Instead,
setpriority
changes the priority of a process's main thread. It leaves the
priority of the process's other threads unchanged. I haven't tested, but from
reading the kernel sources, the nice
syscall should behave the same way.
Maybe this gotcha is common knowledge, but I was certainly caught unawares.
Here's a simple demonstration of the issue using this testcase. Be warned that this is a Linux-only test, although it's not hard to fix if you're intent on trying it on Mac OS. (Spoiler: renice on Mac OS sets the priority for all threads in the process.)
$ cd $(mktemp -d)
$ curl https://gist.github.com/jlebar/5367521/raw/849b6f5c58d054cf10568ac1f3902ab679f9db4f/gistfile1.cpp > test.cpp
$ g++ -O2 test.cpp -o test -lpthread -lrt -lm
$ taskset 1 ./test
tid 1399: Finished iteration in 1.65s
tid 1400: Finished iteration in 1.64s
Notice that the two threads are running at the same speed. Now let's change
the priority. renice
simply calls setpriority
on the
given pid, so we can use that.
# In a separate terminal
$ renice 10 $(pidof test)
# In the first terminal
tid 1399: Finished iteration in 8.50s
tid 1400: Finished iteration in 0.90s
Now notice that the spawned thread (tid 1400) is running almost twice as fast as it was, while the main thread (tid == pid == 1399) is running much slower.
This does not entirely prove my point. All I've shown so far is that renicing the main process lowers the priority of the main thread more than it lowers the priority of the other thread. But it's still possible that both threads' priorities were lowered.
This is a simple hypothesis to test: Just spawn a new process that spins the
CPU. If the spawned thread is actually running with niceness 0, it should
share roughly half of the CPU with the new process. On the other hand, if
renice
affected the spawned thread , we'd expect to see the new process get
more than half of the CPU.
# In a separate terminal
$ taskset 1 yes > /dev/null
# In the first terminal
tid 1399: Finished iteration in 16.95s
tid 1400: Finished iteration in 1.83s
Thread 1400's time to finish an iteration doubled, from 0.9s to 1.8s,
indicating that it's sharing the CPU equally with yes
. So we conclude that
the renice had no effect on the thread.
I tried a similar procedure on my Mac. Mac doesn't have any way to pin a process to a CPU, so I launched a bunch of programs each spinning the CPU and then reniced the main process. After renicing, the two threads continued to run at the same speed, just as things should work!
I take a few lessons away from this.
The setpriority(2)
and nice(2)
man pages on Linux should mention this
gotcha.
renice(1)
should probably do the right thing and modify the priority of
all of the process's threads, or at least it should give you that option.
If you're using setpriority
to modify a thread's priority, you are
setting its absolute priority on the system, not its priority relative to
its process's priority. Therefore you probably want to increase or decrease
the thread's priority relative to its initial value, instead of setting it
to an absolute value.
It's scary to think about how much of Linux was designed before multithreaded programs were common. You'd never implement things this way today.
Gabriele dug up a thread on LKML where it was suggested that setpriority
on a pid should modify all threads' priorities, but this was rejected because
the kernel maintainers were afraid this would break userspace programs which
rely on the current behavior.
I'm not so brave as to try to write new functions in glibc that do what we want here (e.g. modify the priority of all threads relative to the main thread's new priority), but at least we can fix NSPR. :)
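For reference, here's a minimal sketch (mine, not NSPR's actual code) of the per-thread building block: on Linux, passing a thread ID to setpriority with PRIO_PROCESS adjusts just that thread, so a thread can renice itself relative to its current value.
#include <errno.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

// Adjust the calling thread's niceness by `delta`, relative to its current value.
// Returns 0 on success, -1 on failure (with errno set).
int renice_current_thread(int delta) {
  pid_t tid = (pid_t) syscall(SYS_gettid);  // thread ID; equals the pid only for the main thread
  errno = 0;
  int current = getpriority(PRIO_PROCESS, tid);
  if (current == -1 && errno != 0)          // -1 can be a legitimate niceness, so check errno
    return -1;
  return setpriority(PRIO_PROCESS, tid, current + delta);
}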
If you're on a Mac and you ever use grep, do yourself a favor and do
$ port install grep
That's likely all you need to do to install GNU grep and make it the default, which on my machine is a ~10x speedup over the grep that ships with Mac OS.
To wit, here are some benchmarks on my git clone of mozilla-central, with a warm disk cache.
Before:
mozilla-central$ which grep
/usr/bin/grep
mozilla-central$ time git ls-files -z | xargs -0 grep foobar > /dev/null
real 0m19.583s
user 0m18.853s
sys 0m0.722s
After:
mozilla-central$ which grep
/opt/local/bin/grep
mozilla-central$ time git ls-files -z | xargs -0 grep foobar > /dev/null
real 0m1.386s
user 0m0.754s
sys 0m0.613s
Of course in this particular case I could use git grep
, which doesn't use the
system's grep. It's about the same speed as GNU grep:
mozilla-central$ time git grep 'foobar' > /dev/null
real 0m1.470s
user 0m0.956s
sys 0m1.845s
/usr/bin/grep -V
says that it's FreeBSD grep version 2.5.1. I don't know if
the grep FreeBSD actually ships is similarly slow.
I just found out today that the Metrics front-end folks implemented one of my oldest feature requests for telemetry evolution: You can now create "dashboards" with multiple plots and link to them.
I've created a dashboard with some memory-related plots I care about at http://bit.ly/memorytelemetry.
I'll try to keep that link up to date so you can bookmark it and always see an interesting set of graphs.
You can create your own dashboard by pulling up Telemetry Evolution, clicking the gear at the top right of the graph, and clicking "duplicate".
Once upon a time, I wanted to add telemetry for counting the number of zombie compartments present in an instance of Firefox.
The trouble is, it's hard to write code which can tell whether a compartment is a zombie. Our manual procedure for finding zombies involves closing most of your open tabs. But of course we can't close all your tabs every time we want to record the number of zombies for telemetry.
This is a story of the creature which solved that problem, and how we used it to verify a large bugfix.
My solution to the zombie-identification problem was to give up on it.
Instead, I realized that (almost?) all compartment leaks are due to leaks of DOM windows. So perhaps we could identify leaked windows instead of leaked compartments.
I quickly gave up on that, too. It's hard to tell for sure whether a window is leaked. For example, suppose a page does
window.my_popup = window.open('http://mozilla.org');
window.my_popup.close();
The mozilla.org window will now live until we close the top-level page (or until the page clears window.my_popup
).
This situation is difficult to distinguish from the case when Firefox chrome JS code (either baked into Firefox or from an add-on) holds a reference to the mozilla.org
window. But in the first case, it's (potentially) a bug in a page, while in the second case, it's a bug in Firefox. We need to distinguish between these cases in order to have a reasonable notion of "leaked window".
My next realization was that I could get somewhere by expecting less of myself. Enter ghost windows.
If my goal is to collect telemetry on the number of leaked windows, I don't have to identify every single leak with zero false positives. If I were instead to capture the "obvious" leaks with a relatively low false-positive rate, that would still be an interesting number.
Ghost windows are an attempt to capture "obviously" leaked windows. We say that a window is a ghost if it meets three criteria:
The window's docshell is null, which basically means it is not part of any tab.
All other windows from the same top-level origin also meet criterion (1) above. (That is, if you have a tab open to http://mozilla.org
, then a window from http://people.mozilla.org
will never be a ghost.)
The window has met the two criteria above for at least 60 seconds.
The experienced zombie hunter may notice that these are basically the same criteria that one uses when looking for zombie compartments by hand.
Criterion (3) is in place because windows don't get destroyed immediately once they become garbage; it takes potentially multiple runs of the garbage and cycle collectors to destroy a window, and we don't want to flag a window as a ghost until we believe it's survived a sufficient number of collection attempts.
Criterion (2) tries to exclude the case when a page keeps a window alive, as in the window.my_popup
example above. A page can keep a window from any origin alive, but my hope was that most of the time, pages would touch windows only from the same top-level origin. (For example, mail.google.com
might keep alive plus.google.com
, but hopefully not bing.com
.)
This is a crude heuristic, but it seems to work well enough: I get few false positives in practice.
Notice that criterion (2) means that a ghost window can appear and disappear as you browse. If you have a leaked nytimes.com
window, it will disappear as a ghost window as soon as you open a new tab to nytimes.com
. It's this behavior that suggested the name "ghost". Again, this is OK because the goal is just to capture the obviously leaked windows.
One final note: You can view the list of ghost windows in about:compartments. There, we waive the 60-second waiting period, because loading about:compartments triggers a set of GC/CCs which should collect all garbage windows.
So now that we have ghost windows, let's use them.
You may have heard that Kyle Huey landed a big add-on leak fix for Firefox 15. We believe that his patch eliminates the vast majority of window leaks due to add-ons, and in our experience, leaked windows are the most common and the most severe type of add-on leak.
You may also have heard that Kyle's fix caused new leaks with add-ons built against old versions of the Jetpack SDK, and that we later fixed this problem.
I wanted to verify whether these changes actually had the effect we expected, and thankfully, I'd landed telemetry for ghost windows a few weeks before, so we could observe how his patches changed this measure.
Here's the plot. (This is data from Firefox nightlies on Windows.)
There are three points of interest here:
4/26 — Kyle's first patch landed.
Despite the fanfare surrounding its landing, this was actually the smallest change of the three. There's little to no change in the mean number of ghost windows (the blue dots), but there is a drop in the 75th percentile (top of the black vertical lines).
5/4 — This was an unexpected drop. My best guess is that it was caused by bug 751561, because that's the only change in the changeset range which seems plausible.
We don't have a good explanation for why this change affected the number of ghost windows.
5/10 — Kyle's second patch landed, fixing the jetpack issues.
(If you want to look at the latest data, go here, sign in with any Mozilla Persona (formerly BrowserID) account, click "telemetry evolution", then select GHOST_WINDOWS in the drop-down.)
This data is surprising, and not just because of the drop on 5/4 that we can't explain. I expected a big drop after Kyle's first patch and a small drop after his second one, but we observed the opposite.
I don't know exactly what that means, but I'd guess either a lot of our nightly testers were using add-ons built with old versions of the Jetpack SDK, or the Jetpack fix also fixed other problems we weren't aware of.
I think we can conclude from this data that we're in a much better state now with respect to ghost windows than we were. This data is also evidence that counting ghost windows actually measures leaks in Firefox.
I think it's reasonable at this point to ask: Why bother with telemetry for ghost windows? Couldn't I — in fact, shouldn't I — have looked at actual memory usage, in the form of resident set size (RSS)?
Of course RSS is an important metric, but there are a few reasons ghost windows are a reasonable thing to look at:
RSS is noisy.
There may have been a drop in RSS numbers since we landed these fixes, but the data is noisy enough that I couldn't use the RSS data to prove that we fixed anything. In contrast, the trend in the ghost windows data is clear.
Leaked windows are harmful even when they don't perceptibly increase memory usage.
In particular, leaked windows can cause cycle collector times to increase, because as soon as a window's docshell becomes null, the cycle collector thinks it may be garbage and tries to collect it. In its vain effort to collect the leaked window, the CC traces every DOM node attached to the window, causing longer CC pauses.
The lesson here is that, for verifying fixes, specific measurements are often more useful than general ones. Since you probably want at least a week's worth of baseline data, verifying fixes can take some planning!
Ghost windows were developed before compartment-per-global. With CPG, it may be possible to remove ghost window criterion (2), which would be great.
We have one other add-on related bug on our plate at the moment. It turns out that our FUEL API never frees most of the objects it creates. And under some circumstances, it creates a lot of objects.
There are patches in the bug awaiting review; I hope we'll have this fixed for Firefox 15 or 16.
The fact that FUEL has been leaking for a long time (since 2007?) suggests to me that there may be other dark corners of Firefox which are similarly problematic.
But the bottom line is, we're making progress. I'm hopeful that many users with leaky add-ons will find Firefox 15 to be a noticeable improvement.