In this post I define the concept of "hardware completeness" and show its relevance to the design of programming languages and compiler IRs.
Let's say a programming language or compiler intermediate representation (IR) is "hardware-complete" on hardware platform X if you can use it to generate code that's equivalent to any valid machine code program for platform X. For our purposes, "equivalent to" measures both functionality and performance.
Some examples of hardware-complete languages are:
assembly itself (more or less by definition),
C (with inline asm as an escape hatch), and
LLVM IR.
Of course, you can't always get 100% of the performance of handwritten machine code out of, say, a C program—at least, not without replacing your whole program with a giant block of inline asm (although I'd argue the fact that you can do that is part of what makes C hardware-complete). But philosophically, hardware-complete languages try to make it possible to get close to "100% performance", and if you can't, it's probably a bug.
In the world of machine learning, we have many languages and compiler IRs which are not hardware-complete. Some examples of hardware-incomplete languages for ML are:
XLA (HLO),
ONNX,
TensorRT, and
Triton IR.
These are incomplete because there are many valid CPU and GPU programs that you cannot write using these languages. Even if the language is Turing-complete so you can always get functional equivalence, it would often come at the cost of a huge slowdown compared to handwritten assembly (or C).
Using the word "incomplete" might make it sound like a bad thing, but hardware-incomplete IRs have many benefits, including:
Incomplete languages usually target a specific domain (e.g. ML), so (hopefully!) it's more pleasant to write domain-specific code in one of these languages than in a complete, general-purpose language.
Incomplete languages usually have restrictions that make it easier to do compiler optimizations.
It's easier to make incomplete languages portable to different hardware, because you're not aiming to generate any valid program on the target hardware. (For example, LLVM IR is not actually portable; you can't take LLVM IR generated for x86 and expect to run it on a GPU.)
But one negative property of incomplete IRs is that it's hard to translate from one to another. That's because in order to translate from language A to language B, every machine code program that can be generated by A must also be generatable by B, with roughly equivalent performance. This works by definition if B is complete, and usually doesn't work if B is incomplete. And even if it works at any one point in time, incomplete languages are always changing (because you are always finding new programs you want to generate), and it's vanishingly unlikely that they will change in the same way and at the same pace.
For example, although XLA, ONNX, and TensorRT all try to cover roughly the same subset of valid programs on Nvidia GPUs, in practice they do not overlap perfectly, so it's not possible to translate an arbitrary XLA program to ONNX or TensorRT.
Another fact is, because XLA, ONNX, and TensorRT all try to cover roughly the same subset of GPU programs, they look very similar. So similar you might be able to convince yourself that you can translate between them. And you'll be able to get 90% of the way there before you understand the differences well enough to realize how doomed you are.
Of course, if a programming language is designed around the limitations of an incomplete IR, that can work. For example, JAX (originally, "Just Another XLA") is designed around the limitations of XLA, so it can lower to XLA without issue. But it would be hard to lower JAX to a different incomplete IR with slightly-different limitations. Indeed many incomplete IRs have their own dedicated language. For example, the Triton language is tightly coupled to Triton's IR.
The above was just definitions and facts. What can we conclude from it?
Upshot 1: As an ML compilers person, I frequently have the following conversation with hardware people.
Them: We have developed a new IR and compiler for machine learning. We will take your IR (XLA, Triton IR, whatever) and lower it to our new IR. Then all of your programs will get great performance on our hardware.
Me: Thanks but no thanks. I won't be able to lower all Triton programs to your IR.
Them: I don't understand, you lower XLA/Triton programs to LLVM IR today, right? What's the difference?
The fundamental problem is that the hardware person is asking me to lower my incomplete IR to another incomplete IR. For the reasons above, this is unappealing and probably won't work.
Upshot 2: If you're designing an incomplete IR, I'd say you probably should design a programming language to go with it. This way you can enforce your IR's limitations at the level of user code, instead of having performance and functionality cliffs that are unpredictable to your users and that depend on which compiler backend they're using. I think this tight coupling between language and IR may be one of the reasons JAX and Triton have been so successful.
A corollary of this is that your incomplete IR probably won't be relevant unless its associated programming language is successful. At which point, you should probably think about the programming language first and the IR second.
Upshot 3: This also explains my skepticism of "ML compiler hourglass diagrams". Behold.
In all of these cases, the narrow part of the hourglass is an incomplete language, so it has all the problems I've described above.
Thanks to Kyle Huey and Mehdi Amini for feedback on drafts of this post.
It took me years before I understood exactly what a convolution is in the machine learning context. I understood the simple "filter moving over an image" convolution, but I couldn't figure out how to translate that into the operation that's performed by machine learning.
I read posts that used complex 4D diagrams (3D plus animation!) or 6-deep nested for loops. Nothing against that if it makes sense to you, but it never did to me.
It turns out there's a simple transformation from a basic convolution to an ML convolution. I haven't seen it explained this way anywhere else, so this is what I want to show in this post. Maybe it will help you like it helped me.
As a bonus, I'll show how this perspective makes it straightforward to observe the fact that 1x1 ML convolutions are just matrix multiplications.
To review, a "basic convolution" is a function which takes as input two 2D arrays of real numbers:
We expect the filter to be smaller than the image.
The basic convolution drags the filter over the image. At each point of overlap, we produce one output element by multiplying the values in the filter by the values in the image that are "beneath" them and adding up these products.
When the filter is in the top left of the input image, this process produces the top-leftmost output point. If the filter moves (say) one element rightward, we now generate the output point one element to the right from the top-left.
We thus produce an output image of size \( (H_o, W_o) \) (output height/width). The output is a little smaller than the input image because the kernel can't exceed the bounds of the input. (You can fix this by padding the input with 0s around the edges.)
To make up some notation describing the dimensions of the input/output arrays, we might write
\[ \operatorname{ConvBasic}( (H_i, W_i),\ (H_f, W_f) ) \rightarrow (H_o, W_o). \]
If you're not familiar with all this, check out this explainer, up to but not including the section entitled "the multi-channel version".
The basic convolution is simple enough. But an "ML convolution" is a different beast.
An ML convolution is an operation on 4D arrays of real numbers.
We can write this in our notation from above as
\[ \operatorname{MLConv}( (N, H_i, W_i, C_i),\ (H_f, W_f, C_i, C_o) ) \rightarrow (N, H_o, W_o, C_o). \]
Yikes, that's a lot of dimensions. What do they all mean, and how do they interact?
That's what I want to break down in this post. We'll take \( \operatorname{ConvBasic} \) and add one dimension at a time, until we have the full \( \operatorname{MLConv} \) function.
First, let's add input channels to the image.
Consider the elements of the image. Instead of them being real numbers, suppose they're vectors in \( \mathbb{R}^n \). For example, maybe your image is RGB; in this case, \( n=3 \). Or maybe your image is the result of an intermediate layer in a convolutional net; in this case, maybe \( n=128 \).
How do we "multiply" the elements of the image by the filter? One option is to let the elements of the filter also be vectors in \( \mathbb{R}^n \). Now when we multiply an element of the input by an element of the filter, we take their dot product, giving us a real number. Then like before, we add up all these dot products to get the output element, which is also a real number.
As you might have guessed, \( n \) is called the number of "input channels". It corresponds to dim \( C_i \).
So now we have a function which takes as input two 2D arrays, of size \( (H_i, W_i) \) and \( (H_f, W_f) \), where each element is a vector in \( \mathbb{R}^{C_i} \). Or equivalently, we take two 3D arrays of real numbers, of dimensions \( (H_i, W_i, C_i) \) and \( (H_f, W_f, C_i) \).
This operation still outputs a 2D array of real numbers of size \( (H_o, W_o) \), where the output height/width are a bit smaller than the input.
In our notation, this is
\[ \operatorname{ConvInputChannels}( (H_i, W_i, C_i),\ (H_f, W_f, C_i) ) \rightarrow (H_o, W_o). \]
Above, our output image is only 2D. Let's add the \( C_o \) ("output channels") dimension to make it 3D.
Suppose we create \( C_o \) independent filters and repeat the process above once for each filter. This results in \( C_o \) 2D output images.
Now the output is a 3D array of dimension \( (H_o, W_o, C_o) \). The filter also gains an additional dimension, \( (H_f, W_f, C_i, C_o) \).
Notice that we take the dot product of each element of the input image, a vector in \( \mathbb{R}^{C_i} \), with \( C_o \) different vectors from the filter. You can think of this as taking a repeated dot product, or equivalently as a matrix-vector multiplication.
In our notation, we now have
\[ \operatorname{ConvInOutChannels}( (H_i, W_i, C_i),\ (H_f, W_f, C_i, C_o) ) \rightarrow (H_o, W_o, C_o). \]
The only remaining dimension to handle is the "batch" dimension \( N \).
To add this, we have \( N \) independent input images which generate \( N \) independent output images. This is the batch dimension.
Now each output element is a \( (C_o, N) \) matrix, the sum of matrices formed by multiplying a \( (C_o, C_i) \) filter element by a \( (C_i, N) \) matrix.
As before, we repeat this process for each of the \( H_o, W_o \) output elements.
We now have all four dimensions, so we can properly call this \( \operatorname{MLConv} \):
\[ \operatorname{MLConv}( (N, H_i, W_i, C_i),\ (H_f, W_f, C_i, C_o) ) \rightarrow (N, H_o, W_o, C_o). \]
Notice that the only difference between these four functions is in how we "multiply" elements.
That's all an ML convolution is!
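If the loop version does work for you, here's a minimal reference sketch (mine, not taken from any framework) of a naive MLConv on flat, row-major arrays. Each loop is labeled with the dimension it picked up in the walkthrough above, and setting \( N = C_i = C_o = 1 \) recovers \( \operatorname{ConvBasic} \).
#include <cstddef>
#include <vector>
// input:  N  x Hi x Wi x Ci
// filter: Hf x Wf x Ci x Co
// output: N  x Ho x Wo x Co, with Ho = Hi - Hf + 1 and Wo = Wi - Wf + 1 (no padding, stride 1).
std::vector<float> MLConv(const std::vector<float>& in, const std::vector<float>& filt,
                          int N, int Hi, int Wi, int Ci, int Hf, int Wf, int Co) {
  const int Ho = Hi - Hf + 1, Wo = Wi - Wf + 1;
  std::vector<float> out(static_cast<std::size_t>(N) * Ho * Wo * Co, 0.0f);
  for (int n = 0; n < N; ++n)                  // batch dimension
    for (int ho = 0; ho < Ho; ++ho)            // output height: filter position
      for (int wo = 0; wo < Wo; ++wo)          // output width: filter position
        for (int co = 0; co < Co; ++co)        // output channels: one independent filter each
          for (int hf = 0; hf < Hf; ++hf)      // filter height
            for (int wf = 0; wf < Wf; ++wf)    // filter width
              for (int ci = 0; ci < Ci; ++ci)  // input channels: the dot product
                out[((n * Ho + ho) * Wo + wo) * Co + co] +=
                    in[((n * Hi + ho + hf) * Wi + wo + wf) * Ci + ci] *
                    filt[((hf * Wf + wf) * Ci + ci) * Co + co];
  return out;
}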
Now for our bonus fact. Suppose the filter's height and width are 1 — i.e. \( H_f = W_f = 1 \). I claim that this convolution is just a matrix multiplication.
First, observe that the batch, image height, and image width dimensions are interchangeable in a convolution with a 1x1 filter. That is, because there's no "mixing" along the height/width, we can reshape a \( (N, H_i, W_i, C_i) \) input into \( ( N \times H_i \times W_i, 1, 1, C_i ) \), do the conv, and then reshape back to the desired output shape.
So it's sufficient to show that a conv with 1x1 filter and input height/width 1 is a matmul.
But look at the diagram for "Batch Dimension" above. If the input height and width are 1, then there's only one output element. And if the filter size is 1x1, then there's only one term on the RHS, so the one output element is computed as a matmul. Thus, the whole operation is equivalent to a single matrix multiply! □
Thanks to George Karpenkov, Marek Kolodziej, Blake Hechtman, Kyle Huey, and Alexander Zinoviev for their feedback on drafts of this post.
rvalue references are one of the most powerful new features in C++11. But they're also, in my experience from teaching classes at work and reviewing code, one of the most difficult to grok.
I've come to think this isn't so much because rvalue references are particularly complex (by C++'s standards, anyway). Rather, much of the difficulty seems to stem from the fact that so much of how they work has no syntax attached. If you don't know the implicit rules, rvalue references just look like magic.
In this post, I'm going to try to motivate and explain rvalue references using practical examples. My hope is that this will demystify both how they work, and why they were designed this way.
Note to language lawyers: I'm intentionally playing fast and loose with some parts of the standard here. "Practical" is the key word in the title. :)
Imagine you're writing a simple stack class, based on a singly-linked list.
template<typename T>
class Stack {
private:
struct Node {
T elem;
Node* next;
};
Node* first;
public:
void push_front(const T& t) {
Node* n = new Node();
n->elem = t;
n->next = first;
first = n;
}
// pop_front, constructors omitted.
};
Now suppose you use your class as follows.
Stack<string> s;
s.push_front(get_string());
Let's break down what this does:
get_string() creates and returns a string. This involves malloc'ing a buffer on the heap and writing some data into it.
push_front() copies the given string into a newly-allocated Node (n->elem = t;). This involves malloc'ing a buffer for the new string's data and memcpy'ing the original string's data into the new buffer.
push_front() returns, and we destroy the original string, free'ing its buffer.
This is frustratingly inefficient. You and I know that the string returned by get_string() is going to go away right after the call to push_front. Why do we need to malloc a new buffer and free the old buffer, when we could simply steal the old buffer for our own purposes? This would let push_front avoid calling malloc, free, and memcpy entirely, for a big potential speedup.
Let's first look at how we'd solve this inefficiency without C++11.
We want to let users of our class get this faster behavior, so before C++11, we might write a new member function.
void push_front_destructive(T& t) {
Node* n = new Node();
using std::swap;
swap(n->elem, t);
n->next = first;
first = n;
}
(This business with using std::swap
is an ADL trick; it lets us call a
specialized version of swap
if one exists for the type T
, otherwise it
calls std::swap
.)
This is great, it runs in O(1) time, and steals the pointer just like we want.
Now we try to use our stack like we did originally.
Stack<string> s;
s.push_front_destructive(get_string()); // ERROR
But this is a compile error! The problem is that C++ does not let you bind a
temporary value (get_string()
) to a non-const reference (t
). One of the
creators of C++ explains that this decision was made to prevent
certain kinds of bugs. We might disagree, but whatever, it's the rule, and it
gets in our way here.
We can still make this work, but it's a bit ugly:
Stack<string> s;
string str = get_string();
s.push_front_destructive(str);
This is OK because now we're asking the non-const reference arg in
push_front_destructive
to bind to the non-temporary str
.
(You may have noticed that I haven't defined what a "temporary" actually is.
The exact definition is...complicated. For our purposes, it's a
sufficient model to say that something is a temporary if you can only use it
once. So for example, get_string()
is a temporary, because you can do
get_string().c_str()
or get_string().length()
or whatever, but after you do
that one thing, you can't go back to the original object and do a second thing.
In contrast, if I have a string x
, that's not a temporary because I can
touch x
multiple times.)
This is sort of the best we can do without C++11. It's not awful, but there
are a few things that could be improved. For one thing, having to assign
get_string
into a named variable is ugly. In addition, we have to remember
to call push_front_destructive
rather than the regular push_front
.
But from the perspective of C++'s design philosophy, the worst part may be that
we still do unnecessary work. It's not as much as before, but we have to
default-construct a T
before swapping into it, and we have to do both sides
of our swap, even though we only care about the final value of n->elem
. Both
of these steps can be expensive on arbitrary objects.
So here's where C++11 comes in to solve our problems. C++11 introduces a new reference type, called the rvalue reference. It is exactly like a non-const reference, except it's only allowed to refer to temporaries. (Recall that a regular non-const reference can refer only to non-temporaries.)
Again, to emphasize: An rvalue reference is basically the same as a regular (non-const) reference. The only difference is in the referent: A regular non-const reference always points to a non-temporary, while an rvalue reference always points to a temporary.
Using just rvalue references and nothing else new, we can add an overload of
push_front
that accepts only temporary objects. This overload can
destructively modify its inputs at will, because (the assumption is) they're
going away anyways.
void push_front(T&& t) {
Node* n = new Node();
using std::swap;
swap(n->elem, t);
n->next = first;
first = n;
}
(An rvalue reference is spelled &&
. The double-ampersand is a single unit;
don't try to read it as "reference-to-a-reference".)
The body of our new function is exactly the same as push_front_destructive
,
because rvalue references behave exactly like regular references.
With this new overload, we can now write code that uses our stack like the following.
Stack<string> s;
s.push_front(get_string()); // Swaps
string str = ...;
s.push_front(str); // Copies
We now have two overloads of push_front
: A version which takes a const
reference (which can bind to a temporary or a non-temporary) and doesn't modify
its input, and a version which takes an rvalue reference (which can only bind to
a temporary) and swap
s its input. The first call to push_front
passes a
temporary, so it can bind to either overload. The compiler picks the rvalue
ref one, because it's more specific. The second call to push_front
passes a
non-temporary, so it can only bind to our original function, which copies.
Consider again the most recent use of our stack class above. Suppose we wanted
to change it so we invoke the new rvalue ref overload for the second call to
push_front
, because we don't care about the value of str
after push_front
completes. C++11 adds a mechanism for us to do this: std::move
. It looks
like this.
string str = ...;
s.push_front(std::move(str)); // Swaps
All std::move
does is imbue its argument with temporaryness. I know,
it's confusing, because "move" is a verb; it sounds like it should do
something. But in fact it's just a cast, just like const_cast
or
reinterpret_cast
. Like those casts, std::move
just lets us use an object
in a different way; on its own, it doesn't do anything.
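To drive the point home, here's roughly what an implementation of std::move boils down to. (This is a sketch of the idea, using my own name my_move; it's not the exact standard-library source.) It does no work at all; it just hands back the same object with a "temporary" (rvalue reference) type.
#include <type_traits>

template <typename T>
typename std::remove_reference<T>::type&& my_move(T&& t) noexcept {
  // No copying, no moving, no side effects: just a cast to an rvalue reference.
  return static_cast<typename std::remove_reference<T>::type&&>(t);
}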
Now that we have rvalue references and std::move
, we've solved some of the
problems with our original push_front_destructive
. We no longer have to name
our temporaries and remember to call push_front_destructive
in order to get
the faster behavior. This is great! But our new overload of push_front
still
isn't as efficient as it could be. We're still default-constructing a T
, and
we're still swap
'ing, even though we only care about one side.
The solution? We'll apply this same technique of adding rvalue-reference overloads alongside const-reference ones, but this time to constructors.
Consider std::string
. It's always had a copy constructor, which takes its
argument by const reference. Suppose we added a new constructor, which takes
its argument by rvalue reference. It might look like this.
class string {
public:
string(const string& other); // Copy constructor, exists pre C++11
string(string&& other) { // Move constructor, new in C++11
length = other.length;
capacity = other.capacity;
data = other.data;
other.data = nullptr;
}
private:
size_t length;
size_t capacity;
const char* data;
};
This new constructor is called a move constructor. Our implementation
takes a temporary string, steals its pointers, and copies its non-pointer
members. Note that we have to set other.data = nullptr
, otherwise when
other
is destroyed, it will free
its data
pointer, and our data
pointer
will then be stale. (We might also have to modify string
's destructor to
handle a null data
pointer.)
Now when we create a string, we can play the same overloading game we did with
push_front
. If we pass a non-temporary, we call string
's copy constructor
(which malloc
's a new buffer and so on), but if we pass a temporary, we call
the move constructor.
string a(get_string()); // move constructor
string b(a); // copy constructor
string c(std::move(b)); // move constructor
Note that if string
didn't have a move constructor, the line string
c(std::move(b))
would just call the copy constructor, because its const
string&
arg happily binds to temporaries. But everything in the C++ standard
library has been updated to have move constructors where appropriate.
The addition of move constructors is a particularly cool change to the standard
library. The line string a(get_string());
could have been written pre-C++11,
and it would have resulted in an O(n) copy. But now, just by compiling our
code with -std=c++11
, it's an O(1) move instead! Getting asymptotic speedups
in your code is impressive for a backwards-compatible language change.
In what you're probably noticing is a pattern, this same trick can be applied
to the assignment operator, operator=
. Here's what that looks like:
class string {
public:
string& operator=(const string& other); // Copy assn operator, pre C++11
string& operator=(string&& other) { // Move assn operator, new in C++11
length = other.length;
capacity = other.capacity;
delete data; // OK even if data is null
data = other.data;
other.data = nullptr;
return *this;
}
};
string a, b;
a = get_string(); // Move assignment
a = b; // Copy assignment
a = std::move(b); // Move assignment
Just like the pre-C++11 compiler would try to auto-generate a copy constructor
and copy assignment operator for you, the C++11 compiler will try to
auto-generate a move constructor and move assignment operator for you. The
rules for when you get one of these are complicated, but the
main one is that if you define a custom copy constructor, the compiler won't
auto-generate a move constructor or move assignment operator for you. You can
use =default
if you want to override this behavior, and you can use =delete
if you want to explicitly prevent the compiler from auto-generating one of these
functions.
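For example, a sketch like the following (the class name is mine) asks the compiler for a move constructor even though we wrote a custom copy constructor, and explicitly forbids move assignment.
class Widget {
 public:
  Widget(const Widget& other);                 // Custom copy constructor: suppresses the auto-generated moves.
  Widget(Widget&& other) = default;            // Ask the compiler to generate the move constructor anyway.
  Widget& operator=(Widget&& other) = delete;  // Explicitly forbid move assignment.
  // Other members omitted.
};
One subtlety: a deleted move assignment operator still participates in overload resolution, so w = std::move(other) becomes a compile error rather than quietly falling back to copy assignment.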
In most general-purpose code, you're likely going to interact with rvalue
references pretty much exclusively via move constructors and move operators.
Non-constructor functions like push_front
that take rvalue references are
relatively rare, and mainly used in collections, wrapper classes, and for
perfect forwarding in functions that wrap constructors, such as
std::make_shared
.
After all that explanation, I can finally show you how move constructors solve the remaining problems in our push_front implementation. To remind you, we currently have
void push_front(T&& t) {
Node* n = new Node();
using std::swap;
swap(n->elem, t);
n->next = first;
first = n;
}
Our goals are to
(1) get rid of the swap (we only care about one side of it), and
(2) avoid default-constructing a T inside the Node constructor (since, why bother).
(1) is easy, we can use the move-assignment operator, as follows.
void push_front(T&& t) {
Node* n = new Node();
n->elem = std::move(t);
n->next = first;
first = n;
}
The thing to note here is that we do need the std::move
, if we want to call
the move assignment operator, even though t
is an rvalue reference. Recall
that rvalue references are just like regular references; the only thing that's
different is that we know the referent is a temporary. In particular, t
itself is not a temporary!
A surprising consequence of this can be seen in the following code.
void foo(const string&) {}
void foo(string&& s) { foo(s); }
foo("bar");
This is not infinite recursion, because s
is not a temporary! An rvalue
reference merely points to a temporary; it is not a temporary itself. This is
weird, I know, but it would be weirder if it were the other way around.
(Exactly why is left as an exercise to the reader.)
Okay, we have (1). What about (2)? To avoid default-initializing Node::elem
,
we need to define and call a constructor on Node
that moves t
. It might
look like this.
struct Node {
Node(T t, Node* next) : elem(std::move(t)), next(next) {}
T elem;
Node* next;
};
void push_front(T&& t) {
first = new Node(std::move(t), first);
}
This still isn't perfect, as we actually move-construct a T
twice: Once inside
push_front
when we call the Node
constructor, and again within the Node
constructor. In general this isn't a big deal, and the compiler can likely
optimize it away. But if you really cared, you could modify Node's constructor
to take a T&&
. You would still need the std::move
!
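For completeness, here's a sketch of that variant. Both std::moves are still required, because a named rvalue reference is not itself a temporary.
struct Node {
  Node(T&& t, Node* next) : elem(std::move(t)), next(next) {}  // move-construct elem directly
  T elem;
  Node* next;
};

void push_front(T&& t) {
  first = new Node(std::move(t), first);
}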
(You couldn't avoid writing this constructor on Node
by using
aggregate initialization, as that always copies its elements. I have no
idea why that choice was made.)
And that's it, we're done! Easy, right? :)
Well, we're not quite done. No discussion of rvalue references would be
complete without talking about std::unique_ptr
. unique_ptr
is a move-only
class: You can't copy one (because then the pointer it wraps would not be
"unique"). We say unique_ptr
"owns" a pointer, because when the unique_ptr
goes away, it delete
s its pointer.
An idiom you'll probably encounter with unique_ptr
(and other classes, even
ones that are not move-only) is passing unique_ptr
by value to indicate a
transfer of ownership.
void foo(unique_ptr<Bar> bar);
unique_ptr<Bar> my_bar;
foo(std::move(my_bar));
Here we give foo
our copy of my_bar
. foo
now has the only owning copy,
and can do whatever it wants with it (e.g. store it in a global variable
somewhere).
Your first question here might be, what's the value of my_bar
after we call
foo
? The answer depends on what unique_ptr
's move constructor does!
Hypothetically, it could do basically anything. It could leave my_bar
in a
state such that you get undefined behavior if you call any functions on it.
The only requirement is that we can safely run my_bar
's destructor.
Some classes are more constrained in their behavior after being moved from.
According to the spec, a unique_ptr
is null after it's moved from.
std::string
is in a "valid, but unspecified state", meaning, you can safely
call functions on the string, but who knows what you'll get.
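Here's a quick illustration of the unique_ptr guarantee (a toy snippet of mine, not part of our Stack example).
#include <cassert>
#include <memory>
#include <utility>

int main() {
  std::unique_ptr<int> a(new int(42));
  std::unique_ptr<int> b = std::move(a);  // invokes unique_ptr's move constructor
  assert(a == nullptr);                   // guaranteed: a moved-from unique_ptr is null
  assert(*b == 42);                       // b now owns the int
}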
Another common question I get is, if std::move
isn't where we move my_bar
,
where the heck does the move happen? My model is as follows. Before we start
running foo
, we execute a function prelude, which sets up foo
's arguments.
In particular, for each parameter P in foo
's declaration, we construct a new P
inside the scope of our call to foo
, using the argument passed to the call.
So in the case above, where foo
takes a unique_ptr<Bar> bar
parameter by
value and we do foo(std::move(my_bar))
, it's as though foo
executes
unique_ptr<Bar> bar(std::move(my_bar))
in its prelude. This is where the
move occurs. (Note that if foo
took its parameter by, say, rvalue reference,
then the same call to foo
would result in us doing unique_ptr<Bar>&&
bar(std::move(my_bar))
in the prelude, which is just binding a reference; it's
not a copy or a move.)
Something similar happens when you return a value: In the function epilogue, we
construct a new value in the calling scope using the value you're returning.
There's a wrinkle with C++11, however, which is that when you return foo
, if
foo
is a local variable or function argument, it is implicitly a temporary.
You don't need to return std::move(foo)
to get efficient move construction of
the return value.
In fact, you don't want to return std::move(foo)
, because (hilariously) that
prevents RVO. If your type is efficiently movable, this probably isn't a
big deal. But if your type is not movable, then you end up with a full-blown
copy! (RVO would have elided this copy, making it free.) clang 3.7's
-Wpessimizing-move
will warn you about this footgun. You do still need
std::move
if you're doing something like return std::move(foo.bar)
–
only arguments and local variables, not their members, are implicitly treated
as temporaries in return statements.
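Here's a sketch of the footgun (my example, not from the stack code above). The first function gets RVO or, at worst, a cheap move; the second defeats RVO and forces a move (or a full copy, if the type isn't movable); the third is the case where std::move really is needed.
#include <string>
#include <utility>

std::string good() {
  std::string s = "hello";
  return s;                    // RVO, or at worst a move; no std::move needed
}

std::string pessimized() {
  std::string s = "hello";
  return std::move(s);         // triggers -Wpessimizing-move: defeats RVO
}

struct Wrapper { std::string inner; };

std::string from_member(Wrapper w) {
  return std::move(w.inner);   // needed: members aren't implicitly treated as temporaries
}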
unique_ptr
is awesome because it lets you avoid a lot of bugs, and it lets you
make any class efficiently movable. And in fact since most of your code isn't
performance-sensitive, but presumably all of it is sensitive to memory leaks and
use-after-free bugs, I'd say that unique_ptr
is actually more important than
all this avoiding-copies stuff we've been talking about. But unique_ptr
is
so convenient, I sometimes see people using it everywhere, even when a plain
value type would do. I call this the "I can write Java in any programming
language" antipattern.
For example, instead of writing
void foo(unique_ptr<vector<unique_ptr<Bar>>> v);
why not just write
void foo(vector<unique_ptr<Bar>> v);
? vector
is already efficiently movable, and in fact vector<unique_ptr<T>>
is not copyable (because a vector is not copyable if its elements are not
copyable). So the unique_ptr
buys you nothing other than some extra typing
and maybe a cache miss.
Similarly, instead of writing
void foo(const unique_ptr<Bar>& bar);
why not write
void foo(const Bar* bar);
The former is annoying because someone can only call your function if the data
is actually stored in a unique_ptr
somewhere. But maybe it's stored on the
stack, or maybe it's in a raw pointer for some reason. The former form is kind
of mean to your callers.
One last antipattern: Instead of writing
void foo(unique_ptr<T>&& ptr);
consider simply
void foo(unique_ptr<T> ptr);
Because unique_ptr is move-only, callers have to std::move the argument to the second foo, and you know that the value afterwards is null. Whereas with the first one, the value after foo completes could be anything. I prefer the second form because its behavior is more predictable.
You can extend this argument to types which are both movable and copyable; e.g.
if you're writing a function which needs to take ownership of a string, prefer
foo(string s)
to foo(string&& s)
. Both let you avoid a copy if you're
passing a temporary, but the former is a more flexible API, as it will happily
make a copy of a non-temporary. Of course, if you really want to prevent
copying, maybe taking the arg by rvalue ref is for you.
If you remember nothing else, remember the following two rules. They are enough to recover most of the content of this post.
An rvalue reference is a reference
to a temporary.
std::move is a cast
that lets you treat its argument as a temporary.
The linebreaks here are significant: You can stop reading each of the rules
after the first line. Rvalue references are just like normal non-const
references (except they always point to temporaries, which regular non-const
references never do). std::move
is a cast, so does nothing by itself; it just
lets you treat something as a temporary.
Once you know about move constructors, it can be easy to tie yourself into knots while trying to avoid all unnecessary copies. Some people even try to avoid unnecessary moves. I'd say, hakuna matata, try not to worry about it. Premature optimization is the root of all evil; just write clean code and profile it if you care. It's been my experience that the rvalue rules often make the code you'd naturally write pretty fast, anyway.
If you want a more rigorous take on this material, I recommend Thomas Becker's article. In particular, he covers perfect forwarding and reference collapsing, which are useful for library classes. Becker also has a good article on auto and decltype, if you're interested.
http://cppreference.com is also your friend. It's dense, but it has all the rules, and I've found learning to read that stuff to be well worth the effort.
Good luck, and happy coding!
Thanks to Andrew Hunter and Kyle Huey for many improvements to this post.
Apple's page describing Mac OS 10.9 contains a comparison of the performance of Safari, Chrome, and Firefox. I've reproduced it below since I expect Apple will take the page down soon.
The data is bunk, of course. SunSpider is a terrible benchmark, and without workloads specified for the memory usage and power consumption measurements, the results mean nothing.
(In case you're wondering, the footnotes referenced under the graph specify the hardware and browser versions used. They also say that "Performance will vary based on system configuration, network connection, and other factors," which reminds me of the first frame of this xkcd.)
But to focus on the crappy benchmarking here would miss the point: Did you notice that Safari's memory usage gets equal billing with its JS speed? That's amazing! Just a few years ago, nobody talked about memory usage. I'd like to think we (and in particular Nick) can take some credit for that.
Nicholas Nethercote is on vacation, so this MemShrink report is a guest post written by Andrew McCreight.
May has been a quiet month for MemShrink, as vacation season begins, but one very important fix landed, as well as the fix of a new leak.
Brian Hackett took advantage of the switch to the Baseline JIT compiler to delay the analysis of JS scripts until after they have been compiled by the Baseline compiler. This reduces memory usage for scripts that are only run a few times, which is fairly common. On desktop, this reduced startup resident memory by 14mb, which is about 10%. On Fennec, the reduction was 2mb. That 14mb reduction entirely eliminates the difference in memory between startup and 30 seconds later. This was a MemShrink P1.
Andrea Marchesini fixed a leak involving DOMError and IndexedDB found by Jesse Ruderman. Fuzzing (automatically generating and running test cases) is an important approach to finding leaks in newly landed code, in situations Firefox developers did not think to test.
Nicholas Nethercote is on vacation, so this fortnight's MemShrink report is a guest post written by Andrew McCreight.
Joe Drew made it so that, on B2G, images being displayed on the active page aren't locked, allowing them to be discarded when there is memory pressure. This is important for displaying image-heavy pages like Pinterest on B2G without OOM crashes.
Andreas Gal added a preference to control the size of the canvas image cache. The use of canvas with large images was causing OOM crashes for the B2G Gallery app. The limit is set to 10mb for B2G, and remains unlimited on desktop.
Scoobidiver noticed a large increase in empty crashes on Firefox 23. Previously, these have been identified as being caused by leaks of virtual address space. Scoobidiver and Benjamin Smedberg investigated, found a possible culprit, and saw that the number of empty crashes went down to its previous level, or even lower, after some followup work for that bug was landed.
Timothy Nikkel fixed a leak caused by some recent changes to reflow-on-zoom.
Justin Lebar investigated a leak involving long-running automated testing on B2G and determined that it was the same layers leak previously found and fixed, but not backported to B2G18, by new contributor Christophe Mourand.
Randell Jesup fixed a WebRTC leak.
Brian Hackett reduced the memory usage of IonMonkey compilation, which should help on things like JS games that intensively run a lot of JS.
Nicholas Nethercote wrote a patch to change how decommitted GC arenas are shown in about:memory. Decommitted arenas don't actually use physical memory, so they need to be accounted for differently.
Boris Zbarsky reduced the memory usage of CSS in certain cases.
In B2G, we try to dynamically change processes' CPU priorities depending on how important we think the process is. For example, processes in the foreground get a higher CPU priority.
I spent the past few days tracking down a bug where our mechanism for changing the CPU priorities didn't seem to be working properly. Thanks to Gabriele Svelto, we figured out today what was going on.
The issue is that the setpriority
syscall on Linux does not actually set a
process's priority, as you might be led to believe from its man page. Instead,
setpriority
changes the priority of a process's main thread. It leaves the
priority of the process's other threads unchanged. I haven't tested, but from
reading the kernel sources, the nice
syscall should behave the same way.
Maybe this gotcha is common knowledge, but I was certainly caught unawares.
Here's a simple demonstration of the issue using this testcase. Be warned that this is a Linux-only test, although it's not hard to fix if you're intent on trying it on Mac OS. (Spoiler: renice on Mac OS sets the priority for all threads in the process.)
$ cd $(mktemp -d)
$ curl https://gist.github.com/jlebar/5367521/raw/849b6f5c58d054cf10568ac1f3902ab679f9db4f/gistfile1.cpp > test.cpp
$ g++ -O2 test.cpp -o test -lpthread -lrt -lm
$ taskset 1 ./test
tid 1399: Finished iteration in 1.65s
tid 1400: Finished iteration in 1.64s
Notice that the two threads are running at the same speed. Now let's change
the priority. renice
simply calls setpriority
on the
given pid, so we can use that.
# In a separate terminal
$ renice 10 $(pidof test)
# In the first terminal
tid 1399: Finished iteration in 8.50s
tid 1400: Finished iteration in 0.90s
Now notice that the spawned thread (tid 1400) is running almost twice as fast as it was, while the main thread (tid == pid == 1399) is running much slower.
This does not entirely prove my point. All I've shown so far is that renicing the main process lowers the priority of the main thread more than it lowers the priority of the other thread. But it's still possible that both threads' priorities were lowered.
This is a simple hypothesis to test: Just spawn a new process that spins the
CPU. If the spawned thread is actually running with niceness 0, it should
share roughly half of the CPU with the new process. On the other hand, if
renice
affected the spawned thread , we'd expect to see the new process get
more than half of the CPU.
# In a separate terminal
$ taskset 1 yes > /dev/null
# In the first terminal
tid 1399: Finished iteration in 16.95s
tid 1400: Finished iteration in 1.83s
Thread 1400's time to finish an iteration doubled, from 0.9s to 1.8s,
indicating that it's sharing the CPU equally with yes
. So we conclude that
the renice had no effect on the thread.
I tried a similar procedure on my Mac. Mac doesn't have any way to pin a process to a CPU, so I launched a bunch of programs each spinning the CPU and then reniced the main process. After renicing, the two threads continued to run at the same speed, just as things should work!
I take a few lessons away from this.
The setpriority(2)
and nice(2)
man pages on Linux should mention this
gotcha.
renice(1)
should probably do the right thing and modify the priority of
all of the process's threads, or at least it should give you that option.
If you're using setpriority
to modify a thread's priority, you are
setting its absolute priority on the system, not its priority relative to
its process's priority. Therefore you probably want to increase or decrease
the thread's priority relative to its initial value, instead of setting it
to an absolute value.
It's scary to think about how much of Linux was designed before multithreaded programs were common. You'd never implement things this way today.
Gabriele dug up a thread on LKML where it was suggested that setpriority
on a pid should modify all threads' priorities, but this was rejected because
the kernel maintainers were afraid this would break userspace programs which
rely on the current behavior.
I'm not so brave as to try to write new functions in glibc that do what we want here (e.g. modify the priority of all threads relative to the main thread's new priority), but at least we can fix NSPR. :)
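For reference, here's a minimal sketch (mine, not NSPR's actual code) of the per-thread building block: on Linux, passing a thread ID to setpriority with PRIO_PROCESS adjusts just that thread, so a thread can renice itself relative to its current value.
#include <errno.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

// Adjust the calling thread's niceness by `delta`, relative to its current value.
// Returns 0 on success, -1 on failure (with errno set).
int renice_current_thread(int delta) {
  pid_t tid = (pid_t) syscall(SYS_gettid);  // thread ID; equals the pid only for the main thread
  errno = 0;
  int current = getpriority(PRIO_PROCESS, tid);
  if (current == -1 && errno != 0)          // -1 can be a legitimate niceness, so check errno
    return -1;
  return setpriority(PRIO_PROCESS, tid, current + delta);
}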
If you're on a Mac and you ever use grep, do yourself a favor and do
$ port install grep
That's likely all you need to do to install GNU grep and make it the default, which on my machine is a ~10x speedup over the grep that ships with Mac OS.
To wit, here are some benchmarks on my git clone of mozilla-central, with a warm disk cache.
Before:
mozilla-central$ which grep
/usr/bin/grep
mozilla-central$ time git ls-files -z | xargs -0 grep foobar > /dev/null
real 0m19.583s
user 0m18.853s
sys 0m0.722s
After:
mozilla-central$ which grep
/opt/local/bin/grep
mozilla-central$ time git ls-files -z | xargs -0 grep foobar > /dev/null
real 0m1.386s
user 0m0.754s
sys 0m0.613s
Of course in this particular case I could use git grep
, which doesn't use the
system's grep. It's about the same speed as GNU grep:
mozilla-central$ time git grep 'foobar' > /dev/null
real 0m1.470s
user 0m0.956s
sys 0m1.845s
/usr/bin/grep -V
says that it's FreeBSD grep version 2.5.1. I don't know if
the grep FreeBSD actually ships is similarly slow.
I just found out today that the Metrics front-end folks implemented one of my oldest feature requests for telemetry evolution: You can now create "dashboards" with multiple plots and link to them.
I've created a dashboard with some memory-related plots I care about at http://bit.ly/memorytelemetry.
I'll try to keep that link up to date so you can bookmark it and always see an interesting set of graphs.
You can create your own dashboard by pulling up Telemetry Evolution, clicking the gear at the top right of the graph, and clicking "duplicate".
Once upon a time, I wanted to add telemetry for counting the number of zombie compartments present in an instance of Firefox.
The trouble is, it's hard to write code which can tell whether a compartment is a zombie. Our manual procedure for finding zombies involves closing most of your open tabs. But of course we can't close all your tabs every time we want to record the number of zombies for telemetry.
This is a story of the creature which solved that problem, and how we used it to verify a large bugfix.
My solution to the zombie-identification problem was to give up on it.
Instead, I realized that (almost?) all compartment leaks are due to leaks of DOM windows. So perhaps we could identify leaked windows instead of leaked compartments.
I quickly gave up on that, too. It's hard to tell for sure whether a window is leaked. For example, suppose a page does
window.my_popup = window.open('http://mozilla.org');
window.my_popup.close();
The mozilla.org window will now live until we close the top-level page (or until the page clears window.my_popup
).
This situation is difficult to distinguish from the case when Firefox chrome JS code (either baked into Firefox or from an add-on) holds a reference to the mozilla.org
window. But in the first case, it's (potentially) a bug in a page, while in the second case, it's a bug in Firefox. We need to distinguish between these cases in order to have a reasonable notion of "leaked window".
My next realization was that I could get somewhere by expecting less of myself. Enter ghost windows.
If my goal is to collect telemetry on the number of leaked windows, I don't have to identify every single leak with zero false positives. If I were instead to capture the "obvious" leaks with a relatively low false-positive rate, that would still be an interesting number.
Ghost windows are an attempt to capture "obviously" leaked windows. We say that a window is a ghost if it meets three criteria:
The window's docshell is null, which basically means it is not part of any tab.
All other windows from the same top-level origin also meet criterion (1) above. (That is, if you have a tab open to http://mozilla.org
, then a window from http://people.mozilla.org
will never be a ghost.)
The window has met the two criteria above for at least 60 seconds.
The experienced zombie hunter may notice that these are basically the same criteria that one uses when looking for zombie compartments by hand.
Criterion (3) is in place because windows don't get destroyed immediately once they become garbage; it takes potentially multiple runs of the garbage and cycle collectors to destroy a window, and we don't want to flag a window as a ghost until we believe it's survived a sufficient number of collection attempts.
Criterion (2) tries to exclude the case when a page keeps a window alive, as in the window.my_popup
example above. A page can keep a window from any origin alive, but my hope was that most of the time, pages would touch windows only from the same top-level origin. (For example, mail.google.com
might keep alive plus.google.com
, but hopefully not bing.com
.)
This is a crude heuristic, but it seems to work well enough: I get few false positives in practice.
Notice that criterion (2) means that a ghost window can appear and disappear as you browse. If you have a leaked nytimes.com
window, it will disappear as a ghost window as soon as you open a new tab to nytimes.com
. It's this behavior that suggested the name "ghost". Again, this is OK because the goal is just to capture the obviously leaked windows.
One final note: You can view the list of ghost windows in about:compartments. There, we waive the 60-second waiting period, because loading about:compartments triggers a set of GC/CCs which should collect all garbage windows.
So now that we have ghost windows, let's use them.
You may have heard that Kyle Huey landed a big add-on leak fix for Firefox 15. We believe that his patch eliminates the vast majority of window leaks due to add-ons, and in our experience, leaked windows are the most common and the most severe type of add-on leak.
You may also have heard that Kyle's fix caused new leaks with add-ons built against old versions of the Jetpack SDK, and that we later fixed this problem.
I wanted to verify whether these changes actually had the effect we expected, and thankfully, I'd landed telemetry for ghost windows a few weeks before, so we could observe how his patches changed this measure.
Here's the plot. (This is data from Firefox nightlies on Windows.)
There are three points of interest here:
4/26 — Kyle's first patch landed.
Despite the fanfare surrounding its landing, this was actually the smallest change of the three. There's little to no change in the mean number of ghost windows (the blue dots), but there is a drop in the 75th percentile (top of the black vertical lines).
5/4 — This was an unexpected drop. My best guess is that it was caused by bug 751561, because that's the only change in the changeset range which seems plausible.
We don't have a good explanation for why this change affected the number of ghost windows.
5/10 — Kyle's second patch landed, fixing the jetpack issues.
(If you want to look at the latest data, go here, sign in with any Mozilla Persona (formerly BrowserID) account, click "telemetry evolution", then select GHOST_WINDOWS in the drop-down.)
This data is surprising, and not just because of the drop on 5/4 that we can't explain. I expected a big drop after Kyle's first patch and a small drop after his second one, but we observed the opposite.
I don't know exactly what that means, but I'd guess either a lot of our nightly testers were using add-ons built with old versions of the Jetpack SDK, or the Jetpack fix also fixed other problems we weren't aware of.
I think we can conclude from this data that we're in a much better state now with respect to ghost windows than we were. This data is also evidence that counting ghost windows actually measures leaks in Firefox.
I think it's reasonable at this point to ask: Why bother with telemetry for ghost windows? Couldn't I — in fact, shouldn't I — have looked at actual memory usage, in the form of resident set size (RSS)?
Of course RSS is an important metric, but there are a few reasons ghost windows are a reasonable thing to look at:
RSS is noisy.
There may have been a drop in RSS numbers since we landed these fixes, but the data is noisy enough that I couldn't use the RSS data to prove that we fixed anything. In contrast, the trend in the ghost windows data is clear.
Leaked windows are harmful even when they don't perceptibly increase memory usage.
In particular, leaked windows can cause cycle collector times to increase, because as soon as a window's docshell becomes null, the cycle collector thinks it may be garbage and tries to collect it. In its vain effort to collect the leaked window, the CC traces every DOM node attached to the window, causing longer CC pauses.
The lesson here is that, for verifying fixes, specific measurements are often more useful than general ones. Since you probably want at least a week's worth of baseline data, verifying fixes can take some planning!
Ghost windows were developed before compartment-per-global. With CPG, it may be possible to remove ghost window criterion (2), which would be great.
We have one other add-on related bug on our plate at the moment. It turns out that our FUEL API never frees most of the objects it creates. And under some circumstances, it creates a lot of objects.
There are patches in the bug awaiting review; I hope we'll have this fixed for Firefox 15 or 16.
The fact that FUEL has been leaking for a long time (since 2007?) suggests to me that there may be other dark corners of Firefox which are similarly problematic.
But the bottom line is, we're making progress. I'm hopeful that many users with leaky add-ons will find Firefox 15 to be a noticeable improvement.