Programming to interface… without paying the cost

Programming to interface is a fairly good programming practice. It allows decoupling of classes, as stated in the Dependency Inversion Principle, as well as easy testing in context of Test-driven developpement. Stricly speaking, there is no interface in C++. Instead, we use inheritance from a base class with no data member and all its member functions marked as public, pure and virtual¹. But calling virtual functions doesn’t come for free, we need to pay the cost of : making an indirect call through the vtable, branch (mis)prediction, instruction cache miss, … Should we pay this cost just because we want to do things right ? As we will see in this post, in some cases we could get rid of this price using the power of Devirtualization.

Before getting into the technique of devirtualization, let’s see what’s wrong with the virtual function call with an exemple. Let’s say we need to model a computer with a pretty simple interface int compute(void). To prevent our computer user to depend on a particular implementation of the computer, we model our computer as an interface, so computer user only relies on the computer’s public API. Doing this eases testing of computer user by allowing us to provide to the computer user a mock implementation instead of the real computer implementation. Obviously, our software run the AllMightyComputer wich implements the Computer interface and of course just return 42. Bellow is the corresponding code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
class Computer {
public:
    virtual ~Computer() = default;
    
    virtual int compute(void) = 0;
};

class AllMightyComputer : public Computer {
public:
    virtual ~AllMightyComputer() = default;

    int compute(void) override { return 42; }
};

class UserOfComputer {
public:
    UserOfComputer(Computer & aComputer)
    : computer(aComputer)
    { }

    int run(void) { return computer.compute(); }

private:
    Computer & computer;
};

int main(void) {
    AllMightyComputer allMightyComputer;
    UserOfComputer user(allMightyComputer);
    auto result = user.run();

    std::cout << "Answer to life the universe and everything: " << result << std::endl;
    return 0;
}

Note: Don’t forget to declare interface’s destructor as virtual to be able to virtualy delete (that is through a pointer to interface) objects implementing this interface.

Looking at this code, it’s obvious that AllMightyComputer::run() will finaly be called and just return the constant 42. So we could reasonably think the compiler will optimize this and just call operator << with the constant 42… Or maybe not.

What’s wrong with virtual function call ?

So far so good, but what’s wrong with this code ? To see the problem, let’s look at the assembly code when we compile it without optimizations (godbolt). In particular, the code of UserOfComputer and the main function allocating all objets and injecting dependencies:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
AllMightyComputer::compute():
    movs    r0, #42
    bx      lr

UserOfComputer::UserOfComputer(Computer&):
    str     r1, [r0]
    bx      lr

UserOfComputer::run() [clone .isra.0]:
    ldr     r3, [r0]
    ldr     r3, [r3, #8]
    bx      r3

main:
    push    {r4, lr}
    sub     sp, sp, #8

    ; Allocate the AllMightyComputer object
    ; That is, storing the address of its vtable on the stack at offset #0
    ldr     r3, .L17
    str     r3, [sp]

    ; Call the UserOfComputer's constructor passing the reference to the Computer
    ; r0 = 'this' pointer of UserOfComputer object (on the stack at offset #4)
    ; r1 = address of vtable of AllMightyComputer
    mov     r1, sp
    add     r0, sp, #4
    bl      UserOfComputer::UserOfComputer(Computer&)

    ; Call UserOfComputer::run() and store result into r4
    ; r0 = this pointer of UserOfComputer object (on the stack at offset #4)
    ldr     r0, [sp, #4]
    bl      UserOfComputer::run() [clone .isra.0]
    mov     r4, r0

    ; Display the text message
    ldr     r1, .L17+4
    ldr     r0, .L17+8
    bl      std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)

    ; Display the result of call to UserOfComputer::run()
    mov     r1, r4
    bl      std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
    
    ; Display std::endl
    ldr     r1, .L17+12
    bl      std::basic_ostream<char, std::char_traits<char> >::operator<<(std::basic_ostream<char, std::char_traits<char> >& (*)(std::basic_ostream<char, std::char_traits<char> >&))
    
    ; Return 0
    movs    r0, #0
    add     sp, sp, #8
    pop     {r4, pc}

.LC0:
    .ascii  "Answer to life the universe and everything: \000"

.L17:
    .word   vtable for AllMightyComputer
    .word   .LC0
    .word   _ZSt4cout
    .word   _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_

vtable for UserOfComputer::run():
    ; Simplified vtable
    .word   _ZN17AllMightyComputerD1Ev
    .word   AllMightyComputer::~AllMightyComputer() [deleting destructor]
    .word   AllMightyComputer::compute()

Note: Following ARM 32bits calling conventions, this pointer is passed as first argument (register r0), the first three arguments of the method in registers r1 to r3 and the returned value put into r0.

The implementation of AllMightyComputer::compute() is straight forward, it just returns 42 (in register r0) as expected. From the constructor of UserOfComputer, we see that the reference to the computer (argument r1) - in fact, r1 contains the address of a pointer to the vtable for AllMightyComputer - is just stored at this address (argument r0). Then when UserOfComputer::run() is called, from this address (argument r0) we restore the address of the vtable for AllMightyComputer into register r3. From offset #8 of AllMightyComputer’s vtable, we load the address of AllMightyComputer::compute() method, again into r3, and directly branch to it. And finaly, in the main function we allocate space on stack to store the pointer to the vtable for AllMightyComputer and the UserOfComputer object. We call its constructor and then finaly UserOfComputer::run(). It’s a lot of memory access (just count the number of memory store str / load ldr instructions!) and pointers indirecting, just to return 42!

So what can we do to optimize it ?

Optimizing virtual calls

Fortunately, all recent compilers support an optimization technique named devirtualization. In cases like our example, the compiler knowns that the reference to Computer passed to UserOfComputer is an instance of AllMightyComputer and nothing else. So, using devirtualization hand in hand with inlining (replacing function call by there content) and constant propagation techniques, allows the compiler to see that it can replace all memory accesses and pointer indirection… with the constant 42!

For gcc and clang, the compiler flag used to activate it is -fdevirtualize and is included in the -O2 optimization level (it even seems to be included at -O1…). Let’s see the generated assembly with this optimization level (godbolt):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
main:
    push    {r4, lr}
    ldr     r4, .L7

    ; Display the text message
    movs    r2, #44
    ldr     r1, .L7+4
    mov     r0, r4
    bl      std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, int)

    ; Display 42
    movs    r1, #42
    mov     r0, r4
    bl      std::basic_ostream<char, std::char_traits<char> >::operator<<(int)

    ; Display std::endl
    bl      std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)
    
    ; Return 0
    movs    r0, #0
    pop     {r4, pc}

.LC0:
    .ascii  "Answer to life the universe and everything: \000"

.L7:
    .word   _ZSt4cout
    .word   .LC0

Now, we have an assembly that look like what we expected at first sight! The compiler has generated (lines 12 to 14) a simple call to operator << with the constant value 42. And that’s all!

This was a really simple example and compiler did a very good job without developer hint. But in real code, this may not work as easily.

Helping the compiler

In this basic example, it was quite easy for the compiler to devirtualize the call to the virtual method. In real world code, the challenge could be lot more harder. To help the compiler get the most out of your code, C++11 introduce a new keyword named final. Virtual function could be marked as final in order to tell the compiler that this function will not be overwritten by derived classes. Here is what it look likes for our example:

1
2
3
4
5
6
class AllMightyComputer : public Computer {
public:
    virtual ~AllMightyComputer() = default;

    int compute(void) override final { return 42; }
};

Furthermore, the final keyword could be applied to the whole class instead of virtual functions. In this case, the class as a whole can’t be derived, she’s sealed. It could be used like this:

1
2
3
4
5
6
class AllMightyComputer final : public Computer {
public:
    virtual ~AllMightyComputer() = default;

    int compute(void) override { return 42; }
};

Using the final keyword could help the compiler to apply devirtualization in situation where this is not obvious.

Optimizing across translation units

In this example, all of the classes were put into the same source file. Also, all functions are defined inline. Again, in real world code, this is not the case. Good practices suggest to put separate classes into separate files (a particular case of the Single Responsibility Principle: one file = one class = one responsibility), as well as to separate declaration (in header file) from definition (in cpp file). Doing this could prevent the compiler to apply optimizations like devirtualization, inlining and so accross the whole program.

To circumvent this problem, compilers implements a technique called Link Time Optimization which allows trying to apply above optimizations at link time, that is to say for the whole program. Under gcc/clang, the option is named -flto and could be used like this:

1
2
3
4
g++ -c -O2 -flto AllMightyComputer.cpp
g++ -c -O2 -flto UserOfComputer.cpp
g++ -c -O2 -flto main.cpp
g++ -o prog -O2 -flto AllMightyComputer.o UserOfComputer.o main.o

Notice that, even at link time (line 4), g++ should be used instead of the classical ld. Also, when linking a static library built with LTO option, add the flag -fuse-linker-plugin (gcc only) to tell gcc to include the library into the global optimisations.

1
g++ -o prog -O2 -flto -fuse-linker-plugin AllMightyComputer.o UserOfComputer.o main.o libOptimizedWithLTO.a

Link Time Optimization is an heavy process, memory and time consuming. To mitigate this, don’t hesitate to look at compiler documentation to activate options that speedup link time: paralellizing link, using thin LTO, …

Conclusion

Good coders apply good pratices: Programming to interface (or Dependency Inversion Principle), putting distinct classes into distinct files, separating declaration from definition, … But in C++, this good practices may lead to performance penalties, even more in embedded code. Fortunately, today’s compilers are able to mitigate this penalty or even remove it completely using techniques such as inlining, devirtualization or link time optimization.

Help your compiler by giving it optimization options (-O2, -O3, -flto, …) and code hints (final keyword) ; so that in return, your compiler will help you to be a good coder.

The Performance Benefits of Final Classes by Sy Brand
C++ Devirtualization by Vlad Lazarenko
The power of devirtualization by Marco Foco
Devirtualization in C++ series by Honza Hubička
Virtual, final and override in C++ by Jonathan Boccara

Others techniques may be used to implement interfaces in C++ (template, …), but there are out of scope of this article. ↩

Programming to interface… without paying the cost

Published on Jan 13, 2021 |
C++

Devirtualization

Dependency Inversion Principle

Programming to interface

Test Driven Development

What’s wrong with virtual function call ?

Optimizing virtual calls

Helping the compiler

Optimizing across translation units

Conclusion

Share this post!

Programming to interface… without paying the cost

Published on Jan 13, 2021 | C++ Devirtualization Dependency Inversion Principle Programming to interface Test Driven Development

What’s wrong with virtual function call ?

Optimizing virtual calls

Helping the compiler

Optimizing across translation units

Conclusion

Related articles

Share this post!

Published on Jan 13, 2021 |
C++

Devirtualization

Dependency Inversion Principle

Programming to interface

Test Driven Development