Interface Dispatch

2018-07-22, 13 min read, by Lukas Atkinson. Tags: Object-Oriented Programming, Java, C++

Virtual method calls are simple: you just look up the method slot in a vtable and call the function pointer. Easy! Well, not quite: interfaces present a kind of multiple inheritance, and things quickly become complicated.

This post discusses interface method calls in C++ (GCC), Java (OpenJDK/HotSpot), C# (CLR), Go, and Rust.

It is an expanded version of my answer on Software Engineering Stack Exchange on Implementation of pure abstract classes and interfaces.

Disclaimer: Undefined Behavior and optimizations.

This article discusses PL concepts. The presented approaches are typical of language implementations, but not mandated by the language specifications. Clever optimizations might be able to remove some of the machinery that will be discussed here (in particular in the case of C#).

Vtables.

For a refresher on virtual methods and why they are important please read my article Dynamic vs. Static Dispatch.

In short, one of the important features of OOP is that a virtual method call doesn't depend on the static type of a variable, but on the dynamic type of the object.

void call_the_method(StaticType& object) {
  // call target unknown at compile time
  object.method(42);
}

// will call DynamicType::method() at run time
DynamicType object {};
call_the_method(object);

This requires that the information about the method implementations is stored within the object. This is typically done with a VTable (virtual method table). This is essentially a struct of function pointers.

If we desugar the above C++ code to C, the method call would actually look like this:

void call_the_method(StaticType* object) {
  object->vtable->StaticType_method(object, 42);
}

(Note that the object pointer is passed as an implicit this argument.)

Equivalently to a struct of function pointers, we can think of an array of function pointer slots, where we have to know the slot index at compile time:

void call_the_method(StaticType* object) {
  object->vtable[SLOT_StaticType_method](object, 42);
}

The object then points to a vtable that is filled with the correct method implementations for its dynamic type.

Typically the vtable pointer is directly at the start of the object, so if we ignore the type system and UB (as the compiler can do) we get something like:

void call_the_method(StaticType* object) {
    (*(SLOT**)object)[SLOT_StaticType_method](object, 42);
}

[Figure: pointer diagram of an object whose first field points to the vtable of its dynamic type.]

All subclasses like DynamicType must still conform to the object and vtable layout of their base classes like StaticType. This is important for abstraction: the call site of a method can be compiled before the dynamic type even exists!
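The desugaring described above can be written out as a small hand-rolled vtable. This is only a sketch of the idea, not what any particular compiler emits; the names Animal, Dog, AnimalVTable, make_dog and call_speak are made up for illustration.

```cpp
#include <cassert>

struct Animal;

// The vtable: a struct of function pointers, one slot per virtual method.
struct AnimalVTable {
  int (*legs)(const Animal* self);
  int (*speak)(const Animal* self, int times);
};

// Base layout: the vtable pointer sits at the very start of the object.
struct Animal {
  const AnimalVTable* vtable;
};

// A "derived" type: it starts with the base layout, then adds its own fields.
struct Dog {
  Animal base;
  int volume;
};

static int dog_legs(const Animal*) { return 4; }

static int dog_speak(const Animal* self, int times) {
  // Valid because every Dog begins with its Animal part.
  const Dog* dog = reinterpret_cast<const Dog*>(self);
  return dog->volume * times;
}

// One shared, immutable vtable per dynamic type.
static const AnimalVTable dog_vtable = {dog_legs, dog_speak};

Dog make_dog(int volume) {
  Dog d;
  d.base.vtable = &dog_vtable;
  d.volume = volume;
  return d;
}

// The call site only knows the static type Animal. It passes the object
// pointer along as the implicit "this" argument.
int call_speak(const Animal* object, int times) {
  return object->vtable->speak(object, times);
}
```

Note that call_speak could be compiled before Dog exists: it depends only on the Animal and AnimalVTable layouts.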

Multiple base classes? Multiple VTables.

With multiple inheritance, we have multiple base classes with which we have to stay compatible. And every base class layout expects that the vtable pointer is right at the start of the object. Of course they can't all be at the start! Instead, we put the extra vtable pointers in the middle. As a consequence, upcasting to a different base type means that we have to adjust the object pointer.

Let's consider this C++ class hierarchy:

class FirstBase {
  int a;
public:
  virtual int first_method();
};

class SecondBase {
  int b;
public:
  virtual int second_method();
};

class Derived : public FirstBase, public SecondBase {
  int c;
public:
  int first_method() override;
  int second_method() override;
};

By running g++ -fdump-class-hierarchy (the flag was renamed to -fdump-lang-class in GCC 8 and later) we get this cryptic output:

Vtable for FirstBase
FirstBase::_ZTV9FirstBase: 3 entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI9FirstBase)
16    (int (*)(...))FirstBase::first_method

Class FirstBase
   size=16 align=8
   base size=12 base align=8
FirstBase (0x0x7fb98760c960) 0
    vptr=((& FirstBase::_ZTV9FirstBase) + 16)

Vtable for SecondBase
SecondBase::_ZTV10SecondBase: 3 entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI10SecondBase)
16    (int (*)(...))SecondBase::second_method

Class SecondBase
   size=16 align=8
   base size=12 base align=8
SecondBase (0x0x7fb98760c9c0) 0
    vptr=((& SecondBase::_ZTV10SecondBase) + 16)

Vtable for Derived
Derived::_ZTV7Derived: 6 entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI7Derived)
16    (int (*)(...))Derived::first_method
24    (int (*)(...))-16
32    (int (*)(...))(& _ZTI7Derived)
40    (int (*)(...))SecondBase::second_method

Class Derived
   size=32 align=8
   base size=28 base align=8
Derived (0x0x7fb9874c82a0) 0
    vptr=((& Derived::_ZTV7Derived) + 16)
  FirstBase (0x0x7fb98760ca20) 0
      primary-for Derived (0x0x7fb9874c82a0)
  SecondBase (0x0x7fb98760ca80) 16
      vptr=((& Derived::_ZTV7Derived) + 40)

This shows the class layouts and vtable layouts with all the offsets. Note that the vtables don't just contain method slots, but two additional entries:

One entry (the & _ZTI… slot) points to the type information for the class. This allows the dynamic type of an object to be determined at runtime (RTTI).
Another entry describes the offset from the upcast this pointer to the original this pointer (0 for the primary vtable, -16 for the SecondBase part of Derived). It is necessary to fix up the object pointer when calling a method through the SecondBase if that method was implemented in the Derived class.

[Figure: C++ style multiple inheritance, drawn as a pointer diagram using a SecondBase static type. Legend: V: vtable pointer, O: object offset, T: runtime type information (RTTI), S: method slot.]
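The pointer adjustment can be observed directly from C++. The sketch below uses the same hierarchy with method bodies and virtual destructors added so that it compiles on its own; the helper names second_base_offset and most_derived_address are made up for illustration. The exact offset is ABI-specific (16 bytes in the GCC dump above), so the code only relies on it being nonzero.

```cpp
#include <cassert>
#include <cstddef>

struct FirstBase {
  int a = 1;
  virtual int first_method() { return a; }
  virtual ~FirstBase() = default;
};

struct SecondBase {
  int b = 2;
  virtual int second_method() { return b; }
  virtual ~SecondBase() = default;
};

struct Derived : FirstBase, SecondBase {
  int c = 3;
  int first_method() override { return a + c; }
  int second_method() override { return b + c; }
};

// Byte distance between the start of the whole object and its SecondBase
// part. The static_cast is the upcast that adjusts the pointer.
std::ptrdiff_t second_base_offset(Derived& d) {
  const char* whole = reinterpret_cast<const char*>(&d);
  const char* part =
      reinterpret_cast<const char*>(static_cast<SecondBase*>(&d));
  return part - whole;
}

// Recover the address of the complete object from a base pointer.
// dynamic_cast<void*> is standard C++; implementations typically answer it
// using the offset entry stored in the vtable.
void* most_derived_address(SecondBase* p) {
  return dynamic_cast<void*>(p);
}
```

Calling second_method through a SecondBase* still reaches Derived::second_method with a correctly adjusted this pointer, which is the fix-up described in the list above.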

How does this solution score?

The vtable is part of the object, so a simple pointer to the object is sufficient to pass objects around.
However, this doesn't scale well with many bases/interfaces: Each additional base class requires us to add another vtable to the object.
Upcasting is not free, and may change the object pointer. This can require some fixup later.
Virtual method calls require three layers of pointer indirection and no branches.
All of this involves a small, constant, but unavoidable overhead.

Java's itables.

When Java was designed they looked at C++ and went “nah, this is too complicated” and made a single-inheritance language. But complexity can't entirely disappear, it can only be shuffled around. Supporting interface inheritance is still a kind of multiple inheritance. So how was this solved?

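Each Java object has an object header which includes a pointer to its class. This class structure holds the vtable that is used during normal virtual method calls. When we upcast an object reference to an interface type no changes are made to the reference. And because unrelated classes can implement the same interface, an interface method has no single slot index that is valid in every implementing class's vtable. So the JVM cannot use plain vtable dispatch for interface calls.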

Virtual method calls only need the method slot index of the vtable.

Interface calls need the interface ID and the method slot index for the interface's vtable, which is called an itable.

So how do we find the correct itable? In HotSpot, the runtime literally searches through an array of itables until one is found that matches the interface ID (source: interpreter, x86 assembler). As pseudocode:

// Dispatch SomeInterface.method
Method const* resolve_method(
    Object const* instance,
    Klass const* interface,
    uint itable_slot)
{
  Klass const* klass = instance->klass;

  for (Itable const* itable : klass->itables()) {
    if (itable->klass() == interface)
      return (*itable)[itable_slot];  // index into the matching itable
  }

  throw ...;  // class does not implement required interface
}

So a call site would be compiled like:

// SomeInterface object = ...;
// object.method(42);
Method const* m = resolve_method(
    object, typeid(SomeInterface), SLOT_SomeInterface_method);
m(object, 42);
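To make the search concrete, here is a compilable toy model of the lookup. It is a sketch only: Klass, Itable and Object are simplified stand-ins rather than HotSpot's real structures, interfaces are identified by small integers, and the Counter class with its two methods is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Object;
using Method = int (*)(const Object* self, int arg);

struct Itable {
  int interface_id;             // which interface this table implements
  std::vector<Method> methods;  // one entry per interface method slot
};

struct Klass {
  std::vector<Itable> itables;  // one itable per implemented interface
};

struct Object {
  const Klass* klass;  // object header: pointer to the class
  int value;
};

// Linear search over the class's itables until the interface ID matches.
Method resolve_method(const Object* instance, int interface_id,
                      std::size_t slot) {
  for (const Itable& itable : instance->klass->itables) {
    if (itable.interface_id == interface_id) return itable.methods.at(slot);
  }
  throw std::runtime_error("class does not implement required interface");
}

// Two hypothetical interfaces and a class implementing both.
const int COMPARABLE = 1;
const int PRINTABLE = 2;

int counter_compare(const Object* self, int other) { return self->value - other; }
int counter_print(const Object* self, int width) { return self->value * width; }

Klass make_counter_klass() {
  Klass k;
  k.itables.push_back(Itable{COMPARABLE, {counter_compare}});
  k.itables.push_back(Itable{PRINTABLE, {counter_print}});
  return k;
}
```

The cost of resolve_method grows with the number of interfaces the class implements, since PRINTABLE is only found after COMPARABLE has been checked and rejected.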

A pointer diagram of this structure is a bit unwieldy (figure not reproduced here).

Inline caching.

Of course running all that dispatch machinery takes much longer than a direct vtable lookup. In practice this doesn't matter because the call site can remember the lookup result. Most of the time, all objects at a call site will have the same dynamic type. And because each object has a single pointer to the class we can make very cheap comparisons for the dynamic type.

In the simplest case, we assume that this call will have the same target as the previous call at that location. We can then write an optimized lucky path, and fall back to the expensive resolver when our assumption is wrong:

// SomeInterface object = ...;
// object.method(42);

static Klass const* cached_type = nullptr;
static Method const* cached_method = nullptr;

// guard condition
if (LIKELY(object->klass == cached_type)) {
  // call the cached method
  cached_method(object, 42);
}
else {
  // patch this call site with the resolved type
  cached_type = object->klass;
  cached_method = resolve_method(
      object, typeid(SomeInterface), SLOT_SomeInterface_method);
  // call the patched method
  cached_method(object, 42);
}

Here I've shown this with static variables, but usually a JIT compiler can hard-code the cached type and cached method pointers into the machine code. However, recompiling the call site may be expensive so there will usually be a counter for cache misses before the call site is patched.

The Wikipedia article on Inline caching discusses this and related techniques in more detail.
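The monomorphic cache can also be modelled as a small runnable sketch. Everything here is a simplification under stated assumptions: each Klass holds a single method, Resolver stands in for the expensive itable search and merely counts how often it runs, and CallSite holds the per-call-site cache that a JIT would embed in machine code. All names are hypothetical.

```cpp
#include <cassert>

struct Object;
using Method = int (*)(const Object* self, int arg);

struct Klass {
  Method method;
};

struct Object {
  const Klass* klass;
  int value;
};

// Stand-in for the slow lookup; counts how many times it is needed.
struct Resolver {
  int slow_lookups = 0;

  Method resolve(const Object* object) {
    ++slow_lookups;
    return object->klass->method;
  }
};

// One cache per call site.
struct CallSite {
  const Klass* cached_type = nullptr;
  Method cached_method = nullptr;

  int invoke(Resolver& resolver, const Object* object, int arg) {
    if (object->klass != cached_type) {
      // Guard failed: resolve the method and re-patch the call site.
      cached_method = resolver.resolve(object);
      cached_type = object->klass;
    }
    // Guard passed (or cache refreshed): call through the cached pointer.
    return cached_method(object, arg);
  }
};

int add_value(const Object* self, int arg) { return self->value + arg; }
int mul_value(const Object* self, int arg) { return self->value * arg; }
```

As long as a call site keeps seeing objects of one class, only the first call pays for the lookup. A call with an object of a different class triggers exactly one more lookup and replaces the cached entry.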

How does this itable-based dispatch with inline caching score? We have to distinguish between worst case behaviour (on cache misses and during VM warmup), and typical behaviour.

Typically, we need one branch and two pointer indirections. Assuming good branch prediction, this is better than vtable dispatch.
Worst case, we have an unbounded number of pointer indirections and branches. More precisely, they scale linearly with the number of implemented interfaces of the dynamic type. At a minimum, there will be five levels of pointer indirection. Because the dispatch effort is unbounded, interface calls outside of carefully controlled class hierarchies might be unsuitable in real-time systems.

C#'s slotmaps.

C# generally follows the same approach as Java, but tries to perform more aggressive optimizations. For example, interface dispatch in C# should be understood primarily through inline caching (which the CLR calls Virtual Stub Dispatch). Finding the method in the vtable is only a strategy of last resort, and typically only happens during VM warmup.

C# has one very important difference from Java: how generics are implemented. In Java, type parameters are erased at run time. An ArrayList<Integer> is the same as an ArrayList<?> at run time. Not so in C#: for each set of type parameters, the class structure is copied and sometimes specialized for those parameters. A List<T> and a List<int> are distinct types. The relationship between a specialized class and generic class is similar to inheritance. The result is that C# has a lot more class structures in memory.

The C# developers noted that Java-style itables and itable arrays would add up to a lot of storage, and looked for ways to compress them. In Java, each class that overrides an interface method needs to have its own itable copy. The CLR avoids this through more complicated vtable structures.

Instead of using ordinary itables that are vtables for interfaces, the CLR uses the concept of slot maps. These don't hold a method pointer but a vtable slot index. The advantage is that the slot map doesn't have to be rewritten when a method is overridden (or specialized for generics) – it's sufficient to update the vtable. This also makes it more feasible to replace a method pointer with an optimized version, as only the vtable has to be updated.

The pseudocode for slot map based dispatch would look like this:

Method const* resolve_method(
    Object const* instance,
    Klass const* interface,
    uint interface_slot)
{
  Klass const* klass = instance->klass;

  // Walk all base classes to find slot map
  for (Klass const* base = klass; base != nullptr; base = base->base()) {
    for (SlotMap const* slot_map : base->slot_maps()) {
      if (slot_map->klass() == interface) {
        // Note the extra slot index indirection
        uint vtable_slot = (*slot_map)[interface_slot];
        return klass->vtable[vtable_slot];
      }
    }
  }

  throw ...;  // class does not implement required interface
}

I won't even attempt the corresponding pointer diagram :)
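In place of a diagram, a toy model may help show why the extra index indirection pays off. This is a sketch and not the CLR's actual MethodTable layout: Klass, SlotMap, the SHAPE interface ID and the two area functions are all invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Object;
using Method = int (*)(const Object* self);

struct SlotMap {
  int interface_id;
  std::vector<std::size_t> vtable_slots;  // interface slot -> vtable slot
};

struct Klass {
  const Klass* base;               // nullptr for a root class
  std::vector<Method> vtable;      // full vtable, including inherited slots
  std::vector<SlotMap> slot_maps;  // only the maps this class declares itself
};

struct Object {
  const Klass* klass;
};

Method resolve_method(const Object* instance, int interface_id,
                      std::size_t interface_slot) {
  const Klass* klass = instance->klass;
  // Walk up the base classes until one declares a slot map for the interface.
  for (const Klass* k = klass; k != nullptr; k = k->base) {
    for (const SlotMap& map : k->slot_maps) {
      if (map.interface_id == interface_id) {
        // The slot map may belong to a base class, but the method pointer
        // is always read from the vtable of the dynamic type.
        return klass->vtable.at(map.vtable_slots.at(interface_slot));
      }
    }
  }
  throw std::runtime_error("class does not implement required interface");
}

const int SHAPE = 1;  // hypothetical interface with one method

int base_area(const Object*) { return 1; }
int derived_area(const Object*) { return 2; }

// The base class implements SHAPE; its method lives in vtable slot 0.
Klass make_base() {
  Klass k;
  k.base = nullptr;
  k.vtable.push_back(base_area);
  k.slot_maps.push_back(SlotMap{SHAPE, {std::size_t{0}}});
  return k;
}

// The derived class overrides the method by replacing one vtable entry.
// It needs no slot map of its own.
Klass make_derived(const Klass* base) {
  Klass k;
  k.base = base;
  k.vtable = base->vtable;
  k.vtable[0] = derived_area;
  return k;
}
```

The derived class reuses the slot map of its base unchanged, which is the storage saving compared to copying a full itable for every overriding class.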

And as usual, the reality is way more complex than this description. If you are interested in more details, please read:

CoreCLR (2006): Virtual Stub Dispatch. In: Book Of The Runtime. Discusses slot maps and caching to avoid expensive lookups. A lot of the corresponding code is in src/vm/methodtable.h and related files.
Pobar, Neward (2009): SSCLI 2.0 Internals. Chapter 5 of the book discusses slot maps in great detail. Was never published but made available by the authors on their blogs. The PDF link has since moved. This book probably no longer reflects the current state of the CLR.
Kennedy, Syme (2001): Design and Implementation of Generics for the .NET Common Language Runtime. Discusses various approaches to implement generics. Generics interact with method dispatch because methods might be specialized so vtables might have to be rewritten.

Fat pointers.

There has to be an interface vtable somewhere. C++ sticks it in the object, Java into the class. But there's a third option: we can store the vtable pointer right next to the object pointer, so that every interface reference becomes a two-word "fat pointer".

The vtable in the fat pointer always contains the appropriate vtable layout for the current static type. This has a number of interesting consequences. For example, virtual calls now require one fewer pointer dereference. Interface calls are now efficient enough that we do not need inline caching to get bearable performance (though it might still be helpful).

A game-changing advantage is that interfaces can also be implemented externally from the object. The interfaces don't have to be known up front when a class is implemented. As the object layout is unaffected, we can even implement interfaces for POD types and primitive types such as unboxed integers! (Casting to an interface would effectively box the primitive type.) And in principle, it would be possible to provide multiple differing implementations for each type/interface combination, with one implementation being selected at the site of an upcast.
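A fat pointer with externally implemented interfaces can be sketched in plain C++. This is one possible encoding and not how Go or Rust lay things out internally; the Describe interface, DescribeRef, DescribeImpl, Point and as_describe are all hypothetical names.

```cpp
#include <cassert>

// Vtable for a hypothetical "Describe" interface with a single method.
struct DescribeVTable {
  int (*describe)(const void* self);
};

// The fat pointer: data pointer and vtable pointer travel together.
struct DescribeRef {
  const void* data;
  const DescribeVTable* vtable;

  int describe() const { return vtable->describe(data); }
};

// Implementations live outside the implementing types.
template <typename T>
struct DescribeImpl;  // specialised once per implementing type

template <>
struct DescribeImpl<int> {
  static int describe(const void* self) {
    return *static_cast<const int*>(self);
  }
};

struct Point {  // plain data, no vtable pointer inside
  int x;
  int y;
};

template <>
struct DescribeImpl<Point> {
  static int describe(const void* self) {
    const Point* p = static_cast<const Point*>(self);
    return p->x + p->y;
  }
};

// The "upcast": the concrete type is statically known here, so the right
// vtable is selected at compile time.
template <typename T>
DescribeRef as_describe(const T& value) {
  static const DescribeVTable vtable = {&DescribeImpl<T>::describe};
  return DescribeRef{&value, &vtable};
}

// The call site sees only the interface type.
int call_describe(DescribeRef object) { return object.describe(); }
```

Neither int nor Point changes its layout to take part in dynamic dispatch, and a second implementation could be offered simply by building a different vtable at the cast site.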

So it's no surprise that this strategy is used e.g. for Go interfaces, for Rust dyn Traits, and to some degree in Haskell. The ability to implement interfaces externally corresponds well to Haskell's typeclasses.

If this is so brilliant, why isn't everyone using it? There are three general problems: fat pointers aren't normal pointers, upcasting can be expensive, and this can complicate a compiler.

Fat pointer size overhead.

A fat pointer needs a vtable pointer in addition to the data pointer, and is therefore twice as big. To move an object reference around we now need more than one register. To store an object reference we need two machine words, so e.g. an array that contains interface references is now twice as big. But many elements in that array are likely to have the same concrete type, so the same vtable pointer is stored over and over, wasting a lot of space for little gain. Under this assumption, approaches as in C# or Java can be a lot more memory-efficient and therefore also more cache-efficient.
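The size trade-off is easy to check with sizeof. The sketch below assumes a mainstream platform where all object pointers have the same size; the Shape, Square, PlainSquare and FatShapeRef types are invented for the comparison.

```cpp
#include <cassert>
#include <cstddef>

// Thin-pointer style: the vtable pointer lives inside every object.
struct Shape {
  virtual int area() const = 0;
  virtual ~Shape() = default;
};

struct Square : Shape {
  int side = 2;
  int area() const override { return side * side; }
};

// Fat-pointer style: the object is plain data and the reference carries
// the vtable pointer.
struct PlainSquare {
  int side;
};

struct ShapeVTable {
  int (*area)(const void* self);
};

struct FatShapeRef {
  const void* data;
  const ShapeVTable* vtable;
};

// Bytes needed for an array of n references in each style.
constexpr std::size_t thin_array_bytes(std::size_t n) {
  return n * sizeof(Shape*);
}

constexpr std::size_t fat_array_bytes(std::size_t n) {
  return n * sizeof(FatShapeRef);
}
```

The array of fat references is twice as large, while each individual object is smaller because it no longer embeds a vtable pointer. Which style uses less memory overall therefore depends on how many references exist per object.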

Fat pointer conversion overhead.

When a concrete type is upcast to an interface type, this is no longer a free operation (in Java this is a no-op, in C++ this is at most a pointer offset). Instead we need to find the correct vtable. So part of the cost of dynamic dispatch has been moved from the call site to the cast site. This is not usually a big problem because the concrete type is statically known at the cast site so the correct vtable can be found at compile time.

Things are a bit more interesting when casting between interfaces (e.g. in Go) or when combining multiple interfaces/traits/typeclasses. There is no elegant solution.

Rust currently disallows trait combinations in trait objects (apart from auto traits such as Send and Sync, which can be added with +). As a workaround, you'd have to introduce adapter types that wrap the object, or add a new trait that has the desired traits as supertraits.

Go calls back into the runtime library for interface casts, which looks up the itable in a global hash table, and creates a new itable if none was found in the hash table. To construct a new itable the methods are looked up by name in the concrete type's global vtable. This is quite flexible but implies a horrendous runtime cost, even on the “happy path” where the itable already exists. As soon as interfaces are involved, Go is more like Python than C or C++. (Note: but see Russ Cox's comment below which clarifies that interface conversion overhead is negligible in practice.)
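The lookup-or-build scheme described for Go can be modelled in a few lines. This is a rough sketch and not the Go runtime's code: TypeInfo, InterfaceInfo and ItableCache are hypothetical, types and interfaces are identified by small integer IDs, and the cache key simply packs the two IDs into one 64-bit value.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Method = int (*)(const void* self);

// Per concrete type: all methods, addressable by name.
struct TypeInfo {
  std::uint32_t id;
  std::map<std::string, Method> methods;
};

// Per interface: the ordered list of required method names.
struct InterfaceInfo {
  std::uint32_t id;
  std::vector<std::string> method_names;
};

using Itable = std::vector<Method>;  // one slot per interface method

class ItableCache {
 public:
  int tables_built = 0;

  const Itable& get(const TypeInfo& type, const InterfaceInfo& iface) {
    const std::uint64_t key = (std::uint64_t{type.id} << 32) | iface.id;
    auto found = cache_.find(key);
    if (found != cache_.end()) return found->second;  // happy path

    // First conversion for this pair: look up every method by name.
    Itable table;
    for (const std::string& name : iface.method_names) {
      auto method = type.methods.find(name);
      if (method == type.methods.end())
        throw std::runtime_error("missing method: " + name);
      table.push_back(method->second);
    }
    ++tables_built;
    return cache_.emplace(key, std::move(table)).first->second;
  }

 private:
  std::unordered_map<std::uint64_t, Itable> cache_;
};

// Hypothetical methods of a hypothetical File type.
int file_read(const void*) { return 1; }
int file_close(const void*) { return 2; }
```

Only the first conversion for a given (type, interface) pair does the name-based work, which is linear in the number of interface methods. Every later conversion is a single hash lookup.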

Compiler implementation considerations.

Finally, fat pointers can complicate the design of a compiler.

First, there's the ABI concern that interface types (objects) and normal pointers have different sizes. This is usually not a problem, but it means, for example, that this strategy is not suitable for C++.

Due to the pointer size concerns, we might want to use normal vtable dispatch as in C++/Java for class inheritance and fat pointers for interface inheritance. Now the compiler needs to manage two completely independent types of objects. However, this is already the case in Java as interface calls use a different VM opcode. (In fact the Java specification explicitly anticipates implementations using fat pointers.)

Another concern is how reflection and run time type information can be managed, if the object no longer contains a vtable that can be used to identify its type. It's correct that this no longer allows us to take an arbitrary pointer and find out which type of object it references. But that also shouldn't be necessary: either we have a simple pointer to a statically known type, or we have a fat pointer. The vtable of the fat pointer can contain additional information, such as a reference to the concrete type. This can be used for downcasting back from an interface type to a concrete type.
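A minimal sketch of that idea: the vtable carries a per-type identity, and a downcast compares identities before handing back a typed pointer. The names AnyVTable, AnyRef, vtable_for, make_ref and downcast are hypothetical, and a real vtable would also hold method slots.

```cpp
#include <cassert>

// The vtable holds a type identity in addition to any method slots
// (method slots are omitted here to keep the sketch short).
struct AnyVTable {
  const void* type_id;  // unique address per concrete type
};

// The fat pointer.
struct AnyRef {
  void* data;
  const AnyVTable* vtable;
};

// One vtable per concrete type T. The address of the function-local
// static `tag` serves as that type's identity.
template <typename T>
const AnyVTable* vtable_for() {
  static char tag;
  static const AnyVTable vtable = {&tag};
  return &vtable;
}

// Upcast: the concrete type is known statically at this point.
template <typename T>
AnyRef make_ref(T& value) {
  return AnyRef{&value, vtable_for<T>()};
}

// Downcast: succeeds only if the fat pointer was created from a T.
template <typename T>
T* downcast(AnyRef ref) {
  if (ref.vtable->type_id != vtable_for<T>()->type_id) return nullptr;
  return static_cast<T*>(ref.data);
}
```

The data pointer alone says nothing about the type it refers to. All type information travels in the second word of the fat pointer.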

Conclusion.

Interfaces are a kind of multiple inheritance, and we have looked at various ways to resolve interface calls: storing multiple vtables per object as in C++, storing itables in the class structure as in Java or C#, or storing itables as part of a fat pointer as in Go or Rust.

None of these solutions is inherently better than the others. They all add some runtime overhead (space or time), and they all have consequences for the language semantics that can be expressed. In some cases the overhead can be amortized through caching.

A whole class of languages has been ignored here: those that do method lookup by name, e.g. Python. That relaxes the constraints on vtables because named slots don't have to stay in a baseclass-compatible order. However, name-based dispatch plays in a different league performance-wise and is of no interest in the context of vtable-based dispatch.

This article has also been discussed on Hacker News and r/programming.


Comments


Thanks for writing this up, it's super helpful! A couple of random observations:

An interesting difference between Java and C++ is that in Java you can treat List<Derived> as List<Base>, and this is possible, because casting Derived* to Base* is a no-op. In C++, such casts are an explicit coercion, so you can't treat vector<Derived*> as vector<Base*>. I wonder if it is possible to say that C++ lacks true subtyping?

I am not sure that "fat pointer size overhead" section is entirely correct. For example, in this case:
> To store an object reference we need two machine words, so e.g. an array that contains interface references is now twice as big.
the total size would be the same? With fat pointers, the array is bigger, but with thin pointers, the objects themselves are bigger. In any case, there's only one vtable per class, and one vtable ptr per object, so total is the same. I think memory usage of fat pointers can be both greater and smaller than that of thin pointers. It will be greater if there is a lot of aliasing: if you have many references to the single object, then each reference carries the vtable. It will be smaller if you have many concrete objects, but work via interface with only one of them. For example, we can have an array of Spheres (concrete types), and pass each one to fn render(obj: &dyn Render). In this setup, there's only one vtable pointer for all objects. Of course, passing fat-pointers around is costlier, because they are twice as big.

Finally, I'd like to add another "compiler implementation consideration". Loads and stores of fat pointers are not atomic on mainstream platforms, so you can observe object tearing. I heard that it was/is possible to segfault Go due to this reason (haven't actually tried this myself, so I might be wrong). In contrast, in Java data races are always benign, you might not get the pointer you want, but at least its static and dynamic types would agree.

Regarding Java vs C++, the issue is not that C++ had a bad type system but rather that templates and generics are totally different. In C++, vector<Base*> and vector<Derived*> are unrelated types. In Java, List<Base> and List<Derived> aren't interchangeable either, but we can express types such as List<? super Base> that allow for type argument variance.

Regarding fat pointer size overhead, the idea is that objects in the OOP sense are always polymorphic (Rust lingo: trait objects). There are no C++ or Rust-style plain values. That means that every object has at least one reference. Per-object overhead is then equal to or greater than HotSpot/CLR type implementations, but possibly better than GCC style implementations that can have multiple vtable pointers per object. Of course, part of the beauty of the fat pointer approach is that plain values can be borrowed as objects.

The point about atomicity is really good! That does make lock-free data structures more difficult.

> Regarding Java vs C++, the issue is not that C++ had a bad type system but rather that templates and generics are totally different.

I understand that, I was trying to say that Java can't use C++ strategy of adjusting the pointers. A better example would probably be something involving function pointers: In theory, Derived* () can be a subtype of Base* () in C++. But, if you need to coerce Derived * to Base * for upcasting, you'll also have to adjust functions for upcasting.

Nice writeup. It's especially nice to have the comparison of all the different language techniques in one place.

Regarding this:

> Go calls back into the runtime library for interface casts, which looks up the itable in a global hash table, and creates a new itable if none was found in the hash table. To construct a new itable the methods are looked up by name in the concrete type’s global vtable. This is quite flexible but implies a horrendous runtime cost, even on the “happy path” where the itable already exists. As soon as interfaces are involved, Go is more like Python than C or C++.

I don't quite understand why you say this is a "horrendous runtime cost". The key insight for Go was that in a statically typed language, type conversions happen far less often than method calls, so doing the work on the type conversion is actually quite cheap. Also, the "happy path" for an interface->interface conversion is a single hash table lookup using a key that is an XOR of two values fetched out of the corresponding types (concrete source type and target interface). It's cheap and lock-free. The first time a given conversion happens must construct a table, of course, but anything that affects performance has to happen many many times to do so, and only the first costs anything. And even that first conversion is only O(n) in the number of methods.

Empirically, in our production systems at Google, where we have continuous profiling, runtime.convI2I accounts for less than 0.007% (that is, less than 70ppm) of our CPU time spent in Go programs. For us, at least, it is not a "horrendous runtime cost".