The Fastest VM Bytecode Interpreter

by bysin on 2010-11-21

Mat and I were scanning through github one day and a pretty lengthy, complex piece of code caught our eye (Caution: Do not read if you’re prone to seizures or have a heart condition). This code is one of the many intricacies involved in mono’s bytecode interpreter, and it was beautiful, at least to us. Why must it be so complex? How hard can it be? After a lengthy discussion, we decided the best thing to do at this point was have a competition to see which one of us can write the fastest VM bytecode interpreter in two hours. A few insults later (“Your mother’s filesystem is so fat etc.”), we decided to set a few ground rules and agree on a benchmark.

I came up with a pretty simple piece of code that contains addition, multiplication, and conditional branches. I then generated a generic bytecode equivalent of the benchmark to be used in our interpreter.

Benchmark in C Benchmark in Pseudo-bytecode
int s, i, j;
for (s = 0, i = 0; i < 10000; i++) {
	for (j = 0; j < 10000; j++)
		s += i * j;
printf("%d\n", s);
0: load 0
1: store r0
2: load 0
3: store r1
4: jmp 26
5: load 0
6: store r2
7: jmp 18
8: load r0
9: load r1
10: load r2
11: mul
12: add
13: store r0
14: load r2
15: load 1
16: add
17: store r2
18: load r2
19: load 10000
20: cmp
21: blt 8
22: load r1
23: load 1
24: add
25: store r1
26: load r1
27: load 10000
28: cmp
29: blt 5
30: eof

As a control, the C benchmark runs on the native machine at an average time of 0.233 secs. In my first attempt, I wrote a simple C program that reads each instruction and jumps to a corresponding block of code.

int vm(inst *i) {
	static void *optable[]={
		[OP_NOP] = &&op_nop,	[OP_LDI] = &&op_ldi,	[OP_LDR] = &&op_ldr,
		[OP_STO] = &&op_sto,	[OP_ADD] = &&op_add,	[OP_SUB] = &&op_sub,
		[OP_MUL] = &&op_mul,	[OP_DIV] = &&op_div,	[OP_MOD] = &&op_mod,
		[OP_ORR] = &&op_orr,	[OP_XOR] = &&op_xor,	[OP_AND] = &&op_and,
		[OP_SHL] = &&op_shl,	[OP_SHR] = &&op_shr,	[OP_NOT] = &&op_not,
		[OP_NEG] = &&op_neg,	[OP_CMP] = &&op_cmp,	[OP_BEQ] = &&op_beq,
		[OP_BNE] = &&op_bne,	[OP_BGT] = &&op_bgt,	[OP_BLT] = &&op_blt,
		[OP_BGE] = &&op_bge,	[OP_BLE] = &&op_ble,	[OP_CAL] = &&op_cal,
		[OP_JMP] = &&op_jmp,	[OP_RET] = &&op_ret,	[OP_EOF] = &&op_eof
	int r[4], s[32], *sp = s;
	inst *ip = i;
	op_nop:                           goto *(++ip)->jmp;
	op_ldi:     *sp++ = ip->arg;      goto *(++ip)->jmp;
	op_ldr:     *sp++ = r[ip->arg];   goto *(++ip)->jmp;
	op_sto:     r[ip->arg] = *--sp;   goto *(++ip)->jmp;
	op_add:     sp--, sp[-1] += *sp;  goto *(++ip)->jmp;
	op_sub:     sp--, sp[-1] -= *sp;  goto *(++ip)->jmp;
	op_mul:     sp--, sp[-1] *= *sp;  goto *(++ip)->jmp;

Click here to download the Full Source Code.

This works by creating an array of goto pointers (I believe this is a gcc extension), a pseudo-stack, and a list of registers, then jumping to each instruction while increasing the instruction pointer. This simple virtual machine executed the bytecode in 6.421 secs, which was way too slow for my taste, so I had to figure out another approach.

Why don’t I just compile the bytecode into x86 machine code, like modern JIT VMs? That could easily be my ticket to victory. I had about an hour left in the competition so I made haste. I began to replace the optable full of goto addresses into x86 instructions, then allocated some executable memory, copied the instructions, and jumped to it.

int vm(inst *i) {
	struct {
		int32_t size, arg, jmp;
		char data[16];
	} *op, optable[]={
		INS(op_nop, 0, 0, 0, 0x90),
		INS(op_ldi, 4, 1, 0, 0x68),
		INS(op_ld0, 0, 0, 0, 0x53),
		INS(op_add, 0, 0, 0, 0x58, LONG 0x01, 0x04, 0x24),
		INS(op_sub, 0, 0, 0, 0x58, LONG 0x29, 0x04, 0x24),
		INS(op_mul, 0, 0, 0, 0x5a, LONG 0x8b, 0x04, 0x24, LONG 0x0f,
			0xaf, 0xc2, LONG 0x89, 0x04, 0x24),
		INS(op_ble, 4, 0, 2, 0x0f, 0x8e),
		INS(op_cal, 4, 0, 1, 0xe8),
		INS(op_jmp, 4, 0, 1, 0xe9),
		INS(op_ret, 0, 0, 0, 0xc3),
	if (!(pn = mmap(0, m, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANON, -1, 0)))
		return 0;
	((void(*)()) pn)();
	printf("%d\n", r0);
	return 0;

Click here to download the Full Source Code.

On runtime this created a small, 122 byte x86 program based upon the benchmark bytecode which clocked in at an average speed of 0.518 secs. This was only around twice as slow as the control so I was fairly confident at this point.

I slickly inquired into what Mat was working on, and he informed me he was writing his bytecode interpreter in Visual Basic.NET. I was a bit skeptical at first, considering he did not know Visual Basic, but was reassured he wasn’t joking. Evidently he taught himself Visual Basic in the span of 2 hours to what amounts to be the ultimate coding troll. He’s not one to lose these competitions, so I assumed he has some trick up his sleeve. He submitted his code for approval:

	For ip = 0 to ops.Length - 1
		Dim i as VMI = ops(ip)
		Select Case i.Opcode
			Case "ldi"
				il.Emit(Opcodes.Ldc_i4, i.Operand)
			Case "ldr" 
				il.Emit(Opcodes.Ldloc, i.Operand)
			Case "sto"
				il.Emit(Opcodes.Stloc, i.Operand)
			Case "jmp"
				il.Emit(Opcodes.Br_S, jmp(i.operand))
			Case "mul"
			Case "add"
			Case "eof"
				il.Emit(Opcodes.Ldloc, 0)
			Case "cmp"
				Select Case (ops(ip+1).opcode)
					Case "blt"
						il.Emit(Opcodes.Blt, jmp(ops(ip+1).operand))
					Case Else
						Console.WriteLine("unsupported branch: {0}", ops(ip+1).opcode)
				End Select
				ip = ip + 1
			Case Else
				Console.WriteLine("unsupported opcode: {0}", i.Opcode)
		End Select

	Console.WriteLine("{0}", _
		CType(program.CreateDelegate(GetType(tmplP1(Of Long, Integer))), tmplP1(Of Long, Integer))(0))

Click here to download the Full Source Code.

Once compiled, his code ran the benchmark at an average speed of 0.127 secs….. Wait, what?

# vbnc simvm.vb && time mono simvm.exe

real	0m0.127s
user	0m0.120s
sys	0m0.000s

I wouldn’t have believed it if I didn’t see it myself. My code generates native Assembly… Assembly! And his is written in Visual Basic. I’m sure there is some trickery going on, like mono optimizing the emitted instructions, but I haven’t as of yet ruled out witchcraft. I was forced to conclude that Visual Basic is faster than Assembly, that I’m a horrible coder, and Mat wins.

Program Benchmark Time
Control 0.233 secs
Ben’s C/x86 ASM VM 0.518 secs
Mat’s Visual Basic VM 0.127 secs


UPDATE: Its been mentioned that I didn’t compile the control with optimization on. I turned optimization off because gcc is way too damn smart. It almost literally translated the code into ‘printf(“857419840\n”);’. I think a better example would be if we didn’t give gcc the answer on compile-time, since none of the VM’s were given that opportunity until it read the instructions on run-time. The VM’s did not, and could not know ahead of time the loop amount, or even the general flow of the bytecode for that matter. So by saving the loop amount in a variable declared as volatile, you prevent gcc from optimizing it out:

#include <stdio.h>

int main(int argc, char **argv) {
    volatile int k = 10000;
    int s, i, j;
    for (s = 0, i = 0; i < k; i++) {
        for (j = 0; j < k; j++)
            s += i * j;
    printf("%d\n", s);
    return 0;

That code compiled with -O3 runs at 0.085 sec on my machine. Surprisingly its only 66% faster then the Visual Basic example.


nux November 21, 2010 at 15:59

Nerds…………………… ..

Volkan YAZICI November 21, 2010 at 16:58

Doesn’t there exist a loop-unroll optimizer for Mono bytecode interpreter? I’d expect Ben’s C/x86 ASM VM to run in same amount of time if the nested loops are submitted as a single loop. BTW, such a nerdy friendship! Cograts!

aaaaaaaaasdddddddddeefsw November 21, 2010 at 18:33

“You’re mother’s …” = “You are mother’s …”
WTF does that mean?

You mean “Your mother’s …”, right?

bysin November 21, 2010 at 18:44

Fixed a typo, thank you.

Nadav November 21, 2010 at 18:48

I liked this post. I am interested in making/seeing a comparison to the LLVM JIT. I am not sure about .NET, but LLVM does not have partial unrolling and vectorization, so its hard for me to estimate who might win this.

CBC November 21, 2010 at 18:52

Visual Basic is faster than Assembly

sorx November 21, 2010 at 19:37

It’s Mono’s loop-unroll optimizer like Volkan YAZICI said:) To be fair add -O3 to gcc to get optimized machine code timings and then compare! Who’s faster now?

$ time ./o0

real 0m0.289s
user 0m0.284s
sys 0m0.008s

$ time ./o3

real 0m0.002s
user 0m0.000s
sys 0m0.000s

Falaina November 21, 2010 at 20:20

You really should’ve mentioned the control program is unoptimized, as now it will distract from an actually useful point: being able to translate programs in arbitrary languages into bytecode for a heavily optimized platform is a huge boon to anyone writing a language implementation.

This is why you see all these language implementations popping up on the JVM (and to a lesser extent the CLR): It has a fairly robust instruction set, and you get to leverage all that work Sun put in to make the JVM fast. Your friend’s interpreter is a good example of the above point.

Name not required November 21, 2010 at 21:42


dmn November 21, 2010 at 22:27

First: It does raise a question of what is the minimum set of opcodes necessary? So, the JVM has a ton of opcodes, as does Parrot, etc., but could any of those be considered fluff?

Second: I may be completely off-base here, but isn’t most of the runtime of a VM taken up with I/O and translation to opcodes?

I am, admittedly, a VM-gizzards noob.

bysin November 21, 2010 at 22:47

The minimum? One.

I suppose you can implement a RISC VM, but the more instructions the better. There is little to no penalty in speed for increased instruction sets.

I had thought at one point the time it took for JIT to process would hinder the speed of the application. Yet, the speed improvement seems to outweigh the overhead.

earl November 22, 2010 at 00:49

“There is little to no penalty in speed for increased instruction sets.”

Well, actually there is. Once the hot code of your VM interpreter blows the physical CPU’s instruction cache, you’ll encounter a _severe_ performance hit.

Anders November 21, 2010 at 22:49

You can get away with just a single (complex) opcode and still be turing complete. It’s called OISC.

John November 22, 2010 at 03:05

What did you compile the C program with? Gcc? Icc? Llvm? Which level of optimization? One thing beating the unoptimized C, another a highly optimized one:)

James November 22, 2010 at 03:08

Imports System.Diagnostics

dim sw as new Stopwatch

Console.WriteLine(“Elapsed: {0}”,sw.Elapsed)
Console.WriteLine(“In milliseconds: {0}”,sw.ElapsedMilliseconds)
Console.WriteLine(“In timer ticks: {0}”,sw.ElapsedTicks)

and got:

c:\temp>vbc /optimize simVB.vb
Microsoft (R) Visual Basic Compiler version 10.0.30319.1
Copyright (c) Microsoft Corporation. All rights reserved.

Elapsed: 00:00:00.0763237
In milliseconds: 76
In timer ticks: 198265

Unless I’m reading that wrong…that’s faster than -O3

Been a while since I’ve used mono – but I’m pretty sure it has /optimize…

Bob Foster November 22, 2010 at 03:47

Actually fascinating. Most of us wish we had a bud like Mat to challenge us!

exaggeration, but it was quite good November 22, 2010 at 05:07

This is the greatest blog post of all time.

Bruno November 22, 2010 at 10:40

Of course it does. JIT compilation is not a 0-time expenditure. However, if you make a very slight change (already rocking that the change is so slight) and export a .exe file…then compare that to optimized C. It’ll be fairly close, maybe 2x (optimized C vs optimized .NET plus overhead). Both will end up running a “save then print a precompiled constant”, of course.

That you can get .NET optimization so free is something pretty awesome. I didn’t see in the article anything showing which optimization options were used, though I did see that none were in comments, and my other comments here reflect that.

Of course 100% optimized C code will outrun any VM. Can OP give us a fully optimized control for reference, anyway? On the box that ran the other tests?

E: And about your edit…I THOUGHT so. I had trouble finding anything for sure, but I was pretty damn sure we could unroll the loop when the Delegate compiled.

Duran November 22, 2010 at 10:41

Technically, when you turn source code into a parse tree (or other internal representation) it’s no longer a pure interpreter.

If that’s too technical, how about this: Compiling to bytecode is still compiling, because the only difference between bytecode and machine code is whether there is a (possibly microcoded) hardware implementation.

Bill November 23, 2010 at 05:33

In simvm-slow.c, what’s the rationale for using:

#define NEXT() __asm__(“jmp *%0″::”r”((++ip)->jmp)); goto *ip->jmp

instead of

#define NEXT() goto *(++ip)->jmp


A.T. November 23, 2010 at 09:22

love that assumption “world is 99% built on x86″, caused mainly by illiteracy about any other platforms or refuse to admit that all their *elegance* will go down pipes once assumption is removed… :)))

Bored November 24, 2010 at 01:51

I got bored and decided to see if, by your logic, C was faster than Assembly too!

Here is my implementation of your VM.

Basically, it build a C file, calls gcc, and runs the executable.

$ time ./bench # original benchmark code

real 0m0.396s
user 0m0.390s
sys 0m0.003s

$ time ./simvm # your asm version

real 0m0.759s
user 0m0.760s
sys 0m0.000s

$ time ./rofl_vm # my version

real 0m0.208s
user 0m0.143s
sys 0m0.033s


Note: simvm and simvm were built using -O4 and bench was built with -O0

I would have tested the VB version but mono kept crashing on CreateDelegate.

P.S. I agree with your point, I just wanted to post that for comedic effect.

snify November 24, 2010 at 18:09

I’ve written a similar thing, but the opcode generation was more generic using an assembler. Nice source anyway.

undef November 26, 2010 at 10:50

hahaha, interesting article

Comments on this entry are closed.

Previous post: