Mat and I were scanning through github one day and a pretty lengthy, complex piece of code caught our eye (Caution: Do not read if you’re prone to seizures or have a heart condition). This code is one of the many intricacies involved in mono’s bytecode interpreter, and it was beautiful, at least to us. Why must it be so complex? How hard can it be? After a lengthy discussion, we decided the best thing to do at this point was have a competition to see which one of us can write the fastest VM bytecode interpreter in two hours. A few insults later (“Your mother’s filesystem is so fat etc.”), we decided to set a few ground rules and agree on a benchmark.

I came up with a pretty simple piece of code that contains addition, multiplication, and conditional branches. I then generated a generic bytecode equivalent of the benchmark to be used in our interpreter.

Benchmark in C Benchmark in Pseudo-bytecode
int s, i, j;
for (s = 0, i = 0; i < 10000; i++) {
	for (j = 0; j < 10000; j++)
		s += i * j;
}
printf("%d\n", s);
0: load 0
1: store r0
2: load 0
3: store r1
4: jmp 26
5: load 0
6: store r2
7: jmp 18
8: load r0
9: load r1
10: load r2
11: mul
12: add
13: store r0
14: load r2
15: load 1
16: add
17: store r2
18: load r2
19: load 10000
20: cmp
21: blt 8
22: load r1
23: load 1
24: add
25: store r1
26: load r1
27: load 10000
28: cmp
29: blt 5
30: eof

As a control, the C benchmark runs on the native machine at an average time of 0.233 secs. In my first attempt, I wrote a simple C program that reads each instruction and jumps to a corresponding block of code.

...
int vm(inst *i) {
	static void *optable[]={
		[OP_NOP] = &&op_nop,	[OP_LDI] = &&op_ldi,	[OP_LDR] = &&op_ldr,
		[OP_STO] = &&op_sto,	[OP_ADD] = &&op_add,	[OP_SUB] = &&op_sub,
		[OP_MUL] = &&op_mul,	[OP_DIV] = &&op_div,	[OP_MOD] = &&op_mod,
		[OP_ORR] = &&op_orr,	[OP_XOR] = &&op_xor,	[OP_AND] = &&op_and,
		[OP_SHL] = &&op_shl,	[OP_SHR] = &&op_shr,	[OP_NOT] = &&op_not,
		[OP_NEG] = &&op_neg,	[OP_CMP] = &&op_cmp,	[OP_BEQ] = &&op_beq,
		[OP_BNE] = &&op_bne,	[OP_BGT] = &&op_bgt,	[OP_BLT] = &&op_blt,
		[OP_BGE] = &&op_bge,	[OP_BLE] = &&op_ble,	[OP_CAL] = &&op_cal,
		[OP_JMP] = &&op_jmp,	[OP_RET] = &&op_ret,	[OP_EOF] = &&op_eof
	};
	int r[4], s[32], *sp = s;
	inst *ip = i;
	...
	op_nop:                           goto *(++ip)->jmp;
	op_ldi:     *sp++ = ip->arg;      goto *(++ip)->jmp;
	op_ldr:     *sp++ = r[ip->arg];   goto *(++ip)->jmp;
	op_sto:     r[ip->arg] = *--sp;   goto *(++ip)->jmp;
	op_add:     sp--, sp[-1] += *sp;  goto *(++ip)->jmp;
	op_sub:     sp--, sp[-1] -= *sp;  goto *(++ip)->jmp;
	op_mul:     sp--, sp[-1] *= *sp;  goto *(++ip)->jmp;
	...
}

Click here to download the Full Source Code.

This works by creating an array of goto pointers (I believe this is a gcc extension), a pseudo-stack, and a list of registers, then jumping to each instruction while increasing the instruction pointer. This simple virtual machine executed the bytecode in 6.421 secs, which was way too slow for my taste, so I had to figure out another approach.

Why don’t I just compile the bytecode into x86 machine code, like modern JIT VMs? That could easily be my ticket to victory. I had about an hour left in the competition so I made haste. I began to replace the optable full of goto addresses into x86 instructions, then allocated some executable memory, copied the instructions, and jumped to it.

...
int vm(inst *i) {
	struct {
		int32_t size, arg, jmp;
		char data[16];
	} *op, optable[]={
		INS(op_nop, 0, 0, 0, 0x90),
		INS(op_ldi, 4, 1, 0, 0x68),
		INS(op_ld0, 0, 0, 0, 0x53),
		...
		INS(op_add, 0, 0, 0, 0x58, LONG 0x01, 0x04, 0x24),
		INS(op_sub, 0, 0, 0, 0x58, LONG 0x29, 0x04, 0x24),
		INS(op_mul, 0, 0, 0, 0x5a, LONG 0x8b, 0x04, 0x24, LONG 0x0f,
			0xaf, 0xc2, LONG 0x89, 0x04, 0x24),
		...
		INS(op_ble, 4, 0, 2, 0x0f, 0x8e),
		INS(op_cal, 4, 0, 1, 0xe8),
		INS(op_jmp, 4, 0, 1, 0xe9),
		INS(op_ret, 0, 0, 0, 0xc3),
		...
	};
	...
	if (!(pn = mmap(0, m, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANON, -1, 0)))
		return 0;
	...
	((void(*)()) pn)();
	printf("%d\n", r0);
	return 0;
}

Click here to download the Full Source Code.

On runtime this created a small, 122 byte x86 program based upon the benchmark bytecode which clocked in at an average speed of 0.518 secs. This was only around twice as slow as the control so I was fairly confident at this point.

I slickly inquired into what Mat was working on, and he informed me he was writing his bytecode interpreter in Visual Basic.NET. I was a bit skeptical at first, considering he did not know Visual Basic, but was reassured he wasn’t joking. Evidently he taught himself Visual Basic in the span of 2 hours to what amounts to be the ultimate coding troll. He’s not one to lose these competitions, so I assumed he has some trick up his sleeve. He submitted his code for approval:

	...
	For ip = 0 to ops.Length - 1
		Dim i as VMI = ops(ip)
		il.MarkLabel(jmp(ip))
		Select Case i.Opcode
			Case "ldi"
				il.Emit(Opcodes.Ldc_i4, i.Operand)
			Case "ldr" 
				il.Emit(Opcodes.Ldloc, i.Operand)
			Case "sto"
				il.Emit(Opcodes.Stloc, i.Operand)
			Case "jmp"
				il.Emit(Opcodes.Br_S, jmp(i.operand))
			Case "mul"
				il.Emit(Opcodes.Mul)
			Case "add"
				il.Emit(Opcodes.Add)
			Case "eof"
				il.Emit(Opcodes.Ldloc, 0)
				il.Emit(Opcodes.Ret)
			Case "cmp"
				Select Case (ops(ip+1).opcode)
					Case "blt"
						il.Emit(Opcodes.Blt, jmp(ops(ip+1).operand))
					Case Else
						Console.WriteLine("unsupported branch: {0}", ops(ip+1).opcode)
				End Select
				ip = ip + 1
			Case Else
				Console.WriteLine("unsupported opcode: {0}", i.Opcode)
		End Select
	Next

	Console.WriteLine("{0}", _
		CType(program.CreateDelegate(GetType(tmplP1(Of Long, Integer))), tmplP1(Of Long, Integer))(0))
	...

Click here to download the Full Source Code.

Once compiled, his code ran the benchmark at an average speed of 0.127 secs….. Wait, what?

# vbnc simvm.vb && time mono simvm.exe
857419840

real	0m0.127s
user	0m0.120s
sys	0m0.000s

I wouldn’t have believed it if I didn’t see it myself. My code generates native Assembly… Assembly! And his is written in Visual Basic. I’m sure there is some trickery going on, like mono optimizing the emitted instructions, but I haven’t as of yet ruled out witchcraft. I was forced to conclude that Visual Basic is faster than Assembly, that I’m a horrible coder, and Mat wins.

Program Benchmark Time
Control 0.233 secs
Ben’s C/x86 ASM VM 0.518 secs
Mat’s Visual Basic VM 0.127 secs

 

UPDATE: Its been mentioned that I didn’t compile the control with optimization on. I turned optimization off because gcc is way too damn smart. It almost literally translated the code into ‘printf(“857419840\n”);’. I think a better example would be if we didn’t give gcc the answer on compile-time, since none of the VM’s were given that opportunity until it read the instructions on run-time. The VM’s did not, and could not know ahead of time the loop amount, or even the general flow of the bytecode for that matter. So by saving the loop amount in a variable declared as volatile, you prevent gcc from optimizing it out:

#include <stdio.h>

int main(int argc, char **argv) {
    volatile int k = 10000;
    int s, i, j;
    for (s = 0, i = 0; i < k; i++) {
        for (j = 0; j < k; j++)
            s += i * j;
    }
    printf("%d\n", s);
    return 0;
}

That code compiled with -O3 runs at 0.085 sec on my machine. Surprisingly its only 66% faster then the Visual Basic example.

{ Comments on this entry are closed }

Free Content Delivery Network using DNS cache

October 27, 2010

Why spend money on expensive CDN hosting when there’s a perfectly good, free, global one available? Thats right, DNS cache. Most open recursive DNS servers will cache requests (A, CNAME, PTR, TXT, etc.) for the length of the specified TTL value, and there’s millions of them worldwide. Once a public DNS server has the records […]

Read the full article →

SSE4.2 and the new CRC32 instruction

October 13, 2010

For those of you who have the new Nehalem processor from Intel, there’s an interesting new instruction that is used to speed up calculating checksums called CRC32. This instruction is part of the SSE4.2 set, and just like most SSE instructions, its fairly useless. But I just spent my hard earned money on a new […]

Read the full article →

Containers, Templates, Lambda Expressions in C

October 12, 2010

Not many people know this, but the C language (with the help of gcc extensions) can support templates and lambda expressions. I know I’m going to get emails / comments about how I butchered the C language. So let me start out with a word of caution, this is for educational use only, and is […]

Read the full article →