SSE4.2 and the new CRC32 instruction

by bysin on 2010-10-13
Programming

For those of you who have the new Nehalem processor from Intel, there’s an interesting new instruction that is used to speed up calculating checksums called CRC32. This instruction is part of the SSE4.2 set, and just like most SSE instructions, its fairly useless. But I just spent my hard earned money on a new processor and I’ll be damned if I don’t get my moneys worth, so here’s my evaluation of CRC32.

We’ll start off with a standard 32-bit checksum function:

uint32_t slowcrc_table[1 << 8];

void slowcrc_init() {
	uint32_t i, j, a;

	for (i = 0; i < (1 << 8); i++) {
		a = ((uint32_t) i) << 24;
		for (j = 0; j < 8; j++) {
			if (a & 0x80000000)
				a = (a << 1) ^ 0x11EDC6F41;
			else
				a = (a << 1);
		}
		slowcrc_table[i] = a;
	}
}

uint32_t slowcrc(char *str, uint32_t len) {
	uint32_t lcrc = ~0;
	char *p, *e;

	e = str + len;
	for (p = str; p < e; ++p)
		lcrc = (lcrc >> 8) ^ slowcrc_table[(lcrc ^ (*p)) & 0xff];
	return ~lcrc;
}

Not including the table setup, the standard checksum function took 0.30 seconds to process a random 64 MB string. Unfortunately, the compiler I’m using currently doesn’t support SSE4.2 instructions, so I’m forced to write the hardware checksum function in byte code.

uint32_t fastcrc(char *str, uint32_t len) {
	uint32_t q = len / sizeof(uint32_t),
		r = len % sizeof(uint32_t),
		*p = (uint32_t*) str, crc;

	crc = 0;
	while (q--) {
		__asm__ __volatile__(
			".byte 0xf2, 0xf, 0x38, 0xf1, 0xf1;"
			:"=S" (crc)
			:"0" (crc), "c" (*p)
		);
		p++;
	}

	str = (char*) p;
	while (r--) {
		__asm__ __volatile__(
			".byte 0xf2, 0xf, 0x38, 0xf0, 0xf1"
			:"=S" (crc)
			:"0" (crc), "c" (*str)
		);
		str++;
	}

	return crc;
}

The hardware accelerated checksum instruction processed the same 64 MB of random data in 0.05 seconds. That’s around 6 times faster then the standard checksum function.

Function Average Exec. Time
slowcrc 0.303066 seconds
fastcrc 0.052982 seconds

{ 17 comments }

AJ October 13, 2010 at 11:24

SSE instructions are not useless, they are used in video/image decoding, math formulas and other related algorithms. Even your post shows that SSE isn’t useless, as the CRC32 instruction is 6 times faster than raw C.

Nutless critic October 13, 2010 at 11:48

Testing with a 64-megabyte block, you’re mostly benchmarking memory bus speed. Try again with block sizes that fit in the L1 cache, ones that fit in the L2 cache, and don’t forget to pre-warm your cache first and account for benchmark noise.

Before you publish your next half-baked six-figure precision benchmark article, anyway.

Architecture Hacker October 14, 2010 at 05:56

I think the article was meant to be interesting and not a 100% accurate benchmark. Also, that isn’t benchmarking memory bus speed if both of them are 64MB now is it? I mean if you test it in cache you can argue that you’re testing cache line size and n-set associativity depending on the size of the block.

I agree it would be good to do all different block sizes and average the results but… to say it’s only benchmarking memory bus speed is unfair.

anonymous October 13, 2010 at 12:37

Do this a hundred times, 0.3s is too short for a precise measurement.

Ivan October 13, 2010 at 12:44

Would be nice to have TIGER computational instruction.

Essential for any file synchronization and p2p activities involving untrusted parties.

Anon Commenter October 13, 2010 at 12:57

Why do you write code like that, with no spaces? I find it very hard to read, even after years of having to deal with it in other people’s code. Use some damn spaces, man.

Trail October 13, 2010 at 15:36

Sweet. Thanks for doing this experiment!

solrize October 13, 2010 at 16:21

It’s pretty crazy that the instruction only works with that one specific polynomial instead of letting you load your own. There’s lots of different 32-bit crc’s in use for various protocols, and this instruction can handle just one of them.

Dude October 13, 2010 at 19:17

The main goal is probably power reduction for server applications, not speedup. A hard-coded polynomial is probably an order of magnitude more power efficient (no muxing in the data path).

Zach October 13, 2010 at 16:55

So wait… It’s 6 times faster, but somehow it’s fairly useless? I don’t get it.

dar7yl October 13, 2010 at 17:51

The question is, did the machine instruction produce the same answer as the checksum function?

Douglas October 13, 2010 at 23:09

To answer various criticisms:
“fairly useless”: it is useless unless you want this particular niche CRC. That is, unless you have SCSI disks, AFAICT.

“0.3s is too short; do it 100 times”: the table is headed “Average Exec. Time”

“you’re mostly benchmarking memory bus speed”: then how does it speed up 6 times?

“no spaces”: Agreed! but nevermind.

Chris Chiesa October 14, 2010 at 00:55

Just FYI, Digital Equipment Corporation’s VAX processors had a CRC instruction back in the late 1970s… it’s nothing new.

Andrew October 14, 2010 at 01:04

Any chance you could look at the string processing and publish another blog post for those? strlen/strchr

Nick October 14, 2010 at 02:44

Wow, lots of negativity in the comments. I, for one, appreciate the article and the insight, whether or not it’s 100% accurate, the results seems statistically significant.

dna October 27, 2010 at 15:04

Very nice information.

dna October 27, 2010 at 17:14

I found your blog on google and read a few of your other posts. I just added you to my Google News Reader. Keep up the good work Look forward to reading more from you in the future.

Comments on this entry are closed.

{ 2 trackbacks }

Previous post:

Next post: