For those of you who have Intel’s new Nehalem processor, there’s an interesting new instruction called CRC32 that speeds up checksum calculation. It is part of the SSE4.2 set and, like most SSE instructions, it’s fairly useless. But I just spent my hard-earned money on a new processor, and I’ll be damned if I don’t get my money’s worth, so here’s my evaluation of CRC32.

We’ll start off with a standard table-driven 32-bit checksum function, built on the same CRC-32C (Castagnoli) polynomial that the instruction uses:

#include <stdint.h>

uint32_t slowcrc_table[1 << 8];

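void slowcrc_init() {
  uint32_t i, j, a;

  /* The lookup in slowcrc() shifts right (bit-reflected form), so the
     table must be built the same way, using the reflected CRC-32C
     polynomial 0x82F63B78 (0x1EDC6F41 with its bits reversed). */
  for (i = 0; i < (1 << 8); i++) {
    a = i;
    for (j = 0; j < 8; j++) {
      if (a & 1)
        a = (a >> 1) ^ 0x82F63B78;
      else
        a = (a >> 1);
    }
    slowcrc_table[i] = a;
  }
}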

uint32_t slowcrc(char *str, uint32_t len) {
  uint32_t lcrc = ~0;
  char *p, *e;

  e = str + len;
  for (p = str; p < e; ++p)
    lcrc = (lcrc >> 8) ^ slowcrc_table[(lcrc ^ (*p)) & 0xff];
  return ~lcrc;
}
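
A quick way to sanity-check a table-driven CRC-32C routine is to compare it against the published check value for the polynomial: the ASCII string "123456789" should come out as 0xE3069283. The sketch below is a self-contained version of that test. The names crc32c_ref_init, crc32c_ref, and crc32c_selftest are hypothetical, and it assumes the bit-reflected form of the polynomial (0x82F63B78), which is what a right-shifting table lookup expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32c_ref_table[256];

/* Build the byte-wise table for reflected CRC-32C (hypothetical helper). */
static void crc32c_ref_init(void) {
  for (uint32_t i = 0; i < 256; i++) {
    uint32_t a = i;
    for (int j = 0; j < 8; j++)
      a = (a & 1) ? (a >> 1) ^ 0x82F63B78u : (a >> 1);
    crc32c_ref_table[i] = a;
  }
}

/* Conventional CRC-32C framing: seed with all ones, invert at the end. */
static uint32_t crc32c_ref(const void *buf, size_t len) {
  const unsigned char *p = buf;
  uint32_t crc = ~0u;
  while (len--)
    crc = (crc >> 8) ^ crc32c_ref_table[(crc ^ *p++) & 0xff];
  return ~crc;
}

/* Aborts if the routine disagrees with the CRC-32C check value. */
static void crc32c_selftest(void) {
  crc32c_ref_init();
  assert(crc32c_ref("123456789", 9) == 0xE3069283u);
}
```

Calling crc32c_selftest() once at startup is enough to catch a wrong polynomial or a table built in the wrong bit order.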

Not including the table setup, the standard checksum function took 0.30 seconds to process a random 64 MB string. Unfortunately, the compiler I’m using doesn’t support SSE4.2 instructions yet, so I’m forced to write the hardware checksum function as raw opcode bytes in inline assembly.

uint32_t fastcrc(char *str, uint32_t len) {
  uint32_t q = len / sizeof(uint32_t),
    r = len % sizeof(uint32_t),
    *p = (uint32_t*) str, crc;

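  crc = 0;  /* unlike slowcrc, no ~0 seed and no final inversion */
  while (q--) {
    /* f2 0f 38 f1 f1 encodes "crc32l %ecx, %esi": fold 4 bytes into crc */
    __asm__ __volatile__(
      ".byte 0xf2, 0xf, 0x38, 0xf1, 0xf1"
      :"=S" (crc)
      :"0" (crc), "c" (*p)
    );
    p++;
  }

  /* Handle the remaining 0-3 bytes one at a time. */
  str = (char*) p;
  while (r--) {
    /* f2 0f 38 f0 f1 encodes "crc32b %cl, %esi": fold 1 byte into crc */
    __asm__ __volatile__(
      ".byte 0xf2, 0xf, 0x38, 0xf0, 0xf1"
      :"=S" (crc)
      :"0" (crc), "c" (*str)
    );
    str++;
  }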

  return crc;
}
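
Compilers that do support SSE4.2 expose the same instruction through the _mm_crc32_u8 and _mm_crc32_u32 intrinsics in <nmmintrin.h>, which avoids hand-encoding opcodes. The following is a sketch under a few assumptions, not a drop-in replacement for fastcrc. The name crc32c_hw is hypothetical, and the target attribute assumes GCC or Clang on x86. It also seeds with ~0 and inverts the result the way slowcrc does, so its return values differ from fastcrc's.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The target attribute (GCC/Clang) enables SSE4.2 for this function only,
   so the file builds without -msse4.2. The CPU must still support it. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const void *buf, size_t len) {
  const unsigned char *p = buf;
  uint32_t crc = ~0u;

  /* Fold four bytes per instruction; memcpy avoids unaligned-pointer casts. */
  while (len >= sizeof(uint32_t)) {
    uint32_t word;
    memcpy(&word, p, sizeof word);
    crc = _mm_crc32_u32(crc, word);
    p += sizeof word;
    len -= sizeof word;
  }

  /* Fold the remaining 0-3 bytes one at a time. */
  while (len--)
    crc = _mm_crc32_u8(crc, *p++);

  return ~crc;
}
```

On a 64-bit build, _mm_crc32_u64 can fold eight bytes per instruction in the same way.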

The hardware-accelerated checksum instruction processed the same 64 MB of random data in 0.05 seconds. That’s roughly 5.7 times faster than the standard checksum function.

Function    Average Exec. Time
slowcrc     0.303066 seconds
fastcrc     0.052982 seconds
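
The post does not show how these averages were measured, so the harness below is only one plausible way to take such a measurement. The names checksum_fn, xor_bytes, and time_checksum are hypothetical, and xor_bytes is merely a stand-in so the sketch compiles on its own. It uses clock() from <time.h>, which reports CPU time rather than elapsed time.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Signature shared by slowcrc and fastcrc in the post. */
typedef uint32_t (*checksum_fn)(char *str, uint32_t len);

/* Trivial stand-in checksum, here only to make the sketch self-contained. */
static uint32_t xor_bytes(char *str, uint32_t len) {
  uint32_t acc = 0;
  while (len--)
    acc ^= (unsigned char) *str++;
  return acc;
}

/* Returns the mean CPU seconds per call over `runs` calls on `len` random
   bytes, or -1.0 if allocation fails or `runs` is not positive. */
static double time_checksum(checksum_fn fn, uint32_t len, int runs) {
  char *buf = malloc(len ? len : 1);
  volatile uint32_t sink = 0;  /* keeps the calls from being optimized away */
  clock_t start, end;

  if (buf == NULL || runs <= 0) {
    free(buf);
    return -1.0;
  }
  for (uint32_t i = 0; i < len; i++)
    buf[i] = (char) rand();

  start = clock();
  for (int i = 0; i < runs; i++)
    sink ^= fn(buf, len);
  end = clock();

  free(buf);
  (void) sink;
  return (double) (end - start) / CLOCKS_PER_SEC / runs;
}
```

With the functions from this post in scope, a call such as time_checksum(slowcrc, 64u << 20, 10) would time ten passes over a 64 MB buffer.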