
No Two Hashes Are Equal

Feb. 8th, 2008 | 10:39 am

I have almost always found that the Hsieh algorithm turns out to be the fastest hashing algorithm. So this morning, while waiting for a compile, I popped it into Memcached to see what the improvement would be:

With Hsieh:
Testing generate_pairs 2.396 [ ok ]
Testing generate_data 5.100 [ ok ]
Testing get_read 5.491 [ ok ]
Testing delete_generate 4.587 [ ok ]
Testing generate_buffer_data 1.361 [ ok ]
Testing delete_buffer 0.880 [ ok ]
Testing generate_data 4.984 [ ok ]
Testing mget_read 1.817 [ ok ]
Testing mget_read_result 1.743 [ ok ]
Testing mget_read_function 1.782 [ ok ]
Testing cleanup 0.046 [ ok ]
Testing generate_large_pairs 0.390 [ ok ]
Testing generate_data 39.899 [ ok ]
Testing generate_buffer_data 0.058 [ ok ]
Testing cleanup 0.001 [ ok ]

With default:
Testing generate_pairs 2.405 [ ok ]
Testing generate_data 5.081 [ ok ]
Testing get_read 5.399 [ ok ]
Testing delete_generate 4.494 [ ok ]
Testing generate_buffer_data 1.284 [ ok ]
Testing delete_buffer 0.894 [ ok ]
Testing generate_data 4.941 [ ok ]
Testing mget_read 1.793 [ ok ]
Testing mget_read_result 1.785 [ ok ]
Testing mget_read_function 1.748 [ ok ]
Testing cleanup 0.037 [ ok ]
Testing generate_large_pairs 0.390 [ ok ]
Testing generate_data 39.897 [ ok ]
Testing generate_buffer_data 0.058 [ ok ]
Testing cleanup 0.001 [ ok ]

It is slower!

Memcached uses the hash published by Bob Jenkins (http://burtleburtle.net/bob/hash/doobs.html), and Paul Hsieh originally wrote that his algorithm was faster than Jenkins' (http://www.azillionmonkeys.com/qed/hash.html).

Hmmm something smells fishy...
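
For anyone who wants to poke at this outside of Memcached, here is a minimal standalone timing sketch. It is not the test suite that produced the numbers above, and the djb2 hash is only a stand-in so the sketch compiles on its own - swap in Hsieh's or Jenkins' function from the links above for a real comparison:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Stand-in hash (Bernstein's djb2) - replace with the function
   under test. */
static uint32_t hash_under_test(const char *key, size_t len)
{
	uint32_t h = 5381;
	for (size_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char)key[i];
	return h;
}

int main(void)
{
	enum { KEYS = 1000000 };
	static char keys[KEYS][16];

	/* Short, similar-looking keys, roughly the shape memcached sees. */
	for (int i = 0; i < KEYS; i++)
		snprintf(keys[i], sizeof keys[i], "key-%d", i);

	clock_t start = clock();
	uint32_t sink = 0;	/* keeps the loop from being optimized away */
	for (int i = 0; i < KEYS; i++)
		sink ^= hash_under_test(keys[i], strlen(keys[i]));
	double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

	printf("hashed %d keys in %.3f s (checksum %u)\n",
	       KEYS, secs, (unsigned)sink);
	return 0;
}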


Comments (7)

Tanjent

from: tanjent
date: Feb. 8th, 2008 08:25 pm (UTC)

You might find this interesting - http://home.comcast.net/~bretm/hash/

I was playing around with crypto and mixing functions a while back and found that the best bang for the buck came from a combination of rotates and integer multiplies - something like

#include <stdint.h>

/* Portable left-rotate (the original used MSVC's _rotl intrinsic);
   assumes 0 < shift < 32. */
static inline uint32_t rotl32(uint32_t x, int shift)
{
	return (x << shift) | (x >> (32 - shift));
}

/* myvalue*, bigprime*, shift* and reps are placeholders - the
   original constants are lost (see below). */
uint32_t a = myvalue1;
uint32_t b = myvalue2;

for (int i = 0; i < reps; i++)
{
	a *= bigprime1;		/* multiply by a large prime */
	a = rotl32(a, shift1);	/* rotate to spread the changed bits */
	b += a;

	b *= bigprime2;
	b = rotl32(b, shift2);
	a += b;
}


Don't remember what I worked out for the constants though...


Brian "Krow" Aker

from: krow
date: Feb. 8th, 2008 08:31 pm (UTC)

How good is the distribution?


Tanjent

from: tanjent
date: Feb. 8th, 2008 08:48 pm (UTC)

It avalanched fine, so distribution should be as good as any other avalanchey mixing function. Lemme see if I can dig it up.
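
For a concrete notion of "avalanched fine": flip one input bit and count how many of the 32 output bits change - a strong mixer averages close to 16. A minimal sketch along those lines, using a single multiply+rotate round with arbitrary placeholder constants (not the forgotten ones from above):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One multiply+rotate round in the spirit of the snippet above.
   The constants are arbitrary picks for illustration. */
static uint32_t mix(uint32_t x)
{
	x *= 0x9e3779b1u;		/* a large odd constant */
	return (x << 13) | (x >> 19);	/* rotate left by 13 */
}

static int popcount32(uint32_t x)
{
	int n = 0;
	for (; x; x &= x - 1)
		n++;
	return n;
}

int main(void)
{
	const int trials = 100000;
	long flipped = 0;

	for (int t = 0; t < trials; t++) {
		uint32_t x = (uint32_t)rand() ^ ((uint32_t)rand() << 16);
		int bit = rand() % 32;
		/* Same input hashed twice, one bit apart. */
		flipped += popcount32(mix(x) ^ mix(x ^ (1u << bit)));
	}

	/* Near 16.00 means good avalanche; far below means a weak mixer. */
	printf("average output bits flipped: %.2f / 32\n",
	       (double)flipped / trials);
	return 0;
}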


Поисковик-затейник

from: itman
date: Feb. 8th, 2008 08:57 pm (UTC)

You never know if this improvement is data-set specific.


Brian "Krow" Aker

from: krow
date: Feb. 9th, 2008 06:52 pm (UTC)

Yep, the data is pretty random... but the key sizes are pretty much dead on.


Поисковик-затейник

from: itman
date: Feb. 9th, 2008 07:37 pm (UTC)

> Yep, the data is pretty random

All random numbers are born equally random, but some are more random than others :-). I mean that different probability distributions might yield very different results.

> but the key sizes are pretty much dead on.

I am not sure I understand what the issue with key sizes is.


Tanjent

from: tanjent
date: Feb. 9th, 2008 12:54 am (UTC)

So, I've been playing around with this today and I think I can make a few guesses about the results you're seeing -

1. The Hsieh hash is actually faster, but you're spending more time overall due to hash collisions.

2. The keys you're hashing are very similar to each other ("aaaab", "aaaba", "caaab", etc.)

Here's what I think is happening -

Suppose I've got a mix function M that scrambles bits in a block B to produce B'. If M is a strong mixing function, it'll flip each bit in B with a probability of around 0.5 no matter what B is. If M is weak, it may only flip bits with a probability of 0.01 - however, if I repeatedly apply M to a block a large number of times, it can be the case that the cumulative result eventually flips bits with a probability near 0.5.

Now suppose I build a hash function like this -

typedef uint32_t Block;	// some fixed-size block, e.g. 32 bits

Block Hash ( Block * key, int count )
{
	Block state = 0;

	// Compute the hash, folding in one block of key per mix

	for(int i = 0; i < count; i++)
	{
		state ^= key[i];

		state = Mix(state);	// Mix is the (possibly weak) function M above
	}

	// Post-hash scrambling

	for(int i = 0; i < 1000000; i++)
	{
		state = Mix(state);
	}

	return state;
}


Even if Mix is weak, the hash function as a whole will exhibit good avalanche properties due to the post-hash scrambling - if I flip a single bit in the key, the scrambling will eventually flip around half the output bits.
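
A minimal sketch of that effect, with a made-up weak mix (nothing here is from the actual Hsieh or Jenkins code): apply a feeble multiply+rotate round repeatedly to two states that start one bit apart, and watch the difference creep toward the ideal ~16 of 32 bits:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Deliberately weak mix: a small odd multiplier diffuses very
   little per round, but repetition still builds up avalanche. */
static uint32_t weak_mix(uint32_t x)
{
	x *= 9;
	return (x << 7) | (x >> 25);	/* rotate left by 7 */
}

static int popcount32(uint32_t x)
{
	int n = 0;
	for (; x; x &= x - 1)
		n++;
	return n;
}

int main(void)
{
	const int trials = 100000;

	for (int rounds = 1; rounds <= 16; rounds *= 2) {
		long flipped = 0;
		for (int t = 0; t < trials; t++) {
			uint32_t x = (uint32_t)rand() ^ ((uint32_t)rand() << 16);
			uint32_t a = x;
			uint32_t b = x ^ (1u << (rand() % 32));	/* one-bit difference */
			for (int r = 0; r < rounds; r++) {
				a = weak_mix(a);
				b = weak_mix(b);
			}
			flipped += popcount32(a ^ b);
		}
		printf("%2d rounds: %5.2f / 32 bits differ\n",
		       rounds, (double)flipped / trials);
	}
	return 0;
}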

But, there's a catch -

Suppose I take a key K and make a copy K2, then flip one bit in the first block of K2 and two bits in the second block. Now suppose the mix function is so weak that on average it only flips 1 bit of the state. The chance that the hashes of K and K2 collide is actually quite high, especially if the block size is small - before the first mix the two states differ by 1 bit; after the first mix they may differ by only 2 bits (since the mix is weak); and the two bits flipped in the second block then have a chance to "cancel out" that difference. If the block size is 32 bits, that chance is about 1 in (32*31)/2 = 496, since that is the number of ways to choose the 2 bits. If the mix function had been a bit stronger and flipped 3 bits on average, the chance of the 4 bits flipped in K2[1] canceling out the difference would have been around 1 in (32*31*30*29)/(4*3*2*1) = 35960, etcetera.
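
(Those odds are just the binomial coefficients C(32,2) and C(32,4) - a quick sanity-check sketch:)

#include <stdio.h>

/* n choose k for small n; exact at every step of the loop. */
static unsigned long long choose(int n, int k)
{
	unsigned long long r = 1;
	for (int i = 1; i <= k; i++)
		r = r * (n - k + i) / i;
	return r;
}

int main(void)
{
	printf("C(32,2) = %llu\n", choose(32, 2));	/* 496   -> ~1 in 496 */
	printf("C(32,4) = %llu\n", choose(32, 4));	/* 35960 -> ~1 in 35960 */
	return 0;
}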

The Hsieh hash tries to improve performance by using a much, much weaker mix function and then compensating for it with some post-hash fixups - if the keys are randomish to begin with this isn't a problem, but if they're all quite similar they'll tend to collide more.

-tanjent
