Brian "Krow" Aker (krow) wrote,
Brian "Krow" Aker

Thoughts on hashes.... libmemcached...

Inside of libmemcachd (aka the library behind the memcache engine for MySQL) I have a big stack of hash algorithms you can pick to use. libmemcached splits hashing into two concepts.

  • Hash used for key distribution
  • Algorithm used for distribution

    For the hashing there is a default method that I use. It is pretty generic in nature. There are two other primary methods are MD5 and Hsieh.

    So what does the performance look like for each of these?
    Testing generate_data                           14.183
    Testing get_read                                     11.144
    Testing mget_read	                                 1.992
    Testing mget_read_result                      1.911

    Testing generate_data			         13.880
    Testing get_read					 10.457
    Testing mget_read					 1.769
    Testing mget_read_result			 1.859

    Testing generate_data		                14.634
    Testing get_read				        12.253
    Testing mget_read				        1.953
    Testing mget_read_result			1.926

    The not so surprising thing is that MD5 really is the worst performer. Hseih hashing has been documented as outperforming most methods so it is no great surprise that it is faster.

    It is surprising to me that the malloc() fetch method is behaving as well, and sometimes better, as the result method. Something is clearly up with that (seeing how I have spent zero time optimizing that bit of code I am not surprised). It could also be that because this is a single threaded test that the cost of malloc() really is negligible.

    For distribution only two methods are supported at the moment. The first is a simple modulo method and the second being a consistent distribution. The cost of each?

    Hsieh Modulas
    Testing generate_data				14.188
    Testing get_read					10.700
    Testing mget_read					1.831
    Testing mget_read_result			1.626

    Hsieh consistent
    Testing generate_data			        14.121
    Testing get_read					10.655
    Testing mget_read					1.832
    Testing mget_read_result			1.603

    The cost for picked one method over the other is zero. Right now the consistent distribution has a built in compiled limit of 1024 servers (largest cluster I know of right now is right around 200 servers with an average of 600 connections (bigger may exist, I do not often ask)). The big advantage with the consistent hash is that node addition has a low cache miss rate (which will drop to zero once the replica code is completed). I think that there is a strong argument in these numbers to change the default method over to consistent distribution. The modulas method has an advantage that no consistent calculation has to be made (which is probably going to be significant when you get into the many thousands of servers).

    One other note about current performance. What if I had used the asynchronous protocol to generate the above information?

    Testing generate_data					 3.253

    Clearly asynchronous writes are much faster. For reads I am not seeing this but them mget uses a pipeline method to send keys so the differences between the two should not be that extreme.

    If you want to run these tests on your own, just run:
    ./test/testapp generate

    Or if using code from the repository:
    ./test/testapp generate*

    Version .12 has more tests related to performance in it, hence the * will pick those out of what is run for each "make test".

    Have fun!
  • Subscribe
    • Post a new comment


      Comments allowed for friends only

      Anonymous comments are disabled in this journal

      default userpic

      Your reply will be screened

      Your IP address will be recorded