Myths, GUID vs Autoincrement

« previous entry | next entry »
Mar. 10th, 2007 | 05:56 pm

I am a fan of using GUIDs as primary keys. They are database
independent, object independent and I can either create them in my
application or in the database (MySQL since 5.0 has had a UUID()
function but since it does not replicate I don't use it normally...
this has been fixed in 5.1 though).

What I have heard for years was that Innodb didn't handle them very
well. I have had my doubts, but never taken the time to see if this
was true or not.

This week I finally got around to putting this to a test.

/mysqlslap --concurrency=1,50,100,150,200,250,300 --iterations=10 --
number-int-cols=2 --number-char-cols=3 --auto-generate-sql --auto-
generate-sql-add-autoincrement --auto-generate-sql-load-type=update --
auto-generate-sql-execute-number=10000 --auto-generate-sql-unique-
write-number=10 --auto-generate-sql-write-number=100000 --csv=/tmp/
innodb-auto.csv --engine=blackhole,innodb

So 10K of rows, with increasing numbers of threads making
modifications. Unlike a lot of tests I do this is a growth test...
aka as a site would grow this is what you would see. Normally when I
look at how SMP machines handle problems I am more interested in
seeing how it would split the problem with more threads and more CPU.
In this case I wanted to look at just how the database would behave
as a site grew.

The test I am looking at is primary key focused, and its all updates.
So a read is occurring and a write follows up.

As you can tell from the graph, the overhead, as seen via blackhole,
is nearly identical in cases where UUID() was used and Autoincrement
was used. The only issue I ran into was that Innodb started getting
deadlocks when I cranked up the connections to 300 on my quad PPC
running Ubuntu. I am going to play with this a bit further and see if
Innodb would have similar issues on a dual processor. The binlog was
not enabled during any of these tests so the row locks that we see
for updates when running with statement based replication won't be
seen (but I really want to see how that effects things!).

Just to confirm results I am going run the same test but just do key
lookups and see if any difference pops up. I am doubting it will but
you never know until you try :)
Update Auto vs GUID Innodb.jpg

Link | Leave a comment | Share

Comments {32}

Lover of Ideas

(no subject)

from: omnifarious
date: Mar. 11th, 2007 02:16 am (UTC)
Link

That's good to know. I've never really thought of UUIDs as database keys, but it makes a whole lot of sense now that you mention it. UUID's are the strength of monotone, Mercurial and various other version control systems of the same ilk. Versions are essentially named by UUID so when time comes to merge to different version histories the version IDs are all unique and it's almost trivial.

Reply | Thread

Lover of Ideas

(no subject)

from: omnifarious
date: Mar. 11th, 2007 02:17 am (UTC)
Link

Maybe I should patch SQLObject to support UUIDs for id fields instead of auto-increment fields.

Reply | Parent | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Mar. 11th, 2007 03:22 am (UTC)
Link

What we, as in MySQL, should do is support these natively... just need to get around to doing a patch to make this happen.

Reply | Parent | Thread

inanna

(no subject)

from: inanna
date: Mar. 11th, 2007 05:27 am (UTC)
Link

You know... i know all the words you use... i even know some of the alphabet soup... but the way you put them together makes my brains fuzzy. But i read every word and look up the things i don't know. Someday i'll understand it. Or maybe my head will explode. At this point it is a toss-up, but i am leaning towards the latter option as being more likely. :)

Reply | Thread

Not your god

(no subject)

from: notagod
date: Mar. 14th, 2007 12:02 pm (UTC)
Link

Is it sad that I implicitly understand, have recently been contemplating such a move, and am interested in this post (and am a walkin from elsewhere) :)

Reply | Parent | Thread

inanna

(no subject)

from: inanna
date: Mar. 14th, 2007 04:22 pm (UTC)
Link

No... it isn't sad... it is a sign that the obsessions shared by some of my friends are not anomalous, but rather a new form of conventionality.

Reply | Parent | Thread

(no subject)

from: jamesd
date: Mar. 11th, 2007 05:52 am (UTC)
Link

How many times bigger than the RAM in the server was the table? Effectively random inserts to a working set bigger than buffer pool (with little OS caching RAM available also) is the main issue.

A related issue is that with the changes evenly spread and all fitting in RAM, InnoDB can try to flush a lot of them to disk at the same time because of the last modified time similarity, creating a load surge and unresponsive server. Fixing this is on Heikki's to do list.

After that comes the extra size of the UUID in the secondary indexes and the reduced number of entries that fit in RAM as a result.

Reply | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Mar. 11th, 2007 07:35 am (UTC)
Link

Hi!

Going from 8 bytes to 32 bytes is significant, and yes having a native type would decrease the foot print... but I don't think this matters that much. Yes, its more memory, but when I look at the advantages for certain applications... I'll take the memory hit.

Personally I am look for scale solutions for everything I am doing. Less concerned about memory issues for the primary key... more concerned about how I can add machine after machine and not have to be concerned about reshuffling key solutions.

Reply | Parent | Thread

(no subject)

from: jamesd
date: Mar. 11th, 2007 10:11 am (UTC)
Link

I agree with all you wrote.

How big was the 10,000 row table, in GB? How big was the InnoDB buffer pool? How much RAM in the system?

Reply | Parent | Thread

Don't use UUID... use SHA1

from: burtonator
date: Mar. 11th, 2007 08:31 am (UTC)
Link

Don't use UUID. use truncated SHA1.....

You can truncate the first 64 bits and get 2^32 entries per BOX before you have a collision.... Then you can route queries and build out federations.

Kevin

Reply | Thread

Brian "Krow" Aker

Re: Don't use UUID... use SHA1

from: krow
date: Mar. 11th, 2007 08:59 am (UTC)
Link

Heh, goes to show how different our applications are :)

My crawler is well past 2^32 and growing on individual boxes.

Now could Slashdot do 2^32 and get away with it? Completely. Comments+stories+users is no where near this. But something like LiveJournal, or Technorati? Not even enough.

I'm big on being able to move data around with out having to rekey it... make it unique from the beginning so that I never have to worry when I partition that I could get collisions (and yes, I know its theoretically possible with UUID but I have never seen it happen).

One common mistake I see with sites is that they from the beginning create keying nightmares for themselves. There are a number of big sites that still rely on one master database in the core of their architecture. I really think that is a bad idea.

Reply | Parent | Thread

Re: Don't use UUID... use SHA1

from: jamesd
date: Mar. 11th, 2007 10:32 am (UTC)
Link

Not just big either. It's a fairly common theme to see people hitting the single box wall because they didn't architect so it wasn't there. Maybe we should promote one of the VM solutions so people know they can put a multi-server scale-out architecture on one server until they grow. :)

Reply | Parent | Thread

Re: Don't use UUID... use SHA1

from: burtonator
date: Mar. 11th, 2007 10:52 am (UTC)
Link

Seriously. you have tables with 4B rows?

If you're not worried about UUID collision then there's no reason you should worry about hash collision.

At least with hashes you can route the data.

And yes... with Live journal and Technorati it's enough. Google uses this technique.

But yeah.... using a centralized sequence generator is a REALLY bad idea.

Reply | Parent | Thread

Brian "Krow" Aker

Re: Don't use UUID... use SHA1

from: krow
date: Mar. 11th, 2007 04:50 pm (UTC)
Link

Yes, I do, and yes it works. I would need to double check this, but if you look at archive_test in storage/archive you will see that it generates this many rows on its long test.

I can think of a couple of MySQL customers who hit the 4 billion mark sometime ago (since I remember fixing the bugs a few years ago!). I've never tested MySQL to the full 2^64 though. The crash-me framework might, I've not looked at it in a while.

LJ I believe uses a central key generator, not sure though but I thought I remember this being the case.

You really think that 2^32 is enough? Well the one thing with that though is that you won't be able to mix your data sets from multiple machines and not fear a class.

Reply | Parent | Thread

........

from: burtonator
date: Mar. 11th, 2007 08:31 am (UTC)
Link

Oh..and DEF thanks for these graphs :)

Reply | Thread

(no subject)

from: erikwett
date: Mar. 11th, 2007 02:39 pm (UTC)
Link

I have had some serious performance problems with GUID's in MS SQL Server environment.

The scenario was a log table with GUID as the primary key and 3-4 secondary indexes. Since SQL Server by default clusters a table on the primary key and the secondary indexes contain the primary key values the result was that indexes took very much space.

I reorganized the table with a numeric auto-generated key instead and it shrunk by about 50%, which improved throughput a lot. Since the table was a log table with no need for global id's this was not a problem.

I believe InnoDb use the same structure with tables clustered on the primary key, which would make this relevant. Are there any secondary indexes in your table?

Erik

Reply | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Mar. 11th, 2007 04:53 pm (UTC)
Link

No, this test was just about primary key lookup performance. From your description I would say that yes the two look similar in design.

It was a log table? I take it that you couldn't create a composite key from multiple attributes?


But no, this was not about secondary indexes, just about primary. The customer that got me thinking about this just uses primary indexes, not secondary.

Reply | Parent | Thread

(no subject)

from: jamesd
date: Mar. 12th, 2007 05:18 am (UTC)
Link

InnoDB shares the issue. Designing for cache efficiency matters a lot if you're after the best performance.

Reply | Parent | Thread

Абу Антось

(no subject)

from: syarzhuk
date: Mar. 25th, 2007 01:24 am (UTC)
Link

Why not cluster on datetime column?

Reply | Parent | Thread

midom

location matters

from: midom
date: Mar. 12th, 2007 01:46 pm (UTC)
Link

inserting id after id provides one with records grouped together by their entry date. that means more efficient time-wise range scans and less fragmentation / page splits.

of course, inverse UUID (with host/time being in front, and random bits somewhere back) could be quite efficient. but putting random value leftmost is a certain no-no.

Reply | Thread

peter_zaitsev

(no subject)

from: peter_zaitsev
date: Mar. 13th, 2007 07:33 pm (UTC)
Link

Brian,

I think you're missing the point of UUID vs auto_increment

When you're working with tiny data set this is simply not where overhead is. When data set is much larger than amount of memory you get a lot of problems with UUID because they are inserted in random BTREE location which is a lot of random IO.

I'll post some benchmarks about it soon :)

Reply | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Mar. 13th, 2007 07:50 pm (UTC)
Link

I'm working on a benchmark for secondary at the moment :)

The point really is about scale out though, and removing the need for you to have to worry about object conflict.

Reply | Parent | Thread

peter_zaitsev

(no subject)

from: peter_zaitsev
date: Mar. 13th, 2007 08:38 pm (UTC)
Link

Well it is different point. You've started with performance thing here.

If you want to scale out and performance goal is secondary (ie you do not care if you would need 10 servers of 15 for the same stuff) UUID is good way to go.

Normally it is not that hard to generate sequences in the application which allow you to scale out while fitting in 64bit and maintaining uniqueness and locality for efficient btree handling.

I remember Monty even wanted to implement that on MySQL server with OID.

Reply | Parent | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Mar. 13th, 2007 08:43 pm (UTC)
Link

This was opened up because of a customer who was told "primary key lookup with GUID will be too slow". Turns out its not at all.

Memory usage is different, and secondary lookup when the primary key is a GUID may also be an issue (which I am going to test).

Generating sequences introduces a bottleneck (even if you create servers that do this via range). Personally I don't like to rely on this.

Monty has been bugging me to push my UUID field type, I'll do it sometime in the next few months.

Reply | Parent | Thread

peter_zaitsev

(no subject)

from: peter_zaitsev
date: Mar. 13th, 2007 09:02 pm (UTC)
Link

I see. That is important explanation. UUID point lookup by primary key will not be slower in most cases - if you lookup by random values there is little difference and it mainly happens in very niche case of BTREE would fit in memory for int but would not for UUID.

Reply | Parent | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Mar. 13th, 2007 09:03 pm (UTC)
Link

This customer was being told that UUID lookup would be much slower then autoincrement... I didn't know the answer. I thought it would be the same, and it turned out to be the same.

On an interesting note, this is a good way to find out if an engine has a crappy autoincrement design. If GUID insert is noticeably faster... then someone has messed up the autoincrement design.

Reply | Parent | Thread

peter_zaitsev

(no subject)

from: peter_zaitsev
date: Mar. 13th, 2007 09:15 pm (UTC)
Link

Right. Agree on storage engine testing.
Regarding lookups you need to make sure about key distribution for lookups to do benchmarks.

I've posted some of my comments on this topic on my blog:
http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/

Reply | Parent | Thread

Not your god

(no subject)

from: notagod
date: Mar. 14th, 2007 12:05 pm (UTC)
Link

It's interesting to see the timings, thanks for that. The only downside I see (in a current implementation in my current place of employment), are that simplistic database engines (read SQLite in this case), don't have types large enough to store UUID's as a native integer type, and so the programmer ends up writing them to a blob field. Wazoo pain that one. Luckily, sqlite will index them, but I know of several engines out there that won't :/ Interbase for one. We did however come up for a workable solution in the end tho ;) I guess you'd call them HalfUID's :) UUID's munged down to 32 bits.

Reply | Thread

Not your god

(no subject)

from: notagod
date: Mar. 14th, 2007 12:23 pm (UTC)
Link

Oh, and just to enlighten a bit, our use case scenario is this, so memory constraints and speed are very hign on the list, along side portability (PIM, document metainfo on local + removable storage, so globally unique is a design criteria).

Reply | Parent | Thread

leandro.myopenid.com

Don’t forget natural keys

from: leandro.myopenid.com
date: Mar. 14th, 2007 03:09 pm (UTC)
Link

Please don’t forget no table should lack a natural key.

Reply | Thread

Brian "Krow" Aker

Re: Don’t forget natural keys

from: krow
date: Mar. 14th, 2007 05:15 pm (UTC)
Link

Not all tables have them. For instance when logging requests may come in at the same second, from the IP address, on the same object.

It would be great if you could count on having one, but you can not.

Reply | Parent | Thread

Thank you

from: oyunstar
date: Aug. 25th, 2010 03:18 pm (UTC)
Link

Many thanks for your information

Reply | Thread