Column Stores, Drizzle, Search For

« previous entry | next entry »
Nov. 1st, 2008 | 10:03 am

Last week when I commented on Directions in Database Technology and mentioned " Column stores will continue to evolve". I received a number of comments via IM, Twitter, and email from folks who wanted to know more about column stores (both in how they relate to Drizzle and their usage in general).

Very early on when we started work on Drizzle the plan was to focus web applications. When we looked at cutting features, one of the criteria was "is this needed for web deployment". In many cases we have leaned toward keeping functionality when it was clearly well designed and had a general usefulness. To give an example, ROLLUP for instance is not typically used for web applications, but it is a well written feature that provides us with functionality that we find is handy.

Rollup though is a feature I would typically group in the "Data Analytics" area. Did we keep it?

Yes, because it is useful in a general sense even if you are not doing data analytics (I also find it to be a gem that few MySQL DBAs know).

Early on with Drizzle I tried to discourage innovation outside of the web stack, but that has proven to be futile. The fact is, we provide a micro-kernel, and users will find uses for it. To me the core of what Drizzle is, is the micro-kernel. Anything other then the Micro-kernel is service, and these are required to build solutions. Trying to direct innovation is frankly something I should have known better then to try to do.

The short of this is that we will tackle data analytics in our own manner, and today that means we will eventually adopt a column store. Like map/reduce, column stores are one of the inevitable trends.

In the open source world, this means Infobright right now. If you look at Infobright, which has yet to be well known in open source circles, you see a concrete example of a column store which is well purposed. It is built on top of MySQL, but has its own enhanced parser for data analytics (the basic MySQL/Drizzle optimizer is poorly designed for this sort of work). To really get good performance you have to go the route that Infobright went in replacing the optimizer (the value add for "just an engine" is small, you really do need something more).

At some point I believe we will tackle those types of changes for our optimizer but I don't see the point in it right now. We aren't out to replace SQLite or Postgres, why fill a niche that Infobright already does well?

So then, what is the future of the column store as relates to Drizzle?

I believe the second most important decision we will make long term for engines is going to be which column store we pick up on. I suspect we might even need two.

Why two?

It is obvious that we will need one for data analytics. Using standard OLTP designs for data analytics does not work. This though is not our focus, so it is a long term need, not a short term one.

My interest is in one for shared nothing cloud services (which is in my personal area of interest). The contender for that at the moment looks to be HyperTable, but my opinion there is based on back of the napkin conclusions. We have to do an integration in order to determine if it pans out (and there are attempts right now to do this). There seems to be a number of groups interested in this, so I know it will happen.

As much as column stores are useful for data analytics, and probably required at this point, I believe there is a larger need for them in the space of cloud computing. They have a natural ability to scale out and I believe this will be key for the semi-structured nature that we see most often in Web Application data. While I expect setups of single node Drizzle databases, I also believe that we will need shared storage backends. These will obviously not be for OLTP uses in the beginning.

Skip ahead into the future though and the nature of MVCC design though, plus an optimistic optimizer, should allow engineers to eventually build out OLTP systems with shared nothing backends that make use of column stores. This is not on our current roadmap, but it is also not hard to see where the future might just go.

UPDATE Several people have made mention of LucidDB as being an open source column oriented database. I've only barely looked at it, so I can't say much about it.

Link | Leave a comment | Share

Comments {11}

Here's another good reason

from: xaprb
date: Nov. 1st, 2008 01:16 pm (UTC)
Link

Here's another good reason to add column stores, plural: the current optimizer, in fact most of the server in my fairly limited experience, is solidly row-oriented. Adding in column-ability means the optimizer and a bunch of other stuff that might otherwise be coded in will have to become pluggable, or whatever the buzzword of the day is. And if you only do this for one storage engine, a lot of the things that ought to be factored out won't be; doing something for more than one interfacing bit of code always points out things that ought to be cleaned and separated.

Reply | Thread

Brian "Krow" Aker

Re: Here's another good reason

from: krow
date: Nov. 2nd, 2008 01:04 am (UTC)
Link

Actually... MySQL has always been able to do covering indexes. So supporting a column store from the standpoint of row retrieval is pretty easy. There are flags to determine read/write cases for column during a query.

Reply | Parent | Thread

distributed?

from: burtonator
date: Nov. 2nd, 2008 04:52 am (UTC)
Link

I think in this sense you mean a distributed column store.

Hypertable being the example you use?

Couldn't NDB also provide similar behavior? Since it's in memory based NDB erases some of the pros/cons of column stores vs row stores.

NDB would obviously be easier to port (I would think) especially as it stabilizes a bit more.

If Hypertable approaches the flexibility of a real world BigTable then it would be a win over NDB I would think.. especially if they implement the memory pinning features of BigTable.

Kevin

Reply | Thread

Brian "Krow" Aker

Re: distributed?

from: krow
date: Nov. 2nd, 2008 01:34 pm (UTC)
Link

I am not sure which is more stable, NDB or HyperTable. NDB tends to only be focused on fixing bugs for customers, so I am hoping HyperTable will be more responsive to community users.

The in-memory part is a huge downside for me, since we don't want to be constrained by memory.

Reply | Parent | Thread

Re: distributed?

from: anonymous
date: Nov. 2nd, 2008 10:27 pm (UTC)
Link

Note that in current NDB non-indexed columns can be disk based and if you look very long into the roadmap this will be true for indexes as well.

However the interesting question is, whether 64 datanodes is enough for your cloud computing monster database? I guess it is one of those magic numbers that are easy to raise, but it just shows that while NDB is a great scale-out architecture, not everyone yet may be thinking of a database on thousands of nodes.

Reply | Parent | Thread

Brian "Krow" Aker

Re: distributed?

from: krow
date: Nov. 2nd, 2008 10:36 pm (UTC)
Link

The limit of 64 datanodes is just a compile option (and I believe they raised it in the recent telecom trees).

Everything can change with "future development". I look though at what can be done today.

Reply | Parent | Thread

Please do your homework before posting

from: jvdongen
date: Nov. 7th, 2008 12:58 pm (UTC)
Link

"In the open source world, this means Infobright right now". Well, not exactly, and surely not only Infobright. The Community Edition of Infobright has some severe limitations which makes it hardly usable for serious datawarehouse implementations. The most critical one is the removal of the DML capabilities from the CE. So you can load data, but you can do no updates and/or deletes. Then the alternatives: besides Infobright there's LucidDB (www.luciddb.org) and of course MonetDB (www.monetdb.com), probably the fastest analytical database on the planet, just look at the benchmarks (I can provide some more if you want to). All of these three have one major drawback however, and that's their MPP/Scale Out capabilities. There's none. So if you need MPP in an open source database, have a look at EnterpriseDB's PostgreSQL Plus with the GridSQL option. Currently, there's no open source database that will serve all your analytical needs, so the choice is to either combine software yourself or go for an appliance like Greenplum's that's using PostgreSQL as the core storage engine but added MPP capabilities, included Project R for analytics and integrated M/R algorithms for the heavy lifting. And yes, there's some value in M/R, but you'd still need SQL capabilities since M/R will only solve some of your problems.

Reply | Thread

Brian "Krow" Aker

Re: Please do your homework before posting

from: krow
date: Nov. 7th, 2008 09:35 pm (UTC)
Link

Taking Postgres and extending it in a proprietary manner does not result, no matter how well it performs, in the final product being open source.

Greenplum is not open source, and the same goes for EnterpriseDB.

Reply | Parent | Thread

Re: Please do your homework before posting

from: jvdongen
date: Nov. 7th, 2008 11:11 pm (UTC)
Link

Well, perhaps you need to do your homework as well then. EnterpriseDB's PostgreSQL Plus is Open Source, as is GridSQL (as of July this year). The only proprietary version of EnterpriseDB is the Advanced Server, but some parts of that (include GridSQL) are already released under GPL. Have a look at their website and check it out! Indeed, Greenplum isn't OS, but then again, I never said it was, just that it's a cost effective solution due to the fact that they're leveraging OS technology in a successful way.

Reply | Parent | Thread

Brian "Krow" Aker

Re: Please do your homework before posting

from: krow
date: Nov. 7th, 2008 11:30 pm (UTC)
Link

First "Well, perhaps you need to do your homework as well then", please lose the attitude. You will notice I updated the blog entry about LucidDB.

Next, poking around the Enterprise DB website it is not clear to me that anything other then "Postgres Core 8.3", which I suspect is the main PG db, is open source. The entire site is very hard to follow. If or if not something is open source seems to be a bit obfuscated through out the site.

As far as Greenplum goes, "leveraging OS technology in a successful way" does not mean you need to attach open source to it in any manner. MS uses open source libraries but they do not BS folks around whether or not they are open source. Greenplum, from everything I have seen and have been told, is a great product. You don't need to try to draw some line between it and open source to validate its existence. The market does a good job of that on its own. If I look around at what is provided in the open source world today, I see nothing that currently has the feature set of Greenplum. Really, there is no reason to try to have if masquerade as open source.


Reply | Parent | Thread

Re: Please do your homework before posting

from: jvdongen
date: Nov. 8th, 2008 08:51 am (UTC)
Link

I'm really sorry about the attitude thing; wasn't intended and also very stupid. I'll take better care next time.

Good point on Greenplums OS claims but then again, it's great technology! I agree that more and more companies are adding the "New and Improved, Now with Open Source!" label to their product to get more attention, or deliver a crippled "Community Edition" (like InfoBright!) which is nice to play with but useless in production environments.

Reply | Parent | Thread