Column Stores, Drizzle, Search For

Nov. 1st, 2008 | 10:03 am

Last week, when I commented on Directions in Database Technology and mentioned that "Column stores will continue to evolve", I received a number of comments via IM, Twitter, and email from folks who wanted to know more about column stores (both how they relate to Drizzle and their usage in general).

Very early on, when we started work on Drizzle, the plan was to focus on web applications. When we looked at cutting features, one of the criteria was "is this needed for web deployment?". In many cases we have leaned toward keeping functionality when it was clearly well designed and generally useful. ROLLUP, for instance, is not typically used for web applications, but it is a well-written feature that provides functionality we find handy.

ROLLUP, though, is a feature I would typically group in the "Data Analytics" area. Did we keep it?

Yes, because it is useful in a general sense even if you are not doing data analytics (I also find it to be a gem that few MySQL DBAs know).
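For readers who have not used it: in MySQL, `GROUP BY site, country WITH ROLLUP` returns the usual grouped rows plus extra subtotal and grand-total rows. The sketch below only mimics that behavior in plain Python to show the shape of the result. The `rollup` function and the `page_views` data are hypothetical, invented for illustration; this is not how MySQL or Drizzle implements the feature.

```python
from collections import defaultdict

def rollup(rows, group_cols, value_col):
    """Sum value_col per group, plus subtotals for every prefix of group_cols.

    A rolled-up column is marked with None, much as ROLLUP output marks it
    with NULL.
    """
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        # Credit the full key, then each shorter prefix, down to the grand total.
        for depth in range(len(group_cols), -1, -1):
            padded = key[:depth] + (None,) * (len(group_cols) - depth)
            totals[padded] += row[value_col]
    return dict(totals)

# Hypothetical data, invented for this example.
page_views = [
    {"site": "blog", "country": "US", "hits": 10},
    {"site": "blog", "country": "DE", "hits": 4},
    {"site": "shop", "country": "US", "hits": 7},
]

result = rollup(page_views, ["site", "country"], "hits")
for key, hits in result.items():
    print(key, hits)
```

The rows keyed with `None` are the ones a plain GROUP BY would not give you: one subtotal per site and one grand total.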

Early on with Drizzle I tried to discourage innovation outside of the web stack, but that has proven to be futile. The fact is, we provide a micro-kernel, and users will find uses for it. To me, the core of what Drizzle is, is the micro-kernel. Anything other than the micro-kernel is a service, and these are required to build solutions. Trying to direct innovation is frankly something I should have known better than to try to do.

The short of this is that we will tackle data analytics in our own manner, and today that means we will eventually adopt a column store. Like map/reduce, column stores are one of the inevitable trends.

In the open source world, this means Infobright right now. If you look at Infobright, which is not yet well known in open source circles, you see a concrete example of a column store which is well purposed. It is built on top of MySQL, but has its own enhanced parser for data analytics (the basic MySQL/Drizzle optimizer is poorly designed for this sort of work). To really get good performance you have to go the route that Infobright went in replacing the optimizer (the value add for "just an engine" is small; you really do need something more).

At some point I believe we will tackle those types of changes for our optimizer, but I don't see the point in it right now. We aren't out to replace SQLite or Postgres, so why fill a niche that Infobright already fills well?

So then, what is the future of the column store as it relates to Drizzle?

I believe the second most important decision we will make long term for engines is going to be which column store we pick up on. I suspect we might even need two.

Why two?

It is obvious that we will need one for data analytics. Using standard OLTP designs for data analytics does not work. This, though, is not our focus, so it is a long-term need, not a short-term one.
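A minimal sketch of why the storage layout matters for analytics, assuming a made-up three-column orders table (the names `rows`, `columns`, and both functions are hypothetical). A row layout keeps each record together, which suits OLTP lookups; a column layout keeps each attribute together, so an aggregate over one attribute only has to read that attribute.

```python
# Hypothetical orders table: (id, customer, amount).
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Column-oriented layout of the same data: one list per attribute.
columns = {
    "id": [r[0] for r in rows],
    "customer": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def total_row_store(records):
    # Walks every full record even though only one field is needed.
    return sum(record[2] for record in records)

def total_column_store(cols):
    # Reads only the "amount" column; the other columns are never touched.
    return sum(cols["amount"])

print(total_row_store(rows), total_column_store(columns))
```

Both functions return the same total; the difference is how much data each one has to walk through, which is what dominates once the table no longer fits in memory.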

My interest is in one for shared-nothing cloud services (my personal area of interest). The contender for that at the moment looks to be HyperTable, but my opinion there is based on back-of-the-napkin conclusions. We have to do an integration to determine whether it pans out (and there are attempts right now to do this). There seem to be a number of groups interested in this, so I know it will happen.

As much as column stores are useful for data analytics, and probably required at this point, I believe there is a larger need for them in the space of cloud computing. They have a natural ability to scale out, and I believe this will be key for the semi-structured data that we see most often in web applications. While I expect setups of single-node Drizzle databases, I also believe that we will need shared storage backends. These will obviously not be for OLTP uses in the beginning.

Skip ahead into the future, though, and the nature of MVCC design, plus an optimistic optimizer, should allow engineers to eventually build out OLTP systems with shared-nothing backends that make use of column stores. This is not on our current roadmap, but it is also not hard to see where the future might go.

UPDATE: Several people have mentioned LucidDB as an open source column-oriented database. I've only barely looked at it, so I can't say much about it.

Comments {11}

distributed?

from: burtonator
date: Nov. 2nd, 2008 04:52 am (UTC)

I think in this sense you mean a distributed column store.

Hypertable being the example you use?

Couldn't NDB also provide similar behavior? Since it's in-memory based, NDB erases some of the pros/cons of column stores vs. row stores.

NDB would obviously be easier to port (I would think) especially as it stabilizes a bit more.

If Hypertable approaches the flexibility of a real-world BigTable, then it would be a win over NDB, I would think, especially if they implement the memory pinning features of BigTable.

Kevin

Brian "Krow" Aker

Re: distributed?

from: krow
date: Nov. 2nd, 2008 01:34 pm (UTC)

I am not sure which is more stable, NDB or HyperTable. NDB tends to only be focused on fixing bugs for customers, so I am hoping HyperTable will be more responsive to community users.

The in-memory part is a huge downside for me, since we don't want to be constrained by memory.

Re: distributed?

from: anonymous
date: Nov. 2nd, 2008 10:27 pm (UTC)

Note that in current NDB, non-indexed columns can be disk based, and if you look far enough into the roadmap this will be true for indexes as well.

However, the interesting question is whether 64 datanodes is enough for your cloud computing monster database. I guess it is one of those magic numbers that are easy to raise, but it just shows that while NDB is a great scale-out architecture, not everyone may yet be thinking of a database on thousands of nodes.

Brian "Krow" Aker

Re: distributed?

from: krow
date: Nov. 2nd, 2008 10:36 pm (UTC)

The limit of 64 datanodes is just a compile option (and I believe they raised it in the recent telecom trees).

Everything can change with "future development". I look, though, at what can be done today.
