Assumptions, Drizzle

Oct. 22nd, 2008 | 11:00 am

What is the future of Drizzle? What sort of assumptions are you making?

  • Hardware

    On the hardware front I get a lot of mileage out of saying "the future is 64bit, multi-core, and runs on SSD". This is a pretty shallow answer, and is pretty obvious to most everyone. It suits a sound bite but it is not really that revolutionary a thought. To me the real question is "how do we use them?"

    64bit means you have to change the way you code. Memory is now flat for the foreseeable future. Never focus on how to map around 32bit issues and always assume you have a large, flat, memory space available. Spend zero time thinking about 32bit.

    If you are thinking "multi-core" then think about it massively. Right now adoption is at the 16-core point, which means that if you are developing software today, you need to be thinking about multiples of 16. I keep asking myself "how will this work with 256 cores?" Yesterday someone came to me with a solution to a feature we have removed in Drizzle. "Look, we removed all the locks!" The problem? The developer had used a compare-and-swap (CAS) operation to solve the problem. Here is the thing: CAS does not scale with the number of cores/chips that will be in machines. The good thing is the engineer got this, and has a new design :) We won't adopt short-term solutions that just kneecap us in the near future.
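    One common way to avoid a single contended memory location is to give every worker its own slot and only combine the slots when somebody reads. The sketch below shows that shape only. It is not the design the engineer came up with, Python does not expose a hardware CAS, and the names `ShardedCounter` and `run` are made up for illustration. In C or C++ the slots would also need padding so they do not share a cache line.

```python
import threading


class ShardedCounter:
    """Counter split into per-worker slots so workers never write the same slot."""

    def __init__(self, workers: int) -> None:
        # One slot per worker (hypothetical layout, chosen for illustration).
        self._slots = [0] * workers

    def add(self, worker_id: int, amount: int = 1) -> None:
        # Only worker `worker_id` ever writes this slot, so there is no
        # shared location for the workers to fight over.
        self._slots[worker_id] += amount

    def total(self) -> int:
        # Readers pay the cost instead: they sum every slot, and the value
        # may be slightly stale while writers are still running.
        return sum(self._slots)


def run(workers: int = 4, increments: int = 10_000) -> int:
    """Start `workers` threads that each bump their own slot `increments` times."""
    counter = ShardedCounter(workers)

    def work(worker_id: int) -> None:
        for _ in range(increments):
            counter.add(worker_id)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()
```

    The trade is deliberate: writes get cheap and independent, reads get more expensive. That only pays off when writes vastly outnumber reads.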

    SSD is here, but it is not here in the sizes needed. What I expect us to do is make use of SSD as a secondary cache, and not look at it as the primary at rest storage. I see a lot of databases sitting in the 20gig to 100gig range. The Library of Congress is 26 terabytes. I expect more scale up so systems will be growing faster in size. SSD is the new hard drive, and fixed disks are tape.
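    To make the "SSD as a secondary cache" idea concrete, here is a minimal sketch of a two-level cache. Everything in it is an assumption for illustration: the class name `TieredCache`, the slot counts, and the use of in-memory dictionaries to stand in for the SSD tier and for the fixed disk. It is not how Drizzle or any engine implements caching.

```python
from collections import OrderedDict
from typing import Callable, Dict


class TieredCache:
    """RAM first, then an SSD-like second tier, then the slow primary store."""

    def __init__(self, ram_slots: int, ssd_slots: int,
                 load_from_disk: Callable[[str], bytes]) -> None:
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.load_from_disk = load_from_disk  # stand-in for the fixed disk
        self.ram: "OrderedDict[str, bytes]" = OrderedDict()
        self.ssd: "OrderedDict[str, bytes]" = OrderedDict()  # stand-in for SSD
        self.hits: Dict[str, int] = {"ram": 0, "ssd": 0, "disk": 0}

    def get(self, key: str) -> bytes:
        if key in self.ram:
            self.ram.move_to_end(key)  # mark as most recently used
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.ssd:
            value = self.ssd.pop(key)  # promote back into RAM
            self.hits["ssd"] += 1
        else:
            value = self.load_from_disk(key)
            self.hits["disk"] += 1
        self._put_ram(key, value)
        return value

    def _put_ram(self, key: str, value: bytes) -> None:
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            # The least recently used RAM entry spills to the second tier.
            old_key, old_value = self.ram.popitem(last=False)
            self.ssd[old_key] = old_value
            if len(self.ssd) > self.ssd_slots:
                # The second tier is a cache too: its oldest entry is dropped.
                self.ssd.popitem(last=False)
```

    The point of the sketch is that the second tier never has to hold the whole data set. It only has to catch what falls out of memory.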

    The piece that I have commented least on is the nature of our micro-kernel. We can push pieces of our design out to other nodes. I do not assume Drizzle will live on a single machine. Network speed keeps going up, and we need to be able to tier the database out across multiple computers.

    One final thought about hardware: we need 128bit ints. IPv6, UUID, etc., all of these types mean that we need single-instruction operators for 16byte types.
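    A small sketch of why a 16-byte integer is the natural shape for these types. Python integers are arbitrary precision, so this says nothing about single-instruction hardware support. It only shows that once an IPv6 address or a UUID is one integer, comparing and masking become one operation on one value. The helper names are made up for illustration.

```python
import ipaddress
import uuid

MASK_128 = (1 << 128) - 1  # all 128 bits set


def ipv6_to_int(text: str) -> int:
    """Parse an IPv6 address into a single 128-bit integer."""
    return int(ipaddress.IPv6Address(text))


def uuid_to_int(text: str) -> int:
    """Parse a UUID into a single 128-bit integer."""
    return uuid.UUID(text).int


def to_16_bytes(value: int) -> bytes:
    """Fixed-width, big-endian storage form: always exactly 16 bytes."""
    return value.to_bytes(16, "big")


def in_prefix(addr: int, network: int, prefix_len: int) -> bool:
    """Check whether an address falls in a network, using one mask and one compare."""
    mask = (MASK_128 << (128 - prefix_len)) & MASK_128
    return (addr & mask) == (network & mask)
```

    Without a native 16-byte type, each of these operations has to be split across two 64-bit halves by hand.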

  • Community Development

    Today 2/3 of our development comes from outside of the developers Sun pays to work on Drizzle. Even if we add more developers, I expect our total percentage to decrease and not increase. I believe we will see forks and that we have to find ways to help people maintain their forks. One very central piece of what we have to do is move code to the Edge, aka plugins. Thinking about the Edge has to be a shared value.

    I see forks as a positive development; they show potential ways we can evolve. Not all evolutionary paths are successful, but it makes us stronger to see where they go. Long term, I expect groups to make distributions around Drizzle. I don't know that we will ever do that ourselves.

    Code drives decisions, and those who provide developers drive those decisions.

    While I started out focusing Drizzle on web technologies, we are seeing groups showing up to reuse our kernel in data warehousing and handsets (which is something I never predicted). By keeping the core small we invite groups to use us as a piece to build around.

    Drizzle is not all about my vision, it is about where the collective vision takes us.

  • Directions in Database Technology

    Map/Reduce will kill every traditional data warehousing vendor in the market. Those who adapt to it as a design/deployment pattern will survive, the rest won't. Database systems that have no concept of running on multiple nodes are pretty much dead. If there is no scale-out story, then there is no future going forward.
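    For readers who have not seen the pattern, here is a minimal single-process sketch of map/reduce as a way to count rows per key. The partitions stand in for data held on separate nodes. The function names are hypothetical, and a real deployment would run the map step on the nodes themselves and ship only the partial results.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_partition(rows: Iterable[str]) -> Counter:
    # "Map" step: each node summarizes only the rows it holds locally.
    return Counter(rows)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    # "Reduce" step: merge two partial summaries into one.
    return left + right


def count_by_key(partitions: List[List[str]]) -> Counter:
    # In a real system each map_partition call would run on a different node.
    partials = [map_partition(p) for p in partitions]
    return reduce(reduce_counts, partials, Counter())
```

    Because the reduce step only ever sees small summaries, adding nodes adds capacity without moving the raw data. That is the scale-out story.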

    The way we store data will continue to evolve and diversify. Compression has gotten cheap and processor time has become massive. Column stores will continue to evolve, but they are not a "solves everything" sort of solution. One of the gambles we continue to make is to allow for storage via multiple methods (we refer to this as engines). We will be adding a column store in the near future; it is an important piece for us to have. Multiple engines cost us in code complexity, but we continue to see value in it. We will, though, raise the bar on engine design in order to force the complexity of this down to the engine (which will give us online capabilities).

    Stored procedures are the dodos of database technology. The languages vendors have designed are limited. By the same token though, putting processing near the data is key to performance for many applications. We need a new model badly, and this model will be a pushdown from two different directions. One direction is obvious, map/reduce; the other direction is the asynchronous queues we see in most web shops. There is little talk about this right now in the blogosphere, but there is a movement toward queueing systems. Queueing systems are a very popular topic in the hallway tracks of conferences.
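    The asynchronous queue pattern, reduced to its smallest form: the caller hands work to a queue and moves on, and a worker applies the processing later. This is a single-process sketch with invented names (`process_async`, `handler`). The queueing systems seen in web shops are separate networked services, which this does not attempt to model.

```python
import queue
import threading
from typing import Callable, List


def process_async(items: List[int], handler: Callable[[int], int]) -> List[int]:
    """Enqueue every item, let one background worker handle them, return results."""
    jobs: "queue.Queue" = queue.Queue()
    results: List[int] = []

    def worker() -> None:
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more work is coming
                jobs.task_done()
                break
            results.append(handler(job))
            jobs.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    for item in items:
        jobs.put(item)  # returns at once; the caller never waits on the handler
    jobs.put(None)

    jobs.join()    # wait here only because this sketch wants the results back
    thread.join()
    return results
```

    With a single worker the results come back in the order the jobs were queued. With several workers that ordering guarantee goes away, which is one of the things a new model would have to be explicit about.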

    Databases need to learn how to live in the cloud. We cannot have databases be silos of authentication, processing, and expect only to provide data. We must make our data dictionaries available in the cloud, we need to take our authentication from the cloud, etc...

    We need to live in the cloud.

    Engines, On the State of

    Oct. 13th, 2008 | 09:31 am

    So many engines, and so little to choose from. This is one of our two major decision points in Drizzle right now.

    Let me explain.

    Today we have Innodb, Maria, Falcon, and PBXT.

    Simple?

    Not really. Innodb is not a single engine, it is three engines. We have the default one which is shipped. It has been the wunderkind for years now but has been showing its age. Go buy a piece of hardware that has four cores and it quickly becomes apparent that it is not aging well. There is the Innodb plugin, and while it delivers on features, performance still evades it. Both are works of the Innodb team at Oracle. The development style for Innodb has never been open, but they have always consistently delivered. Right now though? This delivery seems to be slowing. Since they do not function in an open model it is very hard to work with them. This means we have to shoulder most of the work, though the Innodb team has been responsive to questions.

    We have the Innodb produced by Google. It is of the standard design, but has been modified with performance patches. These are widely believed to deliver, and often do show, performance increases on hardware above four cores. The issues around this engine are more about maintenance. Google is happy to drop its patches out the door, but shows no sign of wanting to bundle these into a release. This makes perfect sense; they aren't in the business of releasing databases. The Google developers are doing a good job of getting their patches out in chunks and seem genuinely interested in getting them into trees (though they themselves do not do this work). They are not a committed team, though; they are a group focused inwards who get open source enough to understand that publishing their patches is a good thing. There may be an answer in looking at Percona's builds; this is an unexplored option at this point. They have been doing releases with the Google Innodb code. Their development model is not open. They do have an outward-facing view of the world, though, since they work as consultants.

    Maria continues to move along, but it is not transactional at this point. This makes it a non-starter. When they get it working, then it gets a ticket to the ballpark. It also hooks in deeper to the server than any of the other engines (aka bypasses the engine interface). It relies on the mysys library that MySQL ships. This makes it more difficult for us to work with, though all problems are solvable. It is not being developed at a very quick pace.

    Falcon has been released in the Alpha 6.0 MySQL tree. It is alpha, though, and has not been shown to perform well in general against Innodb. It keeps going through design changes so it is not really a contender for use at this point. On the plus side for me, it keeps to itself and the code is distributed as a complete library, which means that if we did integrate it into Drizzle it would be relatively simple. It has an active development team. To this date though we have not worked with them at all.

    PBXT has shown steady improvement over time. It is hard for me to gauge at this point where it is in its development cycle. We have just pulled it into Drizzle recently and we know it fails some of our tests (keep in mind, the test system is only designed to test MyISAM; we have found bugs galore in shifting to Innodb as the default engine). Right now its design lends itself more to performance around indexes. Scans are still a performance bottleneck. This might be fine in our world, since for the web you typically only read from indexes. It does require row-based replication and this is at issue in the server in general (someday soon there will be a long blog post by me on the sorry state of replication). Paul, the main developer, has been very active though and this wins big kudos from me personally. I would rather work with active developers and help them fix their work, and skip working with folks who are not so active.

    So this is the state of it. I have a few other random thoughts, but at the moment I am left with the question of "what to do in the future". We have had a few attempts at merges from the different Innodb trees, but so far none of these have been completed. PBXT is moving along well and we have begun to take patches from Paul to help him, and us, with testing. A couple of the Falcon folks have approached me about getting a tree working with their engine, but nothing has come of that. If the Maria team can kick out a better MyISAM I am open to replacing ours, though this is not a priority.

    Paul's recent changes make it much easier for us to maintain an active PBXT tree and Innodb tree.

    So what is the future?

    I am not sure at this juncture. We will continue down the path of trees for PBXT and Innodb. Those are the contenders at this point and no matter the performance issues with Innodb, it is prudent to keep it around because of its stability.

    Next year though? I am not sure.

    Next year is coming quickly though.


    Drizzle talk from MySQL Developer's Conference

    Sep. 22nd, 2008 | 07:47 pm





    Drizzle talk for MySQL Developer's Meeting.

