
Scaling, Database Direction, Macro and Micro Scaling

Sep. 21st, 2007 | 04:22 pm

Each year I pick a topic to explore for conferences. I look at trends, do some research, and I write a slide deck to give a talk.

Then the learning begins. As I go around the country, and the world, giving the talk, I get to hear from others. Learning from the collective lets me find new ideas and refine my own thoughts on the topic. Some ideas I hear over and over, and these bubble to the top.

This year's topic was scaling. To date I've given the "Scaling" talk as a keynote three times, and as a regular session another four times (and I owe an apology to at least two conferences I had to skip, or I would have delivered it another two times).

At the moment Architects are looking at two forms of scaling, Macro and Micro. Computing clouds, distributed processing systems, routing systems, and proxies are the macro scaling solutions. On the Micro side it is all about compare-and-swap operations, threading systems, asynchronous IO, and multi-CPU locking.

Macro scaling is in essence the problem of how to make use of lots of cheap commodity computers. Micro is about how to make use of the multi-core and multi-CPU systems which are now plentiful. I refer to these as multi-"way" machines, and they have become viable now that CPU address spaces are large enough to make use of all that memory.
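For a concrete, if toy, picture of the micro side, here is a small compare-and-swap sketch in Java. This is my own illustration using java.util.concurrent, not code from any particular database: several threads bump one counter without ever taking a lock.

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal sketch of the compare-and-swap style of micro scaling:
    // several threads update one counter with no lock.
    public class CasCounter {
        private final AtomicLong value = new AtomicLong(0);

        // Retry loop: read the current value, compute the new one, and
        // only publish it if nobody else changed it in the meantime.
        public long increment() {
            while (true) {
                long current = value.get();
                long next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;
                }
                // CAS failed because another core won the race; try again.
            }
        }

        public static void main(String[] args) throws InterruptedException {
            CasCounter counter = new CasCounter();
            Thread[] threads = new Thread[4];
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(() -> {
                    for (int n = 0; n < 100_000; n++) {
                        counter.increment();
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.println(counter.value.get()); // 400000, with no locks taken
        }
    }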

There is an ongoing interplay between distributed disk and local disk. It is becoming understood that distributed disks are viable, but only when seen as one piece in an overall caching layer. Local disk is the next layer in the cache, followed by RAM, and then CPU caches. Scaling Architects have been working and designing around the constraints of disk. Today the viable, which really means affordable, size of a local disk is around one terabyte. Data sets are localized based on either the size of the disk or the size of RAM.

Either way, problems are being broken up based on one of these two limitations. The trend shows that more memory will become available for RAM-based solutions, though growth will be slow. Disk, on the other hand, sits at a fork in the road. One path keeps the slowly evolving disks with traditional spindles, which remain slow for random access. The other path leads toward solid state disks.

Solid state disks today are smaller than conventional disks, somewhere in the neighborhood of 32 to 64 gigabytes. They will not reach the size of traditional disks for some time, nor will they completely replace traditional disks in all applications today. In the cases where they do, it will mean that Architects have decided to accept this as a limitation in Macro scaling. In other cases solid state disks will be used as another cache layer between traditional disk and RAM.

Either way, we can expect computing clouds and distributed processing nodes to be designed around these storage limitations, with distributed filesystems being used as a final cache for storage in some environments.

If I tie this back to the database world, it means that we are shaped by these decisions. We have to look at the scale of size, and design database systems which make full use of the provided resources. This means threading, distributing problems to nodes, and the specialization of databases around classes of problems.

Caching, transaction processing, analytics, and temporal solutions are how I am classifying these problems.

Caching means having quick access to data objects. The lookup path for these systems is key based: need an object, fetch it by key. The variant of this problem is when you need to look up a particular item based on time, i.e. you need objects from a particular point in time.
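Here is a small sketch of both lookup paths, again just an illustration in Java with made-up key names; the temporal map is shown for a single object to keep it short.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Sketch of the two cache lookup paths described above.
    public class ObjectCache {
        // Plain key-based path: need an object, fetch it by key.
        private final Map<String, String> byKey = new ConcurrentHashMap<>();

        // Temporal variant: versions of one object ordered by timestamp,
        // so we can ask for "the value as of time T".
        private final ConcurrentSkipListMap<Long, String> byTime = new ConcurrentSkipListMap<>();

        public void put(String key, String value, long timestampMillis) {
            byKey.put(key, value);
            byTime.put(timestampMillis, value);
        }

        public String get(String key) {
            return byKey.get(key);
        }

        // Latest version stored at or before the given point in time.
        public String getAsOf(long timestampMillis) {
            Map.Entry<Long, String> entry = byTime.floorEntry(timestampMillis);
            return entry == null ? null : entry.getValue();
        }

        public static void main(String[] args) {
            ObjectCache cache = new ObjectCache();
            cache.put("user:42", "v1", 1000L);
            cache.put("user:42", "v2", 2000L);
            System.out.println(cache.get("user:42")); // v2
            System.out.println(cache.getAsOf(1500L)); // v1
        }
    }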

Transaction processing is both the oldest and perhaps the most written about solution, and it is evolving. Transactions are now less defined by long-running problems, and more by short collections of queries around a single problem. Like the web, it is all about delivering a response to a single question.
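As a sketch of what such a short transaction looks like, here is a plain JDBC version; the connection URL and the accounts table are placeholders for whatever schema you actually have.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Sketch of a "short collection of queries around a single problem":
    // one request moves money between two accounts and commits.
    public class ShortTransaction {
        public static void transfer(String url, long from, long to, long cents) throws SQLException {
            try (Connection conn = DriverManager.getConnection(url)) {
                conn.setAutoCommit(false); // group the queries into one transaction
                try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                     PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                    debit.setLong(1, cents);
                    debit.setLong(2, from);
                    debit.executeUpdate();

                    credit.setLong(1, cents);
                    credit.setLong(2, to);
                    credit.executeUpdate();

                    conn.commit();     // answer the single question and finish
                } catch (SQLException e) {
                    conn.rollback();   // short transactions make rollback cheap
                    throw e;
                }
            }
        }
    }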

Analytics is the need to process the vast stores of data that are being generated. In Macro scaling this is being tied to distributed processing systems. Stores of data exist, and the problem is to read the collection of data to find aggregate results. Stonebraker's writings around column-oriented databases are reflective of this problem.
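A toy example of why the column layout matters for this kind of work; the table and numbers are invented.

    // Toy illustration of the column-oriented idea: an aggregate over one
    // column only touches that column's array, not whole rows.
    public class ColumnScan {
        public static void main(String[] args) {
            // Three columns of the same "orders" table, stored separately.
            long[] orderIds  = {1, 2, 3, 4};
            int[]  regionIds = {7, 7, 9, 7};
            long[] amounts   = {500, 250, 900, 100};

            // SELECT SUM(amount) FROM orders WHERE region_id = 7
            long sum = 0;
            for (int i = 0; i < amounts.length; i++) {
                if (regionIds[i] == 7) {
                    sum += amounts[i];
                }
            }
            System.out.println(sum); // 850; orderIds was never read
        }
    }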

The final leg of database solutions is the temporal problem: the need to collect data from a source and keep only a snapshot of the most recent collection of data. We see this need in real-time processing, where the goal is to know about changes in stocks, movement of weather, and real-time queues for parallelizing data.
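A minimal sketch of that pattern, keeping only the newest reading per source; the sources and readings here are made up.

    import java.util.concurrent.ConcurrentHashMap;

    // Temporal pattern: for each source (a stock symbol, a weather station)
    // keep only the most recent snapshot, discarding older data as new
    // readings arrive.
    public class LatestSnapshot {
        private final ConcurrentHashMap<String, Double> latest = new ConcurrentHashMap<>();

        public void record(String source, double reading) {
            latest.put(source, reading); // overwrite: only the newest value survives
        }

        public Double current(String source) {
            return latest.get(source);
        }

        public static void main(String[] args) {
            LatestSnapshot snapshots = new LatestSnapshot();
            snapshots.record("ACME", 31.20);
            snapshots.record("ACME", 31.45);               // replaces the earlier reading
            System.out.println(snapshots.current("ACME")); // 31.45
        }
    }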

What does this mean for Architects? The four solutions require Architects to design a waterfall system where data flows from one system to the next.

For those writing databases it means specializing solutions and moving away from monolithic approaches to solving data storage problems (database vendors are fabulous at regurgitating their single solutions over and over again, beyond the stretch of their systems' original design). The prolific explosion in the open source world around databases is a sign of this.

Micro scaling means that the ante has been raised for writing database systems. Threaded approaches are required to make use of current hardware; single-threaded solutions are now a waste of resources.
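As a rough sketch of what that means in practice, here is a scan spread across all available cores with a thread pool; the data is synthetic and the partitioning is deliberately simple.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Spread one scan across all available cores with a fixed-size thread
    // pool, instead of a single-threaded loop.
    public class ParallelSum {
        public static void main(String[] args) throws Exception {
            long[] data = new long[1_000_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = i;
            }

            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            // One task per core, each summing its own slice of the array.
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = data.length / cores;
            for (int c = 0; c < cores; c++) {
                final int start = c * chunk;
                final int end = (c == cores - 1) ? data.length : start + chunk;
                parts.add(pool.submit((Callable<Long>) () -> {
                    long sum = 0;
                    for (int i = start; i < end; i++) {
                        sum += data[i];
                    }
                    return sum;
                }));
            }

            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            pool.shutdown();
            System.out.println(total);
        }
    }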

As a software engineer this means that the field is now open for new applications to be written. There are very few modern solutions to solving these problems, so opportunity exists in almost every niche to innovate.

To me personally it means that it is a fun time to be writing software.
