Drizzle, Views And Triggers

« previous entry | next entry »
Jul. 15th, 2009 | 01:15 pm

I've been trading twitter responses back and forth with jswanhart of Kickfire recently on the state and planning of Views in Drizzle.

In Drizzle right now we do not have views. There are plans to add views which never "materialize", but that is still a couple of milestones off.

One of the problems when talking about views is that the word "materialize" has been over used.

To "materialize" a view, means that you take the view definition, turn it into a temporary table, and then join it against a query. In Drizzle we consider this a "no no".

Why? Because any query that does this is more then likely going to turn out to be to slow to be of any real use in our domain space. The same problem exists with subqueries though for the time being we are leaving them in. Subqueries at this point are only semi-useful. Many of the common methods of using them result in a materialized table, which just makes them too slow. Our plans involve eliminating these sorts of queries and only allowing ones which can be "merged".

This will be limiting, but experience shows that the current state of use is that most users don't want to be able to write a query that will under perform (unless I am wrong, Google has disabled the subquery system in their version of MySQL just to make sure no developer gets anywhere near this part of the system).

Frankly, I would like to take this one step further at some point, and eliminate cartesian joins unless they are explicitly requested. How often do you really want one? Pretty much never. Typically they are done by accident and you either wait out the mistake or start hitting "CTRL-C" hoping to kill the query before it eats up the machine.

So "materializing" is out, but what is not out is "materialized views". This is an SQL feature whereby a table is created and is updated based on entries from one or more tables, either syncronously or asycronously. This is something I do expect us to support at some point.

One of the strengths in the current design is that the replication system lives as a service to the micro-kernel.

What does this mean?

It means that we internally have triggers on any sort of DDL/DML event that occurs in the server. As soon as someone wants to write up some parser goodness we can have basic triggers in whatever language you want.

We also gain materialized views via the same method. The "tires" on this project have been kicked around but no one has started it yet.

Having everything built on the same entry point to the micro-kernel really simplifies the design of a lot of different components. We have someone doing the design on direct to memcached replication for instance (we will have UDF as well, but this will simplify cache coherency and not require you to modify any SQL to make it work).

Materialized Views are pretty cool, materializing a view though pretty much make you want to cry.


BTW It was just pointed out to me that Justin had some of his own thoughts about this.

Link | Leave a comment | Add to Memories | Share

Comments {26}

Justin Swanhart

(no subject)

from: swanhart
date: Jul. 15th, 2009 09:28 pm (UTC)
Link

Materialized Views can server a lot of very useful purposes. There are two distinct approaches to materializing views, based on when the view is actually updated. I'm unsure which you were referring to in your description, so I figured I'd quickly describe both.

Synchronous materialized view refresh algorithms update the views as soon as the commit happens on the base table. Since the view must always be "in-sync" with the base table, the materialized view update must be done either in two-phase commit, or in the same transaction as the DML. This usually means the algorithm must be implemented by triggers, which issue DML statments on materialized views during base table updates.

Asynchronous algorithms collect row changes over time and apply them in a batch, often long after the commit happened on the base table. The row changes are normally collected by a process which reads the database logs and produces tuples from the changes, or a trigger which stores changed tuples in an auxillary table.

The "rolling join propagation algorithm" as implemented by Flexviews, is an asynchronous algorithm. Changes are serialized and collected by some mechanism (trigger, replication, etc), and inserted into a change log table for each base table. The algorithm then reads from a combination of base tables and table change logs to produce deltas for the view. These are called 'compensation queries'. I've been reading a paper on synchronous materialization which supports LEFT JOIN, so I can see benefits for both algorithms, and can see how both could fit into Drizzle.

Synchronous algorithms have a much higher impact on individual DML transactions than asynchronous methods. Asynchronous methods can't materialize as many aggregate functions.

Since I've already invested a lot of time in understanding and implementing an asynchronous algorithm, I would start from there.

Reply | Thread

As I noted on the drizzle list...

from: jzawodn
date: Jul. 15th, 2009 09:43 pm (UTC)
Link

I like the idea of being able to either:

(1) know that materialized views will never be done (because they are not there)

OR

(2) have a conf file option to turn them off on a per-server basis

That would help to reduce risk and keep people from building [some] slow queries on our high-volume production boxes. Maybe we'd enable it (if the support existed) on our internal db replicas, but that's it.

Reply | Parent | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Jul. 15th, 2009 10:29 pm (UTC)
Link

The architecture in Drizzle is open enough to really do either... the only issue with asynchronous would be how to correctly restart from a crashed state (assuming you are doing durability with your tables).

Reply | Parent | Thread

Justin Swanhart

Resuming from crash state

from: swanhart
date: Jul. 15th, 2009 11:04 pm (UTC)
Link

Currently flexviews uses triggers, so the logs are written in the same transaction as the base table. There is a global transaction table which is locked by the triggers and results in transactions being serialized.

If the database crashes, the logs roll back along with the base table changes.

The replication stream must include a transaction_id or group_id so that transactions can be appropriately replayed as a group when restoring state after a crash.

Reply | Parent | Thread

Brian "Krow" Aker

Re: Resuming from crash state

from: krow
date: Jul. 15th, 2009 11:10 pm (UTC)
Link

So in Drizzle you could read raw from the stream... and only have to go back to the "storage" for a crash.

Reply | Parent | Thread

Justin Swanhart

Re: Resuming from crash state

from: swanhart
date: Jul. 15th, 2009 11:17 pm (UTC)
Link

Do only committed records come through the stream? Do all the tuples for a transaction come through at once? Are they marked as belonging together?

Reply | Parent | Thread

Brian "Krow" Aker

Re: Resuming from crash state

from: krow
date: Jul. 15th, 2009 11:19 pm (UTC)
Link

You get everything in the stream.. there are actually two "levels" to it. You could either drink in batch, or drink straight from the firehouse. Either way though... you will need to handle a possible rollback.

Reply | Parent | Thread

Justin Swanhart

Re: Resuming from crash state

from: swanhart
date: Jul. 15th, 2009 11:22 pm (UTC)
Link

Does the existing file plugin serializes the transactions? I'll have a look at the plugin source when I get a few free moments.

Reply | Parent | Thread

Brian "Krow" Aker

Re: Resuming from crash state

from: krow
date: Jul. 15th, 2009 11:24 pm (UTC)
Link

No, the log is not serialized. You can recreate that by looking at the id in the event structure.

Reply | Parent | Thread

Justin Swanhart

Re: Resuming from crash state

from: swanhart
date: Jul. 17th, 2009 11:38 pm (UTC)
Link

Which id? Is there a global transaction id? I need to serialize transactions in commit order.

Also, is the commit not complete until the replication stream accepts that the transaction has been logged to disk, applied remotely, etc?

Reply | Parent | Thread

Brian "Krow" Aker

Re: Resuming from crash state

from: krow
date: Jul. 17th, 2009 11:49 pm (UTC)
Link

There is an event id for a collection that is encoded. It gives you proper order.

Right now the transaction is doing a two phase commit, so until the last event which records the commit are done, the transaction will not be finished.

Reply | Parent | Thread

Justin Swanhart

Re: Resuming from crash state

from: swanhart
date: Jul. 15th, 2009 11:11 pm (UTC)
Link

The views are updated transactionally too. All updates to a view are done in a single transaction, so if the database crashes, a partially modifed materialized view won't be exposed. Flexviews was not designed to work on non-transactional tables, and no consistency or correctness can be guaranteed on them.

Reply | Parent | Thread

Brian "Krow" Aker

Re: Resuming from crash state

from: krow
date: Jul. 15th, 2009 11:13 pm (UTC)
Link

Tie'ing into replication gives you transactional state... so no issues there.

As far as engines go... it would be up to you to toss an error or not if the wrong engine was selected.

Reply | Parent | Thread

Terminology?

from: rpbouman.blogspot.com
date: Jul. 15th, 2009 10:31 pm (UTC)
Link

Hi!

No offense but to me your wording is at times quite confusing. Look at this:

"
There are plans to add "non-materializing views", but that is still a couple of milestones off.
...
One of the problems when talking about views is that the word 'materialize' has been over used.
To 'materialize' a view, it means that you take the view definition, turn it into a temporary table, and then join it against a query. In Drizzle we consider this a 'no no'.
...
Subqueries at this point are only semi-useful. Many of the common methods of using them result in a 'derived' table, aka a materialized table, which just makes them too slow.
...
So 'materializing' is out, but what is not out is 'materialized joins'.
This is an SQL feature whereby a table is created and is updated based on entries from other tables. This is something I do expect us to support at some point.
....
We can also have materialized views via the same method. The 'tires' on this project have been kicked around but no one has started it yet.
...
Materialized Views are pretty cool, materializing a view though pretty much make you want to cry.
"

All documentation I have seen so far uses the term "materialized view" to denote a special class of database view objects that use a base table to cache the result of the SQL query underlying the view. This is also what Justin is referring too AFAICS. For that reason, the term "non-materializing view" you used in the second paragraph is confusing to me - at a glance it looks like you mean "views, not materialized views" but I think you meant in fact views that can be evaluated without first evaluating the views' underlying query and storing the result in a temporary table.

Later on you refer to the term "materialized join". I have never seen that term, except as part of "materialized join view", in which case it refers to a materialized view that has an underlying query that contains a join. In your post, you seem to use the term "materialized join" to denote what is usually referred to as "materialized view" - ("This is an SQL feature whereby a table is created and is updated based on entries from other tables. "). A materialized view need not be based on a join query at all (I mean, a query involving a JOIN operation), therefore it is confusing to use the term "materialized join" if you actually mean "materialized view".

Another point that I found confusing is your use of the term "derived table". You seem to use that term to denote any case where a subquery is first evaluated and stored in a temptable for later use. So far, I've seen the term "derived table" being used exclusively for a special class of subqueries, "a subquery in the FROM clause", also known as an "inline view". The term derived table is normally used to denote this syntax construct, regardless of whether the unerlying implementation has to first evealuate the subquery and store it in a temporary table.

So I guess I'm saying I'm not really sure what you mean anymore when you say "materialized view".

kind regards,

Roland

Reply | Thread

Justin Swanhart

(no subject)

from: swanhart
date: Jul. 15th, 2009 10:58 pm (UTC)
Link

Some terms:

Materialization = Materialization is storing either in memory, or on disk, one or more tuples, either permanently, or temporarily.

View = A database object which references other database objects, as a SQL SELECT statement. When the view is accessed, ideally, the SQL statement accessing the view will be modified. This is known as "merging" the statement with the view. MySQL isn't able to use the merge optimization very often, and instead materializes the table. This results in low performance when using views.

Temporary tables = Temporary tables are a form of materialization which may be done either EXPLICITLY, ie, a temporary table is created and populated by DML, or IMPLICITLY, i.e., when a view is accessed which must be materialized to answer the query which uses the view. Implicit temporary tables are usually not indexed (may not be true in MySQL 6(blah whatever, I'm still calling it MySQL 6)...


Materialized View = A special class of database object which appears to be a normal table, but in fact is an object built from data in other tables. Materialized views are first class database objects. Materialized views are normally built from a SQL select statement, and the materialization is updated, either asynchronously or synchronously with table changes.

Materialized views can be indexed, partitioned or otherwise structured for improved query performance over "regular' views, particularly regular views which must be temporarily materialized.
Implicit materialization is bad for performance and should be avoided wherever possible.

Reply | Thread

Views - Sine qua non

from: anonymous
date: Jul. 16th, 2009 08:39 am (UTC)
Link

Hi,
Views are very useful to separate the physical database from logical views. I care little if they are materialized or materialising, just as long as they behave as views. But a DBMS without views "no no".

Reply | Thread

Brian "Krow" Aker

Re: Views - Sine qua non

from: krow
date: Jul. 16th, 2009 03:04 pm (UTC)
Link

There are plenty of database use patterns where performance is not an issue. If you just need something "to work", that is great, but that is not the problem I tend to look at solving.

Reply | Parent | Thread

Re: Views - Sine qua non

from: jaypipes.myopenid.com
date: Jul. 16th, 2009 05:13 pm (UTC)
Link

PeterG?

Don't know many others who would use the term Sine qua non so glibly... :)

In any case, views are indeed useful, typically for a DBA or reporting-type db user who doesn't care to repeat typing in very long or complex statements. However, views which degrade performance of the server don't belong in the core. In Drizzle, they will be built and enabled as a rewriter plugin which should not effect environments which do not need them.

Cheers!

Jay

Reply | Parent | Thread

Brian "Krow" Aker

Re: Views - Sine qua non

from: krow
date: Jul. 16th, 2009 05:40 pm (UTC)
Link

The part I strongly believe about views, and their value, is in the ability to encapsulate underlying changes to table structure/name. That has always been to me the single most useful feature they provide.

Reply | Parent | Thread

subqueries are "modular"

from: anonymous
date: Jul. 16th, 2009 10:30 am (UTC)
Link

Another key plus point for subqueries is that they are "modular". That is if I have to construct my queries dynamically (think search interface), its much easier to add filters via a subquery .. since you only need to inject the filter in a single places .. with a join you need to mess with the from clause as well .. furthermore you need to watch out for ambiguities in the select list.

regards,
Lukas
smith@pooteeweet.org

Reply | Thread

Brian "Krow" Aker

Re: subqueries are "modular"

from: krow
date: Jul. 16th, 2009 03:06 pm (UTC)
Link

Depending on what you are doing the sub-query is just being merged into the join (so the database is doing the work for you). This type of subquery to me is a perfect example of why/what you need to support.

Reply | Parent | Thread

(no subject)

from: anonymous
date: Jul. 16th, 2009 10:32 am (UTC)
Link

I'm wondering how far one can go down the "don't allow inefficient constructs" route. Should one not allow use of "WHERE column=const" construct unless there is an index on the column in question?

If you disallow VIEWs, why not disable the FROM subqueries also? At the moment VIEW handling is more efficient than FROM-clause subqueries...

-- Sergey P.

Reply | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Jul. 16th, 2009 03:11 pm (UTC)
Link

The FROM subquery clause raises my concern as well... I've not really made up my mind on it at this point. We have left subqueries alone for the moment... I am not sure when we can get around to tackling them.

On a different note, as fas as column=const goes with indexes... requiring users to create indexes is a complete hack, long term we need to figure out how to do performance and not require this (and I don't mean just go the direction of using column stores and sacrifice write performance).

Reply | Parent | Thread

Justin Swanhart

(no subject)

from: swanhart
date: Jul. 17th, 2009 11:50 pm (UTC)
Link

Are there plans to add hash joins to Drizzle?

The inline view (subquery in the from clause) is very convenient, and powerful, but suffers from the problem that the subquery often must be materialized with no indexes. Even if indexes could be added automatically, cardinality of the joined columns can be bad anyway.

To really properly support inline views which must be temporarily materialized I think you need a better join algorithm when indexes are not available. Of course, hash join only works for efficiently for equijoins and antijoins, and sometimes inline views are outer joined.

Or perhaps build bitmaps on the joined columns (ideally compressed bitmaps ala fastbit) as bitmap lookups have shown to be as efficient as btree index lookups and handle low cardinality data far better.

Reply | Parent | Thread

Brian "Krow" Aker

(no subject)

from: krow
date: Jul. 18th, 2009 03:18 am (UTC)
Link

I want to finish some refactoring first on the JOIN class before I get into do a hash join. Right now there are just a couple of things that I want to do there... that being said I'd like to see someone else step and spend sometime working on it.

Reply | Parent | Thread

(no subject)

from: dledwards
date: Jul. 17th, 2009 06:39 pm (UTC)
Link

"While we’re on news ways of doing MySQL, Brian 'Krow' Aker and his readers have a very worthwhile discussion of Drizzle, views and triggers."

Log Buffer #154

Reply | Thread