Drizzle, Views And Triggers
« previous entry | next entry »
Jul. 15th, 2009 | 01:15 pm
I've been trading twitter responses back and forth with jswanhart of Kickfire recently on the state and planning of Views in Drizzle.
In Drizzle right now we do not have views. There are plans to add views which never "materialize", but that is still a couple of milestones off.
One of the problems when talking about views is that the word "materialize" has been over used.
To "materialize" a view, means that you take the view definition, turn it into a temporary table, and then join it against a query. In Drizzle we consider this a "no no".
Why? Because any query that does this is more then likely going to turn out to be to slow to be of any real use in our domain space. The same problem exists with subqueries though for the time being we are leaving them in. Subqueries at this point are only semi-useful. Many of the common methods of using them result in a materialized table, which just makes them too slow. Our plans involve eliminating these sorts of queries and only allowing ones which can be "merged".
This will be limiting, but experience shows that the current state of use is that most users don't want to be able to write a query that will under perform (unless I am wrong, Google has disabled the subquery system in their version of MySQL just to make sure no developer gets anywhere near this part of the system).
Frankly, I would like to take this one step further at some point, and eliminate cartesian joins unless they are explicitly requested. How often do you really want one? Pretty much never. Typically they are done by accident and you either wait out the mistake or start hitting "CTRL-C" hoping to kill the query before it eats up the machine.
So "materializing" is out, but what is not out is "materialized views". This is an SQL feature whereby a table is created and is updated based on entries from one or more tables, either syncronously or asycronously. This is something I do expect us to support at some point.
One of the strengths in the current design is that the replication system lives as a service to the micro-kernel.
What does this mean?
It means that we internally have triggers on any sort of DDL/DML event that occurs in the server. As soon as someone wants to write up some parser goodness we can have basic triggers in whatever language you want.
We also gain materialized views via the same method. The "tires" on this project have been kicked around but no one has started it yet.
Having everything built on the same entry point to the micro-kernel really simplifies the design of a lot of different components. We have someone doing the design on direct to memcached replication for instance (we will have UDF as well, but this will simplify cache coherency and not require you to modify any SQL to make it work).
Materialized Views are pretty cool, materializing a view though pretty much make you want to cry.
BTW It was just pointed out to me that Justin had some of his own thoughts about this.
In Drizzle right now we do not have views. There are plans to add views which never "materialize", but that is still a couple of milestones off.
One of the problems when talking about views is that the word "materialize" has been over used.
To "materialize" a view, means that you take the view definition, turn it into a temporary table, and then join it against a query. In Drizzle we consider this a "no no".
Why? Because any query that does this is more then likely going to turn out to be to slow to be of any real use in our domain space. The same problem exists with subqueries though for the time being we are leaving them in. Subqueries at this point are only semi-useful. Many of the common methods of using them result in a materialized table, which just makes them too slow. Our plans involve eliminating these sorts of queries and only allowing ones which can be "merged".
This will be limiting, but experience shows that the current state of use is that most users don't want to be able to write a query that will under perform (unless I am wrong, Google has disabled the subquery system in their version of MySQL just to make sure no developer gets anywhere near this part of the system).
Frankly, I would like to take this one step further at some point, and eliminate cartesian joins unless they are explicitly requested. How often do you really want one? Pretty much never. Typically they are done by accident and you either wait out the mistake or start hitting "CTRL-C" hoping to kill the query before it eats up the machine.
So "materializing" is out, but what is not out is "materialized views". This is an SQL feature whereby a table is created and is updated based on entries from one or more tables, either syncronously or asycronously. This is something I do expect us to support at some point.
One of the strengths in the current design is that the replication system lives as a service to the micro-kernel.
What does this mean?
It means that we internally have triggers on any sort of DDL/DML event that occurs in the server. As soon as someone wants to write up some parser goodness we can have basic triggers in whatever language you want.
We also gain materialized views via the same method. The "tires" on this project have been kicked around but no one has started it yet.
Having everything built on the same entry point to the micro-kernel really simplifies the design of a lot of different components. We have someone doing the design on direct to memcached replication for instance (we will have UDF as well, but this will simplify cache coherency and not require you to modify any SQL to make it work).
Materialized Views are pretty cool, materializing a view though pretty much make you want to cry.
BTW It was just pointed out to me that Justin had some of his own thoughts about this.
(no subject)
from:
swanhart
date: Jul. 15th, 2009 09:28 pm (UTC)
Link
Synchronous materialized view refresh algorithms update the views as soon as the commit happens on the base table. Since the view must always be "in-sync" with the base table, the materialized view update must be done either in two-phase commit, or in the same transaction as the DML. This usually means the algorithm must be implemented by triggers, which issue DML statments on materialized views during base table updates.
Asynchronous algorithms collect row changes over time and apply them in a batch, often long after the commit happened on the base table. The row changes are normally collected by a process which reads the database logs and produces tuples from the changes, or a trigger which stores changed tuples in an auxillary table.
The "rolling join propagation algorithm" as implemented by Flexviews, is an asynchronous algorithm. Changes are serialized and collected by some mechanism (trigger, replication, etc), and inserted into a change log table for each base table. The algorithm then reads from a combination of base tables and table change logs to produce deltas for the view. These are called 'compensation queries'. I've been reading a paper on synchronous materialization which supports LEFT JOIN, so I can see benefits for both algorithms, and can see how both could fit into Drizzle.
Synchronous algorithms have a much higher impact on individual DML transactions than asynchronous methods. Asynchronous methods can't materialize as many aggregate functions.
Since I've already invested a lot of time in understanding and implementing an asynchronous algorithm, I would start from there.
Reply | Thread
As I noted on the drizzle list...
from:
jzawodn
date: Jul. 15th, 2009 09:43 pm (UTC)
Link
(1) know that materialized views will never be done (because they are not there)
OR
(2) have a conf file option to turn them off on a per-server basis
That would help to reduce risk and keep people from building [some] slow queries on our high-volume production boxes. Maybe we'd enable it (if the support existed) on our internal db replicas, but that's it.
Reply | Parent | Thread
(no subject)
from:
krow
date: Jul. 15th, 2009 10:29 pm (UTC)
Link
Reply | Parent | Thread
Resuming from crash state
from:
swanhart
date: Jul. 15th, 2009 11:04 pm (UTC)
Link
If the database crashes, the logs roll back along with the base table changes.
The replication stream must include a transaction_id or group_id so that transactions can be appropriately replayed as a group when restoring state after a crash.
Reply | Parent | Thread
Re: Resuming from crash state
from:
krow
date: Jul. 15th, 2009 11:10 pm (UTC)
Link
Reply | Parent | Thread
Re: Resuming from crash state
from:
swanhart
date: Jul. 15th, 2009 11:17 pm (UTC)
Link
Reply | Parent | Thread
Re: Resuming from crash state
from:
krow
date: Jul. 15th, 2009 11:19 pm (UTC)
Link
Reply | Parent | Thread
Re: Resuming from crash state
from:
swanhart
date: Jul. 15th, 2009 11:22 pm (UTC)
Link
Reply | Parent | Thread
Re: Resuming from crash state
from:
krow
date: Jul. 15th, 2009 11:24 pm (UTC)
Link
Reply | Parent | Thread
Re: Resuming from crash state
from:
swanhart
date: Jul. 17th, 2009 11:38 pm (UTC)
Link
Also, is the commit not complete until the replication stream accepts that the transaction has been logged to disk, applied remotely, etc?
Reply | Parent | Thread
Re: Resuming from crash state
from:
krow
date: Jul. 17th, 2009 11:49 pm (UTC)
Link
Right now the transaction is doing a two phase commit, so until the last event which records the commit are done, the transaction will not be finished.
Reply | Parent | Thread
Re: Resuming from crash state
from:
swanhart
date: Jul. 15th, 2009 11:11 pm (UTC)
Link
Reply | Parent | Thread
Re: Resuming from crash state
from:
krow
date: Jul. 15th, 2009 11:13 pm (UTC)
Link
As far as engines go... it would be up to you to toss an error or not if the wrong engine was selected.
Reply | Parent | Thread
Terminology?
from:
rpbouman.blogspot.com
date: Jul. 15th, 2009 10:31 pm (UTC)
Link
No offense but to me your wording is at times quite confusing. Look at this:
"
There are plans to add "non-materializing views", but that is still a couple of milestones off.
...
One of the problems when talking about views is that the word 'materialize' has been over used.
To 'materialize' a view, it means that you take the view definition, turn it into a temporary table, and then join it against a query. In Drizzle we consider this a 'no no'.
...
Subqueries at this point are only semi-useful. Many of the common methods of using them result in a 'derived' table, aka a materialized table, which just makes them too slow.
...
So 'materializing' is out, but what is not out is 'materialized joins'.
This is an SQL feature whereby a table is created and is updated based on entries from other tables. This is something I do expect us to support at some point.
....
We can also have materialized views via the same method. The 'tires' on this project have been kicked around but no one has started it yet.
...
Materialized Views are pretty cool, materializing a view though pretty much make you want to cry.
"
All documentation I have seen so far uses the term "materialized view" to denote a special class of database view objects that use a base table to cache the result of the SQL query underlying the view. This is also what Justin is referring too AFAICS. For that reason, the term "non-materializing view" you used in the second paragraph is confusing to me - at a glance it looks like you mean "views, not materialized views" but I think you meant in fact views that can be evaluated without first evaluating the views' underlying query and storing the result in a temporary table.
Later on you refer to the term "materialized join". I have never seen that term, except as part of "materialized join view", in which case it refers to a materialized view that has an underlying query that contains a join. In your post, you seem to use the term "materialized join" to denote what is usually referred to as "materialized view" - ("This is an SQL feature whereby a table is created and is updated based on entries from other tables. "). A materialized view need not be based on a join query at all (I mean, a query involving a JOIN operation), therefore it is confusing to use the term "materialized join" if you actually mean "materialized view".
Another point that I found confusing is your use of the term "derived table". You seem to use that term to denote any case where a subquery is first evaluated and stored in a temptable for later use. So far, I've seen the term "derived table" being used exclusively for a special class of subqueries, "a subquery in the FROM clause", also known as an "inline view". The term derived table is normally used to denote this syntax construct, regardless of whether the unerlying implementation has to first evealuate the subquery and store it in a temporary table.
So I guess I'm saying I'm not really sure what you mean anymore when you say "materialized view".
kind regards,
Roland
Reply | Thread
(no subject)
from:
swanhart
date: Jul. 15th, 2009 10:58 pm (UTC)
Link
Materialization = Materialization is storing either in memory, or on disk, one or more tuples, either permanently, or temporarily.
View = A database object which references other database objects, as a SQL SELECT statement. When the view is accessed, ideally, the SQL statement accessing the view will be modified. This is known as "merging" the statement with the view. MySQL isn't able to use the merge optimization very often, and instead materializes the table. This results in low performance when using views.
Temporary tables = Temporary tables are a form of materialization which may be done either EXPLICITLY, ie, a temporary table is created and populated by DML, or IMPLICITLY, i.e., when a view is accessed which must be materialized to answer the query which uses the view. Implicit temporary tables are usually not indexed (may not be true in MySQL 6(blah whatever, I'm still calling it MySQL 6)...
Materialized View = A special class of database object which appears to be a normal table, but in fact is an object built from data in other tables. Materialized views are first class database objects. Materialized views are normally built from a SQL select statement, and the materialization is updated, either asynchronously or synchronously with table changes.
Materialized views can be indexed, partitioned or otherwise structured for improved query performance over "regular' views, particularly regular views which must be temporarily materialized.
Implicit materialization is bad for performance and should be avoided wherever possible.
Reply | Thread
Views - Sine qua non
from: anonymous
date: Jul. 16th, 2009 08:39 am (UTC)
Link
Views are very useful to separate the physical database from logical views. I care little if they are materialized or materialising, just as long as they behave as views. But a DBMS without views "no no".
Reply | Thread
Re: Views - Sine qua non
from:
krow
date: Jul. 16th, 2009 03:04 pm (UTC)
Link
Reply | Parent | Thread
Re: Views - Sine qua non
from:
jaypipes.myopenid.com
date: Jul. 16th, 2009 05:13 pm (UTC)
Link
Don't know many others who would use the term Sine qua non so glibly... :)
In any case, views are indeed useful, typically for a DBA or reporting-type db user who doesn't care to repeat typing in very long or complex statements. However, views which degrade performance of the server don't belong in the core. In Drizzle, they will be built and enabled as a rewriter plugin which should not effect environments which do not need them.
Cheers!
Jay
Reply | Parent | Thread
Re: Views - Sine qua non
from:
krow
date: Jul. 16th, 2009 05:40 pm (UTC)
Link
Reply | Parent | Thread
subqueries are "modular"
from: anonymous
date: Jul. 16th, 2009 10:30 am (UTC)
Link
regards,
Lukas
smith@pooteeweet.org
Reply | Thread
Re: subqueries are "modular"
from:
krow
date: Jul. 16th, 2009 03:06 pm (UTC)
Link
Reply | Parent | Thread
(no subject)
from: anonymous
date: Jul. 16th, 2009 10:32 am (UTC)
Link
If you disallow VIEWs, why not disable the FROM subqueries also? At the moment VIEW handling is more efficient than FROM-clause subqueries...
-- Sergey P.
Reply | Thread
(no subject)
from:
krow
date: Jul. 16th, 2009 03:11 pm (UTC)
Link
On a different note, as fas as column=const goes with indexes... requiring users to create indexes is a complete hack, long term we need to figure out how to do performance and not require this (and I don't mean just go the direction of using column stores and sacrifice write performance).
Reply | Parent | Thread
(no subject)
from:
swanhart
date: Jul. 17th, 2009 11:50 pm (UTC)
Link
The inline view (subquery in the from clause) is very convenient, and powerful, but suffers from the problem that the subquery often must be materialized with no indexes. Even if indexes could be added automatically, cardinality of the joined columns can be bad anyway.
To really properly support inline views which must be temporarily materialized I think you need a better join algorithm when indexes are not available. Of course, hash join only works for efficiently for equijoins and antijoins, and sometimes inline views are outer joined.
Or perhaps build bitmaps on the joined columns (ideally compressed bitmaps ala fastbit) as bitmap lookups have shown to be as efficient as btree index lookups and handle low cardinality data far better.
Reply | Parent | Thread
(no subject)
from:
krow
date: Jul. 18th, 2009 03:18 am (UTC)
Link
Reply | Parent | Thread
(no subject)
from:
dledwards
date: Jul. 17th, 2009 06:39 pm (UTC)
Link
Log Buffer #154
Reply | Thread