Prepared Statements, Musings
« previous entry | next entry »
Jun. 9th, 2008 | 03:45 pm
In the back of my mind for the last couple of weeks have been some musings on prepared statements.
So I walk into this customers office and they don't speak english.
I only speak a few words of their language.
It goes like this:
"Blah, blah, blah, Prepared Statements, blah, blah..."
I say "Turn them off, they are what is crashing your application."
The end customer is confused because the translator had not finished before I gave the answer.
So I explain the problem... the customer has a lot of connections to the database... a lot... and the database is running out of memory and crashing. Was this the customer's problem? Yes it was.
This is a practiced conversation for me. I could do a song and dance on "well you should be building your application like..." but the customer wants to build their application their way. They want a lot of SQL to exist that they are going to generate on the fly. The shortest answer is "turn them off".
So why are prepared statements a problem?
Because users do not clean up/close unused prepared statements. Multiply the number of prepared statements times the number of open connections and you begin to see the problem.
What do you do about this?
Turn them off. The Java driver and a few other drivers turn them off automatically.
Why?
Because unless you really happen to know what you are doing with them, you are going to crash the database (somewhere in the back of my mind is "this is why we moved to garbage collected languages").
I'm not going to defend the design.
Are they faster?
Well "yes" if you have blobs, otherwise... still "yes" for most cases.
Each connection thread in MySQL keeps its own copy. There is no global pool of prepared statements on the server.
And if there was a global pool?
We would need a bunch of read/write locks. Locks are a real problem right now across the industry. I do not believe that any application being built today which has to share data/state has this solved.
Locks harm scaleability. Sure... you could keep a cache of them local to the connection that reads from a central cache, but this would just be pushing out the problem with caching (which is the curren trend in scaling work arounds). The problem is that we see MySQL set ups with thousands of connections, so even this solution is going to chew up a lot of memory.
So what am I thinking?
Ditch the current design completely. Push the parser out to the client, and let it send out a stripped down bytecode language which can be de-serialized back to something we can execute on.
Just skip the entire "lets put this in the server" approach for prepared statements. No cache needed, and you push processing off to the client where it belongs. Will it require a new extension to the protocol? Yes.
The client library would also get a bit bigger, but that does not concern me as much. You can pull sql_yacc.yy and a few functions into a library and get away from the problem of supporting two parsers.
To someone writing against the database there would be no change, dynamic SQL would continue to work as is. The difference is all in what the client is doing internally, and what it is shipping over the wire.
Heck, its even modular at this point :)
Just thinking about a solution...
So I walk into this customers office and they don't speak english.
I only speak a few words of their language.
It goes like this:
"Blah, blah, blah, Prepared Statements, blah, blah..."
I say "Turn them off, they are what is crashing your application."
The end customer is confused because the translator had not finished before I gave the answer.
So I explain the problem... the customer has a lot of connections to the database... a lot... and the database is running out of memory and crashing. Was this the customer's problem? Yes it was.
This is a practiced conversation for me. I could do a song and dance on "well you should be building your application like..." but the customer wants to build their application their way. They want a lot of SQL to exist that they are going to generate on the fly. The shortest answer is "turn them off".
So why are prepared statements a problem?
Because users do not clean up/close unused prepared statements. Multiply the number of prepared statements times the number of open connections and you begin to see the problem.
What do you do about this?
Turn them off. The Java driver and a few other drivers turn them off automatically.
Why?
Because unless you really happen to know what you are doing with them, you are going to crash the database (somewhere in the back of my mind is "this is why we moved to garbage collected languages").
I'm not going to defend the design.
Are they faster?
Well "yes" if you have blobs, otherwise... still "yes" for most cases.
Each connection thread in MySQL keeps its own copy. There is no global pool of prepared statements on the server.
And if there was a global pool?
We would need a bunch of read/write locks. Locks are a real problem right now across the industry. I do not believe that any application being built today which has to share data/state has this solved.
Locks harm scaleability. Sure... you could keep a cache of them local to the connection that reads from a central cache, but this would just be pushing out the problem with caching (which is the curren trend in scaling work arounds). The problem is that we see MySQL set ups with thousands of connections, so even this solution is going to chew up a lot of memory.
So what am I thinking?
Ditch the current design completely. Push the parser out to the client, and let it send out a stripped down bytecode language which can be de-serialized back to something we can execute on.
Just skip the entire "lets put this in the server" approach for prepared statements. No cache needed, and you push processing off to the client where it belongs. Will it require a new extension to the protocol? Yes.
The client library would also get a bit bigger, but that does not concern me as much. You can pull sql_yacc.yy and a few functions into a library and get away from the problem of supporting two parsers.
To someone writing against the database there would be no change, dynamic SQL would continue to work as is. The difference is all in what the client is doing internally, and what it is shipping over the wire.
Heck, its even modular at this point :)
Just thinking about a solution...
deja vu
from:
jimw
date: Jun. 9th, 2008 11:07 pm (UTC)
Link
you'd want some sort of cross-language parser generator, so that we wouldn't have to implement parsers in c, java, c#, python, perl, etc.
Reply | Thread
Re: deja vu
from:
krow
date: Jun. 9th, 2008 11:21 pm (UTC)
Link
Reply | Parent | Thread
Bytecodes
from:
kazuho
date: Jun. 10th, 2008 12:15 am (UTC)
Link
Yeah! I definitely agree that this is the way to go. It would no only improve server performance, but also opens the possibility of hand-coding access to storage engines (for better performance than what SQL optimizers can achieve).
Reply | Thread
Throwing the baby out with the bathwater
from:
bkarwin
date: Jun. 10th, 2008 01:20 am (UTC)
Link
Now static SQL is a distant memory, and everyone uses dynamic SQL. Why? Because now the server determines the SQL language support uniformly for all clients. You have greater assurance that the SQL language is identical for all clients connected to a given server, regardless of the client. Think of tech support nightmare when the SQL language support varies by the client programming language, client version, and operating system!
Also, you don't have to upgrade your clients when a language bug is fixed. Upgrade the server, and voila! All clients benefit. Deployment of SQL fixes and enhancements is so much easier.
There have to be alternative solutions to the problem you mention, of leaking PS data structures. Set a server cap on the memory allocation to these data structures. Implement an LRU algorithm to toss out unused PS. Make this behavior configurable on the server, enabled by default.
Your reasoning sounds like, "people frequently fill up and fragment their hard disk because they never empty their trash can (a true statement), so we should make sure no one can use a hard disk." WTF?!?
Reply | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 10th, 2008 01:26 am (UTC)
Link
AKA no change to how applications are written today. This is about what is happening under the hood.
An LRU with a global cache still has locks, and MySQL doesn't do the second trick that many prepared statement designs does, which is to store the execution plan. So really the current implementation just gives you a way to send objects without escaping them (and a slightly parsed tree).
BTW I updated the post to clarify the point that dynamic SQL is still available after doing this. I hated embedded SQL.
Edited at 2008-06-10 01:41 am (UTC)
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
bkarwin
date: Jun. 10th, 2008 03:07 pm (UTC)
Link
I just can't understand the logical progression from "resource constraint on server" to "move parser to client". Won't that just change the location of the resource constraint problem, instead of solving the problem? Perhaps you have made some mental steps connecting the two but they are still in your head. :-)
I do understand that by moving the PS resource management to the client, it's less likely to crash the server. But if you ever want the server to cache optimization plans, you'd be back at square one, because I can't believe you're suggesting to move optimization to the client as well.
What about setting the resource cap per thread, and use a priority queue to implement an LRU at the thread level? When a PS is used, it moves to the beginning of the priority queue and assign it the current timestamp. When the PS pool is full, knock off the entry at the end, under the assumption it is the least recently used.
If the PS at the end has a timestamp so recent that you don't feel it's "stale", then either error out the creation of the new PS or else sleep for enough milliseconds.
Maintain the priority queue as a separate data structure from the thread's PS pool, so you don't have to lock the PS pool to change its "order." You could also implement alterations to the priority queue asynchronously, so they don't block continuation of work. You just have to trust that the state of the priority queue stays up to date enough.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 10th, 2008 06:25 pm (UTC)
Link
Your LRU design is one I have considered. You could cache the original statement in the client and then cause a page fault back to the client if you do not have the statement (the client itself could keep track of PS). This has been my thought for years, but it has never been clear to me that the gain will be great enough to justify the expense.
Removing parse from the server buys you a smaller response time on the server... which I do find interesting since the server is the hard part to scale.
Also, pre-parsing would allow you to make much smarter proxy servers since they would have access to the parsed statements.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
bkarwin
date: Jun. 10th, 2008 06:56 pm (UTC)
Link
I understand that converting the query into a binary protocol "should" make this immaterial, but the practical reality is that it will greatly complicate some kinds of troubleshooting. You haven't responded to this concern.
For example, a customer calls tech support: "I have an SQL query that works when I execute it from the 6.0 Java client, but the exact same SQL fails, or gives a different result, or performs much differently, when I execute it from the 6.1 C++ client."
What then?
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 10th, 2008 07:10 pm (UTC)
Link
Most clients are built on top of libmysql, so I am less worried about each client responding differently. SQL parse errors are still errors.
I suspect most people don't even need to worry about this... aka the performance gain is smaller.
Now! One other way to do this is to have the server send back the parsed lex structure to the client in the prepare state. This would keep the parser in the server, but would leave it up to the client to send the serialized structure.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
bkarwin
date: Jun. 10th, 2008 07:40 pm (UTC)
Link
Suppose for the sake of argument we take Java out of the equation, and all my clients use libmysql. But some of them are running client 6.0.24 and some of them are running client 6.0.30 and some are running 6.1.2-beta.
All of these versions have different bugs, and perhaps whole language features if the version bumps. If the client contains the definition of the SQL language, that means your application must use a specific version of the client to work correctly. And all the clients in your deployment environment must have the same version to have assurance of working the same.
For instance, suppose MySQL 6.1 introduces support for a SQL feature, such as INTERSECT. Clients 6.0 doesn't support it, so to use that feature I'd have to upgrade all my clients. Whereas if the server was solely responsible for the SQL support and the client is leaner, then a mixed environment is more likely to just work.
Likewise more subtle differences like bugs can cause problems as you move more logic into the client.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 10th, 2008 07:47 pm (UTC)
Link
This outside of the domain of this topic, but right now MySQL implements far more SQL then what is in general needed.
Your INTERSECT gives a good example of why keeping the parser centralized but shipping the bytecode back to the client would be good. The Client would just know "this byte array goes with this prepare statement". So to it it would be just a logical packet sent to represent the query.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
bkarwin
date: Jun. 10th, 2008 10:12 pm (UTC)
Link
"Far more SQL than what is in general needed" may be true, but it is simultaneously true that MySQL lacks some SQL features that are, in general, needed. CHECK constraints are an obvious example.
Storing bytecode for prepared statements in the client doesn't accomplish much. If you parse in the server and store bytecode in the client, you've doubled the bandwidth requirement for a prepare().
The unbounded resource consumption is still an issue, it'll just crash the client before it crashes the server. I mean, wouldn't these chunks of bytecode accumulate as connection-scoped garbage? Persistent connections would accumulate bytecode and eventually that would clobber the client. I understand the general downsides of persistent connections, but people still use them.
So if I understand correctly, you're advocating moving persistent PS resources into the client, because it's preferable for clients to crash instead of servers, when app developers fail to close their PS's tidily.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 11th, 2008 01:30 am (UTC)
Link
The bytecode for the execution will be a fraction of the original statement. Caching these would be trivial in size. The sending of them? Unless you are sending large amounts of data everything would still fit in a packet.
Putting the LRU in the client solves the client dieing.
Advantage?
* Smarter proxies
* Less memory in the server
* Faster execution on the server
The client should not leak memory. Well, if it does it means the developer is not calling close on the mysql structure.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
bkarwin
date: Jun. 11th, 2008 01:50 am (UTC)
Link
If developers were to be diligent about calling close on mysql structures, wouldn't that also solve the original issue you described, of servers falling over, _today_, with no changes at all?
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 11th, 2008 02:00 am (UTC)
Link
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
volodymir_k
date: Jun. 11th, 2008 01:20 pm (UTC)
Link
Yet another story with security checks.
Client 1: select username, password from users;
Server: OK! your bytes are 78423459124701827489124.
Admin: Hm... REVOKE select (password) on users FROM client1;
Client 2: my bytes are 78423459124701827489124.
Server: OK: scott / tiger.
I still not convinced about "global locking is bad". Transaction isolation levels lead either to 2 solutions: versions (Interbase deadman) or locks (Oracle?). (A mix is possible, I know.) Either way RDBMS must centralize disk access. PS feature, how much really is it inefficient? 2%?
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 11th, 2008 01:34 pm (UTC)
Link
Here is the thing about authentication withe databases in the domain where I put most of my energy, aka the web. Its rare to find any large shop that bothers with it at all. Controlling access to the database via firewalls and source code is the common case.
Global locking is very costly for multi-ore and more especially multi-processor systems. Avoiding all of this is the trick right now.
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
ggiunta
date: Jun. 25th, 2008 10:35 pm (UTC)
Link
- since db handshake is costly, they always preached persistent connections even for web apps (at least, thats what I did)
- execution plans, query cursors etc are kept in the sga and shared between sessions iirc. This means using actually less memory when multiple connections are open at the same time, if you use PS
- a cap is (can be) put on all resources to prevent db being blasted away on a slashdot effect
- to optimize usage of stale persistent connections by idle client processes, they just released database-resident persistent connections. Claims are scalability is increased manyfold
I dunno how (if) they solved the "global locking is bad" problem, though...
Reply | Parent | Thread
Re: Throwing the baby out with the bathwater
from:
krow
date: Jun. 25th, 2008 10:51 pm (UTC)
Link
Reply | Parent | Thread
That will be more problamatic
from:
venudb
date: Jun. 10th, 2008 05:09 am (UTC)
Link
But having a parset @client end; gives more controlled options and few calls will be almost NOOP and can be returned directly from client and in case of prepared execution, it saves lot of network bandwidth as well.
Reply | Thread
Re: That will be more problamatic
from:
krow
date: Jun. 10th, 2008 07:20 am (UTC)
Link
Reply | Parent | Thread
I am shocked.
from:
volodymir_k
date: Jun. 10th, 2008 05:12 am (UTC)
Link
BTW, how about caching small results, are they per connection also? E.g. "select * from countries" < 2Kb, will there be N conn. * 2Kbytes in copies?
Reply | Thread
Re: I am shocked.
from:
krow
date: Jun. 10th, 2008 07:19 am (UTC)
Link
Other databases created a central cache... this has not been a real problems for others because:
1) They typically run fewer connections.
2) No one has the "many cores" problem solved.
The query cache is global, so small results are global.
Reply | Parent | Thread
Re: I am shocked.
from:
awfief
date: Jun. 11th, 2008 01:40 am (UTC)
Link
In reality, performance is often the #1 motivation, and risk assessment concludes that theory can take a backseat to better performance.
In the same way, prepared statements are often used as "best practice" for security against SQL injections. But they're not so great for performance.
Reply | Parent | Thread
Re: I am shocked.
from:
krow
date: Jun. 11th, 2008 01:46 am (UTC)
Link
Multi-statement support should have been done differently. If it had... then this problem would not exist.
Reply | Parent | Thread
Re: I am shocked.
from:
awfief
date: Jun. 11th, 2008 02:01 am (UTC)
Link
Reply | Parent | Thread
Re: I am shocked.
from:
volodymir_k
date: Jun. 11th, 2008 12:48 pm (UTC)
Link
Interesting that average job interview (esp.in MS shops) has snippet
Q. Why/when prep stmts are used?
A (in solemn tone). Because The Server can optimize data access path/plan for often-repeated queries.
That is/was true for DB/2, 10 years ago I saw they had compiler(s) for "embedded SQL" thing into really used DB2 bytecode.
So once we had to persuade Customer that "yes, prep stmts are slower than usual, we are not doing anything incorrect, and that's the reality, pls trust us." How things changed.
Reply | Parent | Thread
Re: I am shocked.
from:
krow
date: Jun. 11th, 2008 01:06 pm (UTC)
Link
Why is this?
On one side of the coin you get the benefit that you always have updated information on the tables so you can get an execution path which is more accurate.
The other side of the coin would say this is not changing all that quickly, so its a matter of how long to cache.
One bit of truth is that the MySQL optimizer is just simple enough that the cost is not extreme.
If you could get the query after it has been parsed, and then rewritten... that would be a good balance.
Reply | Parent | Thread
Curious
from:
tcepsa
date: Jun. 10th, 2008 03:35 pm (UTC)
Link
Reply | Thread
Re: Curious
from:
krow
date: Jun. 10th, 2008 06:17 pm (UTC)
Link
Would this solve it? Not really, but solving it could be done by just altering the protocol.
Reply | Parent | Thread
can't toss them out
from:
kostja_osipov
date: Jun. 25th, 2008 08:37 pm (UTC)
Link
Would be nice to toss them out completely, but we need to keep the technology, since the a stored procedure or a function is in a way just a collection of prepared statements.
So it seems we're stuck with having some prepared statements, at least internally, for some time.
And BTW, they do get better overall, one gotcha or another gets fixed once a few months.
Reply | Thread
Re: can't toss them out
from:
krow
date: Jun. 25th, 2008 09:07 pm (UTC)
Link
Currently I am fond of shipping back the byte code and letting the clients send it with each query. That way any proxy/whatever can inspect and make of the information. I and Monty went over this on the phone a few weeks ago and he could not find immediate issue with it.
Just need a little bit of time to code up an example and see how well it works.
Reply | Parent | Thread