Prepared Statements, Musings

Jun. 9th, 2008 | 03:45 pm

In the back of my mind for the last couple of weeks have been some musings on prepared statements.

So I walk into this customer's office and they don't speak English.

I only speak a few words of their language.

It goes like this:

"Blah, blah, blah, Prepared Statements, blah, blah..."

I say "Turn them off, they are what is crashing your application."

The end customer is confused because the translator had not finished before I gave the answer.

So I explain the problem... the customer has a lot of connections to the database... a lot... and the database is running out of memory and crashing. Was this the customer's problem? Yes it was.

This is a practiced conversation for me. I could do a song and dance on "well you should be building your application like..." but the customer wants to build their application their way. They want a lot of SQL to exist that they are going to generate on the fly. The shortest answer is "turn them off".

So why are prepared statements a problem?

Because users do not clean up/close unused prepared statements. Multiply the number of prepared statements times the number of open connections and you begin to see the problem.
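The arithmetic is what kills you. A back-of-envelope sketch (the per-statement cost below is an assumed figure for illustration, not a measured MySQL number):

```python
# Hypothetical sizing: 2000 persistent connections, each holding 50
# prepared statements it never closes, at an assumed 16 KB apiece.
connections = 2000
stmts_per_conn = 50
bytes_per_stmt = 16 * 1024  # assumption; real cost varies by statement

total = connections * stmts_per_conn * bytes_per_stmt
print(total // (1024 * 1024), "MB")  # 1562 MB of server memory, gone
```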

What do you do about this?

Turn them off. The Java driver and a few other drivers turn them off automatically.
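For example, with MySQL Connector/J the `useServerPrepStmts` connection property controls this: when it is off, the driver emulates prepared statements client-side and nothing is pinned in server memory. The host and database names below are placeholders:

```
jdbc:mysql://dbhost/appdb?useServerPrepStmts=false
```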

Why?

Because unless you really happen to know what you are doing with them, you are going to crash the database (somewhere in the back of my mind is "this is why we moved to garbage collected languages").

I'm not going to defend the design.

Are they faster?

Well "yes" if you have blobs, otherwise... still "yes" for most cases.

Each connection thread in MySQL keeps its own copy. There is no global pool of prepared statements on the server.

And if there was a global pool?

We would need a bunch of read/write locks. Locks are a real problem right now across the industry. I do not believe that any application being built today which has to share data/state has this solved.

Locks harm scalability. Sure... you could keep a cache of them local to the connection that reads from a central cache, but this would just be pushing the problem out with caching (which is the current trend in scaling workarounds). The problem is that we see MySQL setups with thousands of connections, so even this solution is going to chew up a lot of memory.
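To see why, sketch the hypothetical global pool: every statement lookup from every connection thread funnels through one lock (Python stand-in; the pool and names are invented for illustration):

```python
import threading

class GlobalStatementPool:
    """Hypothetical server-wide PS cache shared by all connection threads."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}
        self.parses = 0  # times we actually had to parse

    def get(self, sql):
        with self._lock:  # the serialization point every thread hits
            if sql not in self._cache:
                self.parses += 1
                self._cache[sql] = ("parsed", sql)  # stand-in for a parse tree
            return self._cache[sql]

pool = GlobalStatementPool()

def worker():
    for _ in range(1000):
        pool.get("SELECT * FROM t WHERE id = ?")

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(pool.parses)  # 1 -- one parse shared by all, but 8000 lock acquisitions
```

The memory win is real, but so is the contention: the per-connection alternative drops the lock at the price of duplicating the cache N times, which is exactly the trade-off described above.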

So what am I thinking?

Ditch the current design completely. Push the parser out to the client, and let it send a stripped-down bytecode language which can be de-serialized back into something we can execute on.

Just skip the entire "let's put this in the server" approach for prepared statements. No cache needed, and you push processing off to the client where it belongs. Will it require a new extension to the protocol? Yes.

The client library would also get a bit bigger, but that does not concern me as much. You can pull sql_yacc.yy and a few functions into a library and get away from the problem of supporting two parsers.

To someone writing against the database there would be no change, dynamic SQL would continue to work as is. The difference is all in what the client is doing internally, and what it is shipping over the wire.
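The shape of that wire flow, as a toy (the opcode, the framing, and every name here are invented for illustration; a real design would serialize something much closer to the server's internal parse tree):

```python
import struct

OP_SELECT = 1  # invented opcode

def client_compile(table, column):
    """Client side: 'parse' the SQL and emit a framed bytecode blob.
    A real client would run the full grammar from sql_yacc.yy."""
    t, c = table.encode(), column.encode()
    return struct.pack(">BHH", OP_SELECT, len(t), len(c)) + t + c

def server_decode(frame):
    """Server side: deserialize straight to something executable;
    no SQL parser is ever invoked on the server."""
    op, tlen, clen = struct.unpack_from(">BHH", frame)
    table = frame[5:5 + tlen].decode()
    column = frame[5 + tlen:5 + tlen + clen].decode()
    return {"op": op, "table": table, "column": column}

frame = client_compile("users", "id")
print(server_decode(frame))  # {'op': 1, 'table': 'users', 'column': 'id'}
```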

Heck, it's even modular at this point :)

Just thinking about a solution...


Comments {32}

deja vu

from: jimw
date: Jun. 9th, 2008 11:07 pm (UTC)

that idea actually came up in the sorrento meeting, by antony if i remember correctly.

you'd want some sort of cross-language parser generator, so that we wouldn't have to implement parsers in c, java, c#, python, perl, etc.


Brian "Krow" Aker

Re: deja vu

from: krow
date: Jun. 9th, 2008 11:21 pm (UTC)

The idea has been floating around for a bit... it has just crossed my mind in the last couple of weeks that perhaps there is no solution to fixing the current PS design. Just toss it.


Bytecodes

from: kazuho
date: Jun. 10th, 2008 12:15 am (UTC)

> Push the parser out to the client, and let it send out a stripped down bytecode language

Yeah! I definitely agree that this is the way to go. It would not only improve server performance, but also open the possibility of hand-coding access to storage engines (for better performance than what SQL optimizers can achieve).


Throwing the baby out with the bathwater

from: bkarwin
date: Jun. 10th, 2008 01:20 am (UTC)

This hearkens back to the days when 4GL was a buzzword, and SQL was parsed at compile-time and converted into low-level API for a binary query protocol. Dynamic SQL, sent to the server and parsed and optimized at runtime, was at that time the new-fangled method of running SQL.

Now static SQL is a distant memory, and everyone uses dynamic SQL. Why? Because now the server determines the SQL language support uniformly for all clients. You have greater assurance that the SQL language is identical for all clients connected to a given server, regardless of the client. Think of the tech support nightmare when SQL language support varies by client programming language, client version, and operating system!

Also, you don't have to upgrade your clients when a language bug is fixed. Upgrade the server, and voila! All clients benefit. Deployment of SQL fixes and enhancements is so much easier.

There have to be alternative solutions to the problem you mention, of leaking PS data structures. Set a server cap on the memory allocation to these data structures. Implement an LRU algorithm to toss out unused PS. Make this behavior configurable on the server, enabled by default.

Your reasoning sounds like, "people frequently fill up and fragment their hard disk because they never empty their trash can (a true statement), so we should make sure no one can use a hard disk." WTF?!?


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 10th, 2008 01:26 am (UTC)

You are missing the point. You would still have dynamic SQL in your application, the change would be that instead of sending raw SQL to the server, you could optionally send it pre-parsed into bytecode.

AKA no change to how applications are written today. This is about what is happening under the hood.

An LRU with a global cache still has locks, and MySQL doesn't do the second trick that many prepared statement designs do, which is to store the execution plan. So really the current implementation just gives you a way to send objects without escaping them (and a slightly parsed tree).

BTW I updated the post to clarify the point that dynamic SQL is still available after doing this. I hated embedded SQL.



Edited at 2008-06-10 01:41 am (UTC)


Re: Throwing the baby out with the bathwater

from: bkarwin
date: Jun. 10th, 2008 03:07 pm (UTC)

I do understand the difference between compile-time parsing, runtime parsing in the client, and runtime parsing in the server. I also understand the issue with scalability for a resource that requires locking.

I just can't understand the logical progression from "resource constraint on server" to "move parser to client". Won't that just change the location of the resource constraint problem, instead of solving the problem? Perhaps you have made some mental steps connecting the two but they are still in your head. :-)

I do understand that by moving the PS resource management to the client, it's less likely to crash the server. But if you ever want the server to cache optimization plans, you'd be back at square one, because I can't believe you're suggesting to move optimization to the client as well.

What about setting the resource cap per thread, and using a priority queue to implement an LRU at the thread level? When a PS is used, it moves to the beginning of the priority queue and is assigned the current timestamp. When the PS pool is full, knock off the entry at the end, under the assumption it is the least recently used.

If the PS at the end has a timestamp so recent that you don't feel it's "stale", then either error out the creation of the new PS or else sleep for enough milliseconds.

Maintain the priority queue as a separate data structure from the thread's PS pool, so you don't have to lock the PS pool to change its "order." You could also implement alterations to the priority queue asynchronously, so they don't block continuation of work. You just have to trust that the state of the priority queue stays up to date enough.
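That scheme is easy to sketch; the cap and the staleness threshold below are made-up knobs, and the "parse tree" is a stand-in:

```python
import time
from collections import OrderedDict

class ThreadPSPool:
    """Per-connection-thread PS cache with LRU eviction: no global
    lock, because each thread owns its own pool."""
    def __init__(self, cap=16, min_age=0.0):
        self.cap = cap
        self.min_age = min_age     # refuse to evict entries fresher than this
        self.pool = OrderedDict()  # sql -> (last-used timestamp, parse tree)

    def prepare(self, sql):
        now = time.monotonic()
        if sql in self.pool:
            self.pool.move_to_end(sql)            # mark most recently used
            self.pool[sql] = (now, self.pool[sql][1])
        else:
            if len(self.pool) >= self.cap:
                ts, _ = next(iter(self.pool.values()))
                if now - ts < self.min_age:       # LRU entry is still "fresh"
                    raise RuntimeError("pool full of recently used statements")
                self.pool.popitem(last=False)     # evict least recently used
            self.pool[sql] = (now, ("parsed", sql))
        return self.pool[sql][1]

pool = ThreadPSPool(cap=2)
pool.prepare("SELECT 1")
pool.prepare("SELECT 2")
pool.prepare("SELECT 1")   # touch: "SELECT 2" becomes the LRU entry
pool.prepare("SELECT 3")   # pool is full, so "SELECT 2" is evicted
print(list(pool.pool))     # ['SELECT 1', 'SELECT 3']
```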


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 10th, 2008 06:25 pm (UTC)

One of the tricks to scaling anything is to push processing out to nodes (aka don't write stored procedures, keep your business logic out in your application server where it belongs). The more you can distribute out the better off you are.

Your LRU design is one I have considered. You could cache the original statement in the client and then cause a page fault back to the client if you do not have the statement (the client itself could keep track of PS). This has been my thought for years, but it has never been clear to me that the gain will be great enough to justify the expense.

Removing parsing from the server buys you a smaller response time on the server... which I do find interesting, since the server is the hard part to scale.

Also, pre-parsing would allow you to make much smarter proxy servers since they would have access to the parsed statements.


Re: Throwing the baby out with the bathwater

from: bkarwin
date: Jun. 10th, 2008 06:56 pm (UTC)

I still think the issue that trumps this is that one server will have to respond to a mix of clients, each of which has its own slightly different idea of what language it is parsing.

I understand that converting the query into a binary protocol "should" make this immaterial, but the practical reality is that it will greatly complicate some kinds of troubleshooting. You haven't responded to this concern.

For example, a customer calls tech support: "I have an SQL query that works when I execute it from the 6.0 Java client, but the exact same SQL fails, or gives a different result, or performs much differently, when I execute it from the 6.1 C++ client."

What then?


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 10th, 2008 07:10 pm (UTC)

Java is a pain, mainly since it lacks a clear way to use a MySQL parser (this could be solved by picking something other than Bison, though... generate code to the native language).

Most clients are built on top of libmysql, so I am less worried about each client responding differently. SQL parse errors are still errors.

I suspect most people don't even need to worry about this... aka the performance gain is smaller.

Now! One other way to do this is to have the server send back the parsed lex structure to the client in the prepare state. This would keep the parser in the server, but would leave it up to the client to send the serialized structure.


Re: Throwing the baby out with the bathwater

from: bkarwin
date: Jun. 10th, 2008 07:40 pm (UTC)

Now you may be missing the point. :-)

Suppose for the sake of argument we take Java out of the equation, and all my clients use libmysql. But some of them are running client 6.0.24 and some of them are running client 6.0.30 and some are running 6.1.2-beta.

All of these versions have different bugs, and perhaps whole language features differ as the version bumps. If the client contains the definition of the SQL language, that means your application must use a specific version of the client to work correctly. And all the clients in your deployment environment must have the same version to have any assurance of working the same.

For instance, suppose MySQL 6.1 introduces support for an SQL feature, such as INTERSECT. Client 6.0 doesn't support it, so to use that feature I'd have to upgrade all my clients. Whereas if the server were solely responsible for the SQL support and the client were leaner, then a mixed environment would be more likely to just work.

Likewise more subtle differences like bugs can cause problems as you move more logic into the client.


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 10th, 2008 07:47 pm (UTC)

Unless we freeze the SQL that is spoken :)

This is outside the domain of this topic, but right now MySQL implements far more SQL than is generally needed.

Your INTERSECT gives a good example of why keeping the parser centralized but shipping the bytecode back to the client would be good. The client would just know "this byte array goes with this prepared statement". To it, that would be just a logical packet sent to represent the query.


Re: Throwing the baby out with the bathwater

from: bkarwin
date: Jun. 10th, 2008 10:12 pm (UTC)

Heh. Well, your emoticon notwithstanding, it should be noted that you can't freeze against bugfixes. There will always be genuine defects whose fix requires changing SQL either syntactically or semantically.

"Far more SQL than what is in general needed" may be true, but it is simultaneously true that MySQL lacks some SQL features that are, in general, needed. CHECK constraints are an obvious example.

Storing bytecode for prepared statements in the client doesn't accomplish much. If you parse in the server and store bytecode in the client, you've doubled the bandwidth requirement for a prepare().

The unbounded resource consumption is still an issue, it'll just crash the client before it crashes the server. I mean, wouldn't these chunks of bytecode accumulate as connection-scoped garbage? Persistent connections would accumulate bytecode and eventually that would clobber the client. I understand the general downsides of persistent connections, but people still use them.

So if I understand correctly, you're advocating moving persistent PS resources into the client, because it's preferable for clients to crash instead of servers, when app developers fail to close their PS's tidily.


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 11th, 2008 01:30 am (UTC)

The bandwidth would not double. The initial call for prepare is static, and today the "ok" response wastes a lot of bandwidth. Unless you have some huge SQL statement (like 32K plus... and this is without data), there will be no cost.

The bytecode for the execution will be a fraction of the original statement. Caching these would be trivial in size. The sending of them? Unless you are sending large amounts of data everything would still fit in a packet.

Putting the LRU in the client solves the client dying.

Advantage?

* Smarter proxies
* Less memory in the server
* Faster execution on the server

The client should not leak memory. Well, if it does it means the developer is not calling close on the mysql structure.


Re: Throwing the baby out with the bathwater

from: bkarwin
date: Jun. 11th, 2008 01:50 am (UTC)

"The client should not leak memory. Well, if it does it means the developer is not calling close on the mysql structure."

If developers were to be diligent about calling close on mysql structures, wouldn't that also solve the original issue you described, of servers falling over, _today_, with no changes at all?


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 11th, 2008 02:00 am (UTC)

Developers would have to learn to release the PS on the server as they reused the connections (aka they hold the connections open through multiple client interactions).


Владимир

Re: Throwing the baby out with the bathwater

from: volodymir_k
date: Jun. 11th, 2008 01:20 pm (UTC)

Please also note: clients that cache a plan have no way to know that the server would now prefer a different plan. E.g. after ALTER TABLEs, CREATE INDEXes, etc., clients would continue to send inefficient garbage. Are schema version checks needed?

Here is another story, with security checks:
Client 1: select username, password from users;
Server: OK! your bytes are 78423459124701827489124.
Admin: Hm... REVOKE select (password) on users FROM client1;
Client 2: my bytes are 78423459124701827489124.
Server: OK: scott / tiger.

I'm still not convinced that "global locking is bad". Transaction isolation levels lead to one of two solutions: versioning (the InterBase approach) or locks (Oracle?). (A mix is possible, I know.) Either way the RDBMS must centralize disk access. And the PS feature, how inefficient is it really? 2%?


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 11th, 2008 01:34 pm (UTC)

Any sort of cached plan will have to toss an error if changes have occurred (this is true today... does not matter where the caching happens for this case).

Here is the thing about authentication with databases in the domain where I put most of my energy, aka the web: it's rare to find any large shop that bothers with it at all. Controlling access to the database via firewalls and source code is the common case.

Global locking is very costly for multi-core and especially multi-processor systems. Avoiding all of this is the trick right now.


Re: Throwing the baby out with the bathwater

from: ggiunta
date: Jun. 25th, 2008 10:35 pm (UTC)

Oracle seems to be moving in the opposite direction from what you are proposing here:
- since the db handshake is costly, they have always preached persistent connections even for web apps (at least, that's what I did)
- execution plans, query cursors, etc. are kept in the SGA and shared between sessions, IIRC. This means actually using less memory when multiple connections are open at the same time, if you use PS
- a cap is (or can be) put on all resources to prevent the db being blasted away by a slashdot effect
- to optimize usage of stale persistent connections by idle client processes, they just released database-resident persistent connections. The claim is that scalability is increased many-fold

I dunno how (if) they solved the "global locking is bad" problem, though...


Brian "Krow" Aker

Re: Throwing the baby out with the bathwater

from: krow
date: Jun. 25th, 2008 10:51 pm (UTC)

Oracle has been going that direction for a long time. It really is a different beast than what MySQL and the derived versions are. MySQL is more about enabling lots and lots of connections, while Oracle relies more on the pooling of connections (which keeps down lock contention).


That will be more problematic

from: venudb
date: Jun. 10th, 2008 05:09 am (UTC)

That's exactly what we used to do at our 'last' company and, believe it or not, that ended up causing more problems; finally we ended up moving everything to the server.

But having a parser at the client end gives more controlled options: a few calls become almost no-ops that can be answered directly from the client, and in the case of prepared execution it saves a lot of network bandwidth as well.


Brian "Krow" Aker

Re: That will be more problematic

from: krow
date: Jun. 10th, 2008 07:20 am (UTC)

What sort of problems did you hit?


Владимир

I am shocked.

from: volodymir_k
date: Jun. 10th, 2008 05:12 am (UTC)

I am a developer, and generally we do not write code "for MySQL". The usual approach is "let's have JDBC run with MySQL first, and the customer might change this to MS SQL/Oracle in the future for performance/religious reasons". Yes, we noticed that prepared statements are not much faster (sometimes 1000 times slower) than plain ones, and that makes us wary of MySQL. Somehow other RDBMSs solved the problem with shared data in memory.

BTW, how about caching small results: are they per connection also? E.g. "select * from countries" returns < 2KB; will there be N connections * 2KB of copies?


Brian "Krow" Aker

Re: I am shocked.

from: krow
date: Jun. 10th, 2008 07:19 am (UTC)

I agree on principle that people should not be developing for any particular database.

Other databases created a central cache... this has not been a real problem for them because:

1) They typically run fewer connections.
2) No one has the "many cores" problem solved.

The query cache is global, so small results are global.


awfief

Re: I am shocked.

from: awfief
date: Jun. 11th, 2008 01:40 am (UTC)

Sure, on principle, don't develop to a database....also, all schemas should be normalized, and transactions should be used for perfect data integrity.

In reality, performance is often the #1 motivation, and risk assessment concludes that theory can take a backseat to better performance.

In the same way, prepared statements are often used as "best practice" for security against SQL injections. But they're not so great for performance.


Brian "Krow" Aker

Re: I am shocked.

from: krow
date: Jun. 11th, 2008 01:46 am (UTC)

Prepared statements are typically emulated in the driver (which solves the SQL injection issue). You don't need the server support for that.
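A minimal sketch of that emulation (this is not how libmysql or any real driver implements it; among other simplifications, splitting on `?` would break on placeholders inside string literals):

```python
def escape(value):
    """Quote a single parameter, handling the classic injection characters."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    s = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + s + "'"

def emulate_execute(template, params):
    """Substitute each ? with an escaped parameter; the result is sent
    to the server as ordinary dynamic SQL, so no server-side PS exists."""
    parts = template.split("?")
    if len(parts) - 1 != len(params):
        raise ValueError("placeholder/parameter count mismatch")
    out = parts[0]
    for param, rest in zip(params, parts[1:]):
        out += escape(param) + rest
    return out

print(emulate_execute("SELECT * FROM users WHERE name = ?", ["o'brien"]))
# SELECT * FROM users WHERE name = 'o\'brien'
```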

Multi-statement support should have been done differently. If it had... then this problem would not exist.


awfief

Re: I am shocked.

from: awfief
date: Jun. 11th, 2008 02:01 am (UTC)

I'm not saying that prepared statements inherently add a performance load. Just that with MySQL, they do. And I'm making the overall point that people don't take things into account when they're planning (if they plan). They take the word of "best practice" without looking at all the angles.


Владимир

Re: I am shocked.

from: volodymir_k
date: Jun. 11th, 2008 12:48 pm (UTC)

> But they're not so great for performance.

Interesting that the average job interview (esp. in MS shops) includes this snippet:
Q. Why/when are prepared statements used?
A (in solemn tone). Because The Server can optimize the data access path/plan for often-repeated queries.

That is/was true for DB/2; 10 years ago I saw they had compilers that turned the "embedded SQL" thing into actually-used DB2 bytecode.

So once we had to persuade a customer that "yes, prepared statements are slower than plain ones; we are not doing anything incorrect, that's the reality, please trust us." How things have changed.


Brian "Krow" Aker

Re: I am shocked.

from: krow
date: Jun. 11th, 2008 01:06 pm (UTC)

MySQL is a bit different. Most databases store the execution plan for prepared statements, MySQL does not.

Why is this?

On one side of the coin you get the benefit that you always have updated information on the tables so you can get an execution path which is more accurate.

The other side of the coin would say this is not changing all that quickly, so it's a matter of how long to cache.

One bit of truth is that the MySQL optimizer is just simple enough that the cost is not extreme.

If you could get the query after it has been parsed, and then rewritten... that would be a good balance.


Philomath

Curious

from: tcepsa
date: Jun. 10th, 2008 03:35 pm (UTC)

Wouldn't putting that implementation in the client lend itself to a greater possibility of SQL Injection-type attacks? Or are there other details around that which would prevent end-users from executing arbitrary SQL? Am I still living in a stone age by using prepared statements to protect against that?


Brian "Krow" Aker

Re: Curious

from: krow
date: Jun. 10th, 2008 06:17 pm (UTC)

The way multi-SQL statements are sent is why SQL injection attacks are possible. That should have been done differently in the client in the first place.

Would this solve it? Not really, but solving it could be done by just altering the protocol.


Konstantin Osipov

can't toss them out

from: kostja_osipov
date: Jun. 25th, 2008 08:37 pm (UTC)

Running out of memory is yet another issue we will get fixed over time, just like we got Bug#27430 and others fixed, albeit only recently.

Would be nice to toss them out completely, but we need to keep the technology, since a stored procedure or a function is in a way just a collection of prepared statements.

So it seems we're stuck with having some prepared statements, at least internally, for some time.

And BTW, they do get better overall; one gotcha or another gets fixed every few months.


Brian "Krow" Aker

Re: can't toss them out

from: krow
date: Jun. 25th, 2008 09:07 pm (UTC)

Antony and Eric's design for SP does not have this limitation (aka the need for prep stmt). There are a few ways to skin this cat.

Currently I am fond of shipping back the bytecode and letting the clients send it with each query. That way any proxy/whatever can inspect and make use of the information. Monty and I went over this on the phone a few weeks ago and he could not find immediate issue with it.

Just need a little bit of time to code up an example and see how well it works.
