UTF8, Do we really need anything else?

Jun. 24th, 2008 | 09:44 pm

I was working on a problem today and I asked myself a question that I keep finding myself coming back to.

Keeping this in the context of the web: do we really need to concern ourselves with supporting anything other than UTF8?

For example, here is a partial list of the character sets MySQL supports:
http://dev.mysql.com/doc/refman/5.0/en/charset-mysql.html

What do I find when I visit web shops?

The ones who either do not know anything about character sets, or do not care, just stick with whatever default MySQL shipped with.

Otherwise? It is UTF8.

Unless, of course, they did not know and it turned out that it did matter. Those folks are typically sweating out a future where they know they will be altering all of their tables.

The question remains, though: does it really matter? How about UCS2?

Comments {8}

(no subject)

from: jamesd
date: Jun. 25th, 2008 09:00 am (UTC)

As you say, most places just ignore it but I don't expect large enterprises to be as willing to. "Can you store the names and addresses of our customers?" sounds like a fine checklist item. "In one table?" could be another.

Web service providers might want to do things like storing the names associated with email addresses somewhere when selling email worldwide.

Sure, the answer today is to use binary, forget collations, and handle it yourself. Or "use English". We don't even support UTF-8 fully, since we stop at three bytes.

How do we expect a web services provider to target an enterprise that wants to store the names and addresses of customers in China, Taiwan, Japan, Korea, Russia, Saudi Arabia and England in one place?

I still expect most web places to ignore this and use utf-8 with only three bytes and a latin collation.
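
To make the three-byte limit concrete, here is a minimal Python sketch. It is not MySQL code, and the function names are hypothetical; it only shows how many bytes UTF-8 needs per code point and checks whether a string would fit in an encoding capped at three bytes per character, which is the case for anything at or below U+FFFF.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single code point."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    """True if every code point is at or below U+FFFF,
    so that no character needs a four-byte UTF-8 sequence."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    # "\U00020000" is a CJK ideograph outside the Basic Multilingual Plane.
    for ch in ["A", "é", "日", "\U00020000"]:
        print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")
    print(fits_three_byte_utf8("日本語"))       # common Japanese text
    print(fits_three_byte_utf8("\U00020000"))  # needs four bytes
```

Most everyday Chinese, Japanese, Korean, Russian and Arabic text stays within three bytes; the characters that do not are rarer ideographs and symbols, which is why the gap is easy to miss until a name or address hits it.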

Brian "Krow" Aker

(no subject)

from: krow
date: Jun. 25th, 2008 04:40 pm (UTC)

Bar is working on the four-byte support for UTF8, and I have faith in him getting it done.

As far as the other character sets go... I suspect we shall see them slip into oblivion.

Brett Morgan

(no subject)

from: domesticmouse
date: Jun. 25th, 2008 11:30 am (UTC)

About the only area that seems to do something other than ASCII and isn't that UTF8-friendly is Japan. They do an awful lot of Shift JIS and other strangeness. But then, they tend to know about encoding issues. For some strange reason. =)

Brian "Krow" Aker

(no subject)

from: krow
date: Jun. 25th, 2008 04:10 pm (UTC)

I spoke to one of the Mixi.jp guys and according to them UTF8 is just fine. I have seen the Shift JIS stuff... but never outside of the web market.

Brett Morgan

(no subject)

from: domesticmouse
date: Jun. 25th, 2008 10:16 pm (UTC)

Apparently a lot of the older phone handsets talk Shift JIS, not UTF8. Which is a pain. =)

Arjen Lentz

depends on MySQL support/implementation

from: arjen_lentz
date: Jun. 26th, 2008 02:22 am (UTC)

I think the answer to this pondering somewhat depends on MySQL.
Right now, any UTF8 field uses 3 x N bytes in temp tables, during sorts, etc. That's a resource hog (memory, CPU, and potentially even ending up with tmp tables and sorts on disk).
Even with this, MySQL does not support full UTF8 (as Wikipedia knows and Domas reminds us of frequently).
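
As a rough illustration of the 3 x N point, the sketch below computes worst-case buffer sizes for a column of N characters. It assumes, as described above, that space is reserved at the maximum bytes per character for the charset; the table of widths and the function name are hypothetical, not anything MySQL exposes.

```python
# Maximum bytes per character. "utf8" here means the three-byte
# variant discussed in this thread (an assumption of this sketch).
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}


def worst_case_bytes(char_length: int, charset: str) -> int:
    """Bytes reserved for one value if every character is
    allocated at the charset's maximum width."""
    return char_length * MAX_BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    n = 255  # e.g. a 255-character column
    for charset in ("latin1", "utf8"):
        print(charset, worst_case_bytes(n, charset))
```

For a 255-character column this gives 255 bytes under latin1 and 765 bytes under three-byte utf8 per value, regardless of what the data actually contains.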

UTF8 support in PHP 5.x is not perfect.

There are apps out there for which plain latin1 is fine. With the above in mind, sometimes latin1 is actually a better choice. If MySQL became more efficient, and with PHP 6 sorting out the UTF8 factor there, the choice could indeed be to just always go UTF8.

Edited at 2008-06-26 02:23 am (UTC)

UTF-16

from: burtonator
date: Jul. 7th, 2008 06:26 pm (UTC)

I've been meaning to use UTF-16 for storing multi-byte languages in Spinn3r.

Right now we just use compressed UTF-8 for everything.... For the multi-byte stuff we could save on storage capacity by using UTF-16.
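
A quick way to see the size difference is to encode the same text both ways. This is only a sketch with a hypothetical helper name, and it measures uncompressed sizes; compression, as used above, can change the comparison considerably.

```python
def encoded_sizes(text: str) -> dict:
    """Uncompressed byte counts for the same text in UTF-8 and UTF-16.
    'utf-16-le' is used so that no byte-order mark is counted."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    print(encoded_sizes("日本語"))  # CJK text: 3 bytes each in UTF-8, 2 in UTF-16
    print(encoded_sizes("hello"))   # ASCII text: 1 byte each in UTF-8, 2 in UTF-16
```

The trade-off runs in both directions: CJK-heavy text shrinks by about a third in UTF-16, while ASCII-heavy text doubles.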

That said, we use the JDBC driver for the encoding so it's not really part of the DB.

MySQL sees all our strings as latin1....

Kevin
