MySQL character set and collation

sunsetsystems wrote on Friday, May 01, 2009:

I’m planning to remove the current mess of CHARACTER SET and COLLATE clauses from sql/database.sql.  These seem unnecessary because they can be specified at the database level – and furthermore COLLATE may need to vary depending on the specific language (to get language-dependent sorting right).

A companion change will be to add these clauses to the "create database" statement in setup.php.

If anyone can think of a reason I should not do this, please speak up now!

Rod
www.sunsetsystems.com

bradymiller wrote on Friday, May 01, 2009:

This will have direct effects on the translation project.

Right now the majority of database is forced to utf-8 encoding with collation of characters strings by utf-8-unicode (values get default collation of utf-8-bin), along with an annoying amount of latin1 mixed in.

Forcing complete UTF-8 encoding in database is a good idea since most users have no clue what this is until it’s too late(is a real pain if a user realizes they need to migrate from latin1 to utf-8; lots of horror stories online).  If possible, agree would be cleaner to set the default UTF-8 encoding during the database creation (but is this possible, and what happens to users who create the shell database themselves?) instead of placing into database.sql.  I suppose could leave the collater up to the database but UTF8-unicode seems to be a good broad shotgun approach collator (again the concept of encoding/collation is not within the grasp of most users until its too late).

Once the entire database is forced to UTF-8 encoding we are then one step closer to having functional chinese/russian (there is a demand for this) Wouldstill need read/write utf-8 mysql command in library/translator.inc, ensure proper meta header tags on all the pages, and make the phpgacl utf-8 complaint, which will involve making another dumpfile and shortening some key).

What exactly are you planning to do?

-brady

sunsetsystems wrote on Friday, May 01, 2009:

My client wants Armenian language support.

Rod
www.sunsetsystems.com

sunsetsystems wrote on Friday, May 01, 2009:

If the user creates their own database, they will be responsible for setting its default character set and collation.

By the way individual tables and columns can still override the database-level default.

Rod
www.sunsetsystems.com

bradymiller wrote on Friday, May 01, 2009:

hey,

Won’t Armenian require full UTF-8 compliance? Meaning all mysql read/writes to mysql will need to be in utf-8 and all meta page headers with utf-8.  Also, if they want to use their characters for user names thenwill need to convert php-gacl database to utf-8 compliance.  This is good news that your working on this.

It’s funny, I was planning to go opposite direction (finish forcing everything at table level to to the utf-8 with the unicode vs bin collation).  We could compromise and force encoding of UTF-8 at the individual table level. Then it will default to utf-8_general collation unless the user specifies something else in their mysql.  This would nicely ensure everybody gets UTF-8 encoding and give flexibility of the collation (if my logic is correct).  This also wouldn’t leave out the people that create their own database shell.

-brady

sunsetsystems wrote on Friday, May 01, 2009:

Yep, Armenian needs utf-8.  Meta tags and perhaps also HTTP headers will need to tell the browser about it.  As far as MySQL reads/writes, I think “SET NAMES ‘utf8’” will take care of telling MySQL that character strings are in utf8, but let me know if you see a problem with that.  These are all things I plan to deal with.

Again, the database level specification will serve as defaults for the table level, so you may not have to do much there.  I’m not sure why utf8_bin collation is all over the place in database.sql, but I was going to put utf8_unicode_ci (needed for Armenian) in at the database level and let everything default to it, and see what happens.  Other languages will need other collations so I’ll figure out a way to make it selectable in setup.php.

I wasn’t aware of the phpgacl problem… will have to figure that out too.

Any further insight is appreciated – thanks!

Rod
www.sunsetsystems.com

bradymiller wrote on Friday, May 01, 2009:

hey,

This is too cool. Figured it would take me like a year to begin to implement this with my limited time.

The current utf8-bin/utf8-unicode separation is sort of clever but is a huge pain to maintain.  Agree with getting rid of the bin part(they are using this for numbers, since collation is much simpler), then can get rid of all of that set utf8 collate crap.  Even though your gonna force the database to UTF-8, you may still want to consider adding this to end of each table (Then your in effect forcing every install to UTF-8 even if make database shell first; I think this will be important since your gonna now be forcing read/writing of UTF8 to the mysql database in the new installs):
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Plus unicode seems like it pretty much works for everything.  Could always make that the default for the database on install.  Then the unusual user who wants something else can always directly change it through mysql (i think this is possible).  Then you avoid confusion during installation (most people are clueless to this stuff; it took me several weeks to develop any appreciation for this stuff).

The real key is removing any lingering latin1 encoding in the new installs or the users database will degrade.

I’m think a SET NAMES UTF8; will work.

Regarding the pages UTF8 header meta-tags.  These seem to be intermittently interspersed throughout. Firefox goes to UTF-8 automatically in OpenEMR (this is what is happening in the CVS demo), however all the other browsers (IE, google chrom, opera, safari) don’t go into UTF8 by default and special charcaters get jargoned up.

Regarding php-gacl.  Bascially there are about 4 lines that create keys that are too large (UTF takes up more bytes), which can easily be corrected if the dumpfile were of the normal text variety (google utf8 and phpgacl and you will see several projects that have explained the simple changes that need to be made).  The problem is the dumpfile is in an xml schema which doesn’t allow inserting limits to the key sizes.  My plan was gonna be to create the database from the current xml schema file, and then dump it into a text sql file, then make the needed changes, and then should be able to substitute this file for the current xml schema file.

I’m gonna send you an invite to the translation table on google docs.  Hopefully you can get your client to translate their Armenian here (very easy and get almost instant gratification in the cvs demo).

-brady

sunsetsystems wrote on Friday, May 01, 2009:

Looks like you can modify schema.xml directly to shorten the offending fields, before installing phpgacl.  I’ll play with it.

Thanks for the translation invite and info!

Rod
www.sunsetsystems.com

bradymiller wrote on Saturday, May 02, 2009:

hey,
After some thought, I realize how it’s much cleaner to just make the utf-8 stuff default in the database and not the tables or columns. We can easily warn the users in the setup.php script who chose to create their own database to use UTF-8.
This is very exciting.
-brady

sunsetsystems wrote on Saturday, May 02, 2009:

I have checked in a modified schema.xml to fix the phpgacl/utf8 problem.

Rod
www.sunsetsystems.com

bradymiller wrote on Saturday, May 02, 2009:

lol, I was trying to limit the actual key sizes without modifying the actual data columns sizes.  Didn’t even think to go route of modifying the actual data columns sizes, i’m feeling sort of like an idiot. Very nice indeed, and it worked in a UTF-8 database; let me know when you clean up the database.sql file so i can test it out.
-brady

sunsetsystems wrote on Saturday, May 02, 2009:

Thanks!  I have committed changes to database.sql and setup.php.  The latter will probably want some further refinement or documentation.

In globals.php you should also enable $GLOBALS[‘use_set_names_utf8’]; also try adding this near the end:

header("Content-type: text/html; charset=utf-8");

which I think may resolve the character set handling with various browsers, but will need some refinement to avoid breaking things like downloads and PDF output.  I’ll be working on that.

Rod
www.sunsetsystems.com

sunsetsystems wrote on Saturday, May 02, 2009:

Ooooh, do this in globals.php (near the front) instead of that "header" statement:

ini_set(‘default_charset’, ‘utf-8’);

I think that breaks nothing!

Rod
www.sunsetsystems.com

bradymiller wrote on Saturday, May 02, 2009:

Wow, I have commented out all the use_set_names_utf8 stuff (in globals and translation.inc), and placed the ini_set(‘default_charset’, ‘utf-8’); command at top of globals.php and it seems to work like magic. I just did a poor mans test (put chinese characters into a patient note, and it displayed the note back in chinese characters).  WOW, can it really be that easy. I’m blown away here.
-brady

bradymiller wrote on Saturday, May 02, 2009:

hey, even better.
1) Created the first openemr patient in chinese characters.
1) Created the first user/password in chinese characters.

To avoid confusion, should we just delete the set_names_utf8 stuff (only three lintes in codebase)

-brady

sunsetsystems wrote on Saturday, May 02, 2009:

Hmm, I think you’d still need use_set_names_utf8.  That forces communication with MySQL to use utf8.  The ini_set call concerns communication with the browser.

Perhaps you already have MySQL communication in utf8 by some other means?

Rod
www.sunsetsystems.com

bradymiller wrote on Saturday, May 02, 2009:

Your right. Doesn’t make sense.

I’ll get setup.php to write use_set_names_utf8 to true in globals.php (if $collate has a value).

-brady

bradymiller wrote on Saturday, May 02, 2009:

hey,
Check it out, setup.php now will set the use_set_names_utf8 setting in globals.php
-brady