UTF-8 conversion

bradymiller wrote on Saturday, May 16, 2009:

hey,

I’ve converted all the html headers and mysql connections I could find to UTF-8 in openemr, php-gacl, and postnuke (calendar).  phpmyadmin was automatically ready with the new version we got.  I list the files that required modification below.  There is definitely still more sleuthing to do to ensure no other html headers creep in and that no mysql queries are made from other places (the dutch group stuff especially).  I also put a flag in library/sqlconf.php for those who want to disable it; however, there should be no need for this (this stuff is fully compatible with utf-8 and latin1 databases).

Now it’s time to start thinking about how to get the mbstring ( http://us.php.net/mbstring ) stuff working; these are php’s functions for dealing with multibyte strings (i.e. so the trim function removes a full character and not just one byte, which would corrupt multi-byte characters such as chinese).  Php’s mbstring has an overload thing-ma-jig ( http://us.php.net/manual/en/mbstring.overload.php ) in php.ini, so it looks like it can possibly be set to automatically use the multibyte functions instead of the non-multibyte functions.  So, this shouldn’t involve any changes in source code; there will just be php settings to figure out (we could also replace all functions in the source with their multibyte equivalents, but I’m guessing this would break users that don’t have php mbstring installed).  Some older php’s will need this mbstring stuff installed, but it seems like all the newer versions come with it.

Possibly php.ini mbstring stuff with something like (I’m JUST GUESSING for now):
mbstring.func_overload       = 7
mbstring.language        = Neutral    ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding    = UTF-8    ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On    ;  HTTP input encoding translation is enabled
mbstring.http_input        = auto    ; Set HTTP input character set detection to auto
mbstring.http_output        = UTF-8    ; Set HTTP output encoding to UTF-8
mbstring.detect_order        = auto    ; Set default character encoding detection order to auto
mbstring.substitute_character    = none    ; Do not print invalid characters
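
To make the byte-versus-character concern concrete, here is a minimal sketch. It is written in Python purely for illustration (Python bytes stand in for PHP's byte-oriented strings); the sample text and the helper name is_valid_utf8 are made up for this example. In PHP the analogous pair would be substr() and mb_substr().

```python
# Illustration only: Python bytes stand in for PHP's byte-oriented strings.
text = "中文字符"                  # four Chinese characters (sample text)
raw = text.encode("utf-8")        # three bytes per character in UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte-oriented cut: takes the first character plus one stray byte
# of the second character, leaving a corrupt sequence.
byte_cut = raw[:4]

# Character-oriented cut: takes two whole characters.
char_cut = text[:2].encode("utf-8")

print(len(text), len(raw))        # 4 12
print(is_valid_utf8(byte_cut))    # False
print(is_valid_utf8(char_cut))    # True
```

Only operations that count or cut by position are at risk in this way; simply storing and echoing the bytes unchanged is not.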

Rod, have you looked into this mbstring stuff at all on your related Armenian work?

Rod, how embedded is this dutch group stuff? Do you think it can all be removed safely? This would make me feel more comfortable that all mysql calls are going through sql.inc.

thanks,
brady

I put the specific files that got modified for above utf-8 stuff below:
openemr:
html header: openemr/interface/globals.php
mysql connection: openemr/library/sql.inc
phpgacl:
html header: openemr/gacl/admin/templates/phpgacl/header.tpl
mysql connection: openemr/gacl/gacl.class.php
postnuke:
html header: not found, doesn’t appear to be set anywhere (so should default to openemr).
mysql connection: openemr/interface/main/calendar/config.php , openemr/interface/main/calendar/includes/pnAPI.php

sunsetsystems wrote on Sunday, May 17, 2009:

There’s good information about mbstring here: http://us2.php.net/manual/zh/ref.mbstring.php

Here’s my initial, and perhaps naive, take on it.

OpenEMR doesn’t do much “processing” of user-supplied text.  It’s mostly just providing form fields to enter blobs of text, saving those same blobs in the database, and displaying them in HTML.  The trim() function in particular is used mostly to remove starting and ending spaces, and the space is not a multibyte character (but perhaps the last byte of some multibyte characters might look like a space…? not sure.).  So I think most of OpenEMR doesn’t care how many bytes it takes to encode a character entered by the user.
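
On the question of whether the last byte of a multibyte character could look like a space: in UTF-8 specifically it cannot, because every byte of a multibyte sequence has its high bit set (0x80 or above), while the space is 0x20. The following is a rough check, again in Python for illustration only; the function name is invented for this sketch. The UTF-16 contrast at the end shows that the guarantee belongs to UTF-8 and not to multibyte encodings in general.

```python
def utf8_multibyte_has_ascii_byte() -> bool:
    """Return True if any multibyte UTF-8 sequence contains a byte below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:      # surrogates cannot be encoded in UTF-8
            continue
        if min(chr(cp).encode("utf-8")) < 0x80:
            return True
    return False


print(utf8_multibyte_has_ascii_byte())  # False

# Contrast: in UTF-16 the dagger character U+2020 is made of two 0x20 bytes,
# so a byte-level trim of spaces would remove the whole character.
dagger_utf16 = "\u2020".encode("utf-16-le")
print(dagger_utf16.strip(b" "))         # b''
```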

There are surely exceptions, but I think those need to be looked at on a case-by-case basis, not depending on any wholesale conversion or overloading of all string functions.

Re the Dutch stuff, I think most of it can be removed easily enough, though some areas may need a bit of analysis.

In any case, I’m not sure UTF-8 encoding should be our default.  At least not until we are much more confident that we are handling all the details properly.

Rod
www.sunsetsystems.com

sunsetsystems wrote on Monday, May 18, 2009:

By the way I found this assertion at http://osdir.com/ml/php.internationalization/2003-05/msg00010.html : "the default behaviour of trim() is *known* to be multibyte safe".

I assume this "default behavior" refers to the trimming of (single-byte) white space.  In any case, there is no mb_trim() function, which suggests that the mbstring developers were not very worried about it.
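
The "default" qualifier matters. A sketch of the difference, in Python for illustration only, with bytes.strip standing in for a byte-oriented trim: the default list below mirrors the whitespace bytes PHP's trim() removes when no character list is given, and the sample string is made up. Passing a multibyte character as the list of characters to strip is treated as a set of individual bytes, which can cut into a neighbouring character.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


DEFAULT_WHITESPACE = b" \t\n\r\x00\x0b"   # single-byte characters only

# Default trim: only ASCII whitespace bytes are removed, so UTF-8 stays intact.
padded = "  àbcé  ".encode("utf-8")
default_trim = padded.strip(DEFAULT_WHITESPACE)
print(is_valid_utf8(default_trim))        # True

# Custom list containing a multibyte character: "é" is the bytes C3 A9,
# and the leading C3 of "à" (C3 A0) is stripped as well, leaving a stray A0.
word = "àbcé".encode("utf-8")
custom_trim = word.strip("é".encode("utf-8"))
print(custom_trim)                        # b'\xa0bc'
print(is_valid_utf8(custom_trim))         # False
```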

Rod
www.sunsetsystems.com