Collation question

petersmm · August 17, 2021, 2:21am

Situation
I have a database with standard US character sets and patient names. We have not until this point usefully used Unicode to store any foreign characters. Current database is from version 5.0.8 with collation latin1_swedish_ci. Does changing directly to utf8mb4_general_ci cause any loss of data. I have no reason to believe we have used anything but really standard US ascii chars(a-z, 1-9, and other english normative chars) in our data. If NO for changing to utf8mb4_general_ci can we change to utf8_general_ci? Why or why not?

My understanding is that going to a more advanced collation does not affect your data, but moving a less advanced (or notably different) collation will cause data loss, is this notion correct?

Thank you.

brady.miller · August 17, 2021, 4:02am

hi @petersmm ,

Check out this documentation on going from latin1_swedish_ci to utf8:
https://www.open-emr.org/wiki/index.php/OpenEMR_UTF-8_Upgrade_Howto
(number 2 will pertain to you since you are just using standard US characters, so you should be fine on straight conversion; would go straight to utf8mb4 encoding (and utf8mb4_general_ci collation) and would upgrade your database to 6.0 before doing that)

btw, the kinds of characters that will get mangled on latin1 to utf8 are non standard US characters, such as Spanish names that have accent marks above them, for example.

can see example code here for changing encoding/collation of a database:
openemr-devops/devtoolsLibrary.source at master · openemr/openemr-devops · GitHub
(also ensure the database itself is changed(not sure if code above does that) so that tables added in future upgrades use the correct encoding/collation)

brady.miller · August 17, 2021, 4:04am

btw, if you plan to stay in OpenEMR version 5, then just go to utf8 (utf8mb4 is not supported in version 5).

And also make sure you set the correct flags in sites/default/sqlconf.php .

petersmm · August 17, 2021, 9:03pm

IIRC in version 3, I did that utf8 conversion then. $disable_utf8_flag = false; is the value you are referring to? I read the documentation on that file. I am assuming it should stay false because we are using utf8, correct? Also, to iterate – assuming I am on utf8 charset defaults, already (and have been): collate to utf8_general_ci and then upgrade to 6.0 and then collate to utf8mb4_general_ci for posterity? Do I have it right?

petersmm · August 19, 2021, 10:19pm

Also,

A follow-up: After running collation, are things like this normal? It would seem collation normalizes data where possible, so my own instincts tell me yes.

brady.miller · August 20, 2021, 6:16am

hi @petersmm ,
You are correct on the $disable_utf8_flag setting (I am assuming this is OpenEMR 5 and 6; if it is 6, then there is a new flag/setting to set there).
That registered email null and Registered id with all those boxes looks really odd. Would be helpful to know what is showing up in those fields in the database before and after running the utf8 encoding/collation conversion (in the product_registration sql table)