'basename' php method for non-english characters

system · November 6, 2016, 7:55am

shachar058 wrote on Sunday, November 06, 2016:

Hi all,
The ‘basename’ php method is used all over the project inorder to get the filename from a given path. The problem with this is, that if the filename is in non-english (e.g. hebrew) letters, the method erases these letters and returns empty. My solution for this was substituting this standard method with a customized method:

function basename_nonenglish($path){
$parts = preg_split(’~[\\/]~’, $path);
foreach ($parts as $key => $value){
$encoded = urlencode($value);
$parts[$key] = $encoded;
}
$encoded_path = implode("/", $parts);
$encoded_file_name = basename($encoded_path);
$decoded_file_name = urldecode($encoded_file_name);
return $decoded_file_name;
}

What are your thoughts on this?
Thanks,
shachar

system · November 12, 2016, 10:21am

bradymiller wrote on Saturday, November 12, 2016:

Hi Shachar,

Thanks for taking this issue on. Note there are 2 more somewhat related issues:

When importing documents, non-standard characters are converted to _ .
We use a function to prevent directory walking that is US centric.
openemr/sanitize.inc.php at master · openemr/openemr · GitHub

For your above function, will it work on both Linux and Windows?

Also, lets give it a better name (maybe basename_international or something like that) and place it in the library/sanitize.inc.php library (this library is included in every script via globals.php).

thanks,
-brady
OpenEMR

system · November 14, 2016, 10:11am

shachar058 wrote on Monday, November 14, 2016:

Hi!
Several questions:

Is #1 a current issue/bug that has to be taken care of? Or is it working functionallity that you want to be sure isn’t wrecked by my addition?
Can you elaborate more on the ‘directory walking’ issue?
My function will work both on Linux and Windows because both of them accept forward slash as directory seperator
Ok will give that name and put in requested file. Is this file automatically loaded or will I have to include it in files that want to use it?

Thank you!
shachar

system · November 14, 2016, 10:40am

bradymiller wrote on Monday, November 14, 2016:

Hi Shacar,

The library/sanitize.inc.php script is automatically loaded, so won’t need to include it.

To explain the other issues, which will likely cause your group problems down the road now that you are using document names with non-latin characters.
Currently document names when imported are converted(all non latin characters are converted to _) via this function:

github.com

openemr/openemr/blob/master/library/sanitize.inc.php#L31


    die(xlt("ERROR: The following variable contains invalid characters").": ". attr($label));
}
}


// Convert all illegal characters to _
function convert_safe_file_dir_name($label)
{
return preg_replace('/[^A-Za-z0-9_.-]/', '_', $label);
}


//Basename functionality for nonenglish languages (without this, basename function omits nonenglish characters).
function basename_international($path)
{
$parts = preg_split('~[\\\\/]~', $path);
foreach ($parts as $key => $value) {
    $encoded = urlencode($value);
    $parts[$key] = $encoded;
}


$encoded_path = implode("/", $parts);
$encoded_file_name = basename($encoded_path);

And currently when processing documents names in several places, the following function is used(to prevent directory walking which throws an error when non-latin characters are used):

github.com

openemr/openemr/blob/master/library/sanitize.inc.php#L25




// If the label contains any illegal characters, then the script will die.
function check_file_dir_name($label)
{
if (empty($label) || preg_match('/[^A-Za-z0-9_.-]/', $label)) {
    error_log("ERROR: The following variable contains invalid characters:" . $label);
    die(xlt("ERROR: The following variable contains invalid characters").": ". attr($label));
}
}


// Convert all illegal characters to _
function convert_safe_file_dir_name($label)
{
return preg_replace('/[^A-Za-z0-9_.-]/', '_', $label);
}


//Basename functionality for nonenglish languages (without this, basename function omits nonenglish characters).
function basename_international($path)
{
$parts = preg_split('~[\\\\/]~', $path);
foreach ($parts as $key => $value) {

As an aside, an example of directory walking is the following. If you have a script that brings in a file name. However, a malicious user may use the following for a file name “…/…/…/secret_files/passwords” etc. This is what the above check_file_dir_name() function is used to prevent.

-brady
OpenEMR

system · November 20, 2016, 8:52am

shachar058 wrote on Sunday, November 20, 2016:

Hey Brady,

Ok so fixed the method name and placed in the sanitize file (pushed it to pull request).
As for the other functions (which interpret all non-latin to underscore) how would you suggest approaching this? Should I change the regex to a pattern that will accept all letters except for certain problematic characters? If so, which characters should these be?

Thanks again!
shachar

system · November 20, 2016, 9:49am

bradymiller wrote on Sunday, November 20, 2016:

Hi,

Looks good and tests well, so brought it into the codebase.

Regarding the other functions, using whitelisting(ie. only allow certain characters) is the most secure method, but I don’t see how it will be possible to do this with the huge number of international characters.

This code from another project appears to do a good job blacklisting bad stuff, so could use that:

github.com

vito/chyrp/blob/35c646dda657300b345a233ab10eaca7ccd4ec10/includes/helpers.php#L516-L518


$strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
               "}", "\\", "|", ";", ":", "\"", "'", "&#8216;", "&#8217;", "&#8220;", "&#8221;", "&#8211;", "&#8212;",
               "—", "–", ",", "<", ".", ">", "/", "?");

For the functions in question, would add a ‘anal’ parameter to them (as the above code does in the function) that defaults to false, and does do the whitelisting also (as the function lists above does); then there is the option to use this when we are certain there are no international characters involved.

thanks,
-brady
OpenEMR