Log Performance

system · April 9, 2012, 10:09pm

yehster wrote on Monday, April 09, 2012:

Even when no entry is created for the log for “simple selects” as noted in the comments,
every query still loops through about 60 string comparisons of trying to determine the “type of event” based on the table name in the query.

In log.inc around line 610…

    foreach ($tables as $table => $value) {
        if (strpos($truncated_sql, $table) !== FALSE) {
            $event = $value;
             break;
        }
      else if (strpos($truncated_sql, "form_") !== FALSE) {
            $event = "patient-record";
             break;
        }
    }

This is a horrible “linear search” through the 60 or so table definitions which start around line 520 in the same file.

Even simple things like xl which don’t need to be logged still go through this serious set of string operations.

This is “big loop” which gets called large numbers of times on every page is undoubtedly a big factor in scalability/performance of OpenEMR

system · April 9, 2012, 11:17pm

sunsetsystems wrote on Monday, April 09, 2012:

The high level design as well as the implementation of audit logging are both very troublesome. It needs to be redone in a way that does not tie logging so directly to database access and does not require parsing SQL. Lots of code will be affected (though in a straightforward way).

Rod
www.sunsetsystems.com

system · April 9, 2012, 11:49pm

tmccormi wrote on Monday, April 09, 2012:

We need to be careful here. Any significant changes to this area of the code may trigger the need to re-certify that module. Which won’t be free. Audit logging is a very large part of the certification process.

Not saying we should not improve it, but it needs to be able to pass all the exact tests that we passed before, at least.

-Tony

system · April 9, 2012, 11:52pm

yehster wrote on Monday, April 09, 2012:

My feeling is that we should explictly log access events in the .php with high level descriptions, or maybe log at " the page level" access instead of logging every sql query.
(e.g. demographics.php accessed by user on pid at time X:XX)

system · April 10, 2012, 12:03am

tmccormi wrote on Tuesday, April 10, 2012:

You must log every change to the database AND every access to the data. The Audit log must be able to report who changed what and when. Just logging page access is not adequate to pass certification. While it does not require store the exact SQL statement used there are benefits to the model being used which allow you to reverse changes (like accidental deletes) with relative ease.

This is the specific requirement from: http://healthcare.nist.gov/docs/170.302.r_AuditLog_v1.0.pdf

o Record actions related to electronic health information. Information to be recorded includes
the date, time, patient identification, user identification, an indication of which actions
occurred, and by whom to be recorded when electronic health information is created,
modified, accessed, or deleted; and
o Enable a user to generate an audit log based on specific time parameters, and sort entries in
the audit log according to the elements specified in the standard.

-Tony

system · April 10, 2012, 1:17am

sunsetsystems wrote on Tuesday, April 10, 2012:

it needs to be able to pass all the exact tests that we passed before, at least

Tony, where would the details of those be? I.e. to evaluate how much can be changed without expensive re-testing.

Rod
www.sunsetsystems.com

system · April 10, 2012, 2:32am

tmccormi wrote on Tuesday, April 10, 2012:

I just posted the link to the tests that we passed, but ICSA labs as the official recording documents of our test. They (and CMS) are the final judge. I’m just saying unless the development community has the ability to come up with retesting funds if we end up needing them then be careful about redesigns.

On the good side the new 2014 guide lines have some proposed wording about 'Substantially Unchanged" which refers to functionality not code…. but that’s 2014 not Stage one 2011/2012.

-Tony

system · April 10, 2012, 2:00pm

kevmccor wrote on Tuesday, April 10, 2012:

Is there some way to subdivide the $tables into sub arrays, i.e. $tables , then find which subarray the ‘$tables as $table’ belongs to, then loop through the subarray? This could reduce the number of iterations significantly, as long as it is certain everything was properly checked. Just a thought, since I am still a complete noob with most of the OpenEMR code.

system · April 10, 2012, 4:09pm

bradymiller wrote on Tuesday, April 10, 2012:

Hi,

Could improve the performance by stopping logging of the translation engine:
1. Make a sqlStatementNoLog() function that uses the ExecuteNoLog() function and yanks the auditSQLEvent() stuff (from the sqlStatement() function in library/sql.inc).
2. Then use this sqlStatementNoLog instead of sqlStatement here (note all translation functions pass through this function)(library/translation.inc.php):

// Translation function
// This is the translation engine
//  Note that it is recommended to no longer use the mode, prepend, or append
//  parameters, since this is not compatible with the htmlspecialchars() php
//  function.
function xl($constant,$mode='r',$prepend='',$append='') {
  // set language id
  if (!empty($_SESSION['language_choice'])) {
    $lang_id = $_SESSION['language_choice'];
  }
  else {
    $lang_id = 1;
  } 
  if ($lang_id == 1 && !empty($GLOBALS['skip_english_translation'])) {
    // language id = 1, so no need to translate
    $string = $constant;
  }
  else {
    // TRANSLATE
    // first, clean lines
    // convert new lines to spaces and remove windows end of lines
    $patterns = array ('/\n/','/\r/');
    $replace = array (' ','');
    $constant = preg_replace($patterns, $replace, $constant);
    // second, attempt translation
    $sql="SELECT * FROM lang_definitions JOIN lang_constants ON " .
      "lang_definitions.cons_id = lang_constants.cons_id WHERE " .
      "lang_id='$lang_id' AND constant_name = '" .
      add_escape_custom($constant) . "' LIMIT 1";
    $res = SqlStatement($sql);
    $row = SqlFetchArray($res);
    $string = $row['definition'];
    if ($string == '') { $string = "$constant"; }
    
    // remove dangerous characters
    $patterns = array ('/\n/','/\r/','/"/',"/'/");
    $replace = array (' ','','`','`');
    $string = preg_replace($patterns, $replace, $string);
  }
    
  $string = "$prepend" . "$string" . "$append";
  if ($mode=='e') {
    echo $string;
  } else {
    return $string;
  }
}

-brady
OpenEMR

system · April 10, 2012, 7:06pm

yehster wrote on Tuesday, April 10, 2012:

What I’m thinking maybe is to tokenize the SQL string using space as delimiter. Since $tables is an associative array/hash table, we can search quickly to see if each token shows up in $tables (much quicker than the repeated stripos calls) until we determine the event type.

I also think that declaring $tables outside of the AuditSQLEvent function instead of inside it may also give some free speed up. No need to rebuild that array every time through the loop.

system · April 10, 2012, 7:18pm

tmccormi wrote on Tuesday, April 10, 2012:

Gotta say, it’s really cool to see this level of technical discussion around performance ….
-Tony

system · April 10, 2012, 8:35pm

kevmccor wrote on Tuesday, April 10, 2012:

yehster’s comment got me to look at a regular expression

$qs = ‘SELECT ALL FROM MYTABLE WHERE MYVAL=MYNUM’;
$tm = preg_match(‘/\w+(?= where)/i’, $qs, $match);
var_dump($tm) .PHP_EOL;
var_dump($match) .PHP_EOL;

Then, if I understand what yehster is getting at

if (isset($table[ $match ]) { … we have a table name in the array … }

system · April 10, 2012, 9:05pm

yehster wrote on Tuesday, April 10, 2012:

@Other Kevin,
Exactly…
https://github.com/yehster/openemr/commit/eb11d5087fedcb69e5fe1c19f41fd7bebae5e333

if (isset($table[ $match[0] ]) { ... we have a table name in the array ... }

Time for each array lookup multiplied by the number of tokens in the query should be much smaller than the strpos operation multiplied by 60 table names for when the table isn’t even one which access needs to be logged.

system · April 10, 2012, 9:23pm

kevmccor wrote on Tuesday, April 10, 2012:

I don’t understand tokens or tokenizing very well - I think it is each word is a token. But you said “multiplied by the number of tokens.” My thought is to get to the table name ASAP, so I think it would be a good idea to have an idea of which token is the table name, or at least which could not be the table name. Also, if your scheme adds enough efficiency, you could use an alternative check as well, until confidence in the method is 100%. BTW, how would you know the table names are always identified?

system · April 10, 2012, 9:31pm

sunsetsystems wrote on Tuesday, April 10, 2012:

Beware of tokenizing the contents of string literals in SQL. Can cause incorrect results.

Rod
www.sunsetsystems.com

system · April 10, 2012, 9:49pm

yehster wrote on Tuesday, April 10, 2012:

Tokenizing by space makes the assumption that the table name is surrounded by spaces. String literals which get chopped up by Tokenizing shouldn’t cause a problem since either it would have matched incorrectly using the old method, or the quotes which will still get included in the tokens won’t match in the array lookup. Also the truncated_sql operation which is unchanged for now should strip out most literals.
Alternately instead of Tokenizing we could make an assumption that the table name follows from, into or update in the statement.

system · April 10, 2012, 10:40pm

sunsetsystems wrote on Tuesday, April 10, 2012:

I suggest writing a “get next token” routine that recognizes and discards string literals. When you get to the first table name, check that name and call it good. It won’t be foolproof but it will be a lot better than what’s there.

Rod
www.sunsetsystems.com

system · April 11, 2012, 1:59pm

yehster wrote on Wednesday, April 11, 2012:

Also to note, a number of new features like patient reminders (both clinical and dated) don’t have their access logged.

In addition, I believe that the current mechanism won’t really handle a bunch of things regarding which patient data being accessed belongs to because it only ever uses the session pid to identify patients. So when viewing message/reminders of any kind regarding one patient while a different patient is “active” won’t work. As another example even if we added the patient_reminders table to the list of logged tables, it won’t log which patients had updated reminders when running the batch.

One other thing, I’ve been searching/reading information regarding the logging requirements, and I’m pretty sure that you don’t have to generate a checksum for each change. The requirement to use SHA1 encryption or better is to make sure that the audit log itself hasn’t been tampered with, or when exchanging clinical information in an export. It’s not about tracking changes in the system itself.

system · April 11, 2012, 2:15pm

kevmccor wrote on Wednesday, April 11, 2012:

I spent a few minutes playing with the regular expression. Assuming the table name always preceeds the “where” part of the query, this one should give the table name.

$tm = preg_match('/[ \\\'"]\w+[ \\\'"](?= where)/i', $qs, $match);
var_dump($tm) .PHP_EOL;
var_dump($match) .PHP_EOL;
if (isset($match[0]) ){  echo 'Table ' . trim($match[0], " :':\""); }

system · April 11, 2012, 2:32pm

yehster wrote on Wednesday, April 11, 2012:

There are many cases in which the table name won’t precede WHERE, specifically table alias, or joins.

select f.name FROM foo as f WHERE f.name ='kevin'

as a quick example.