Log Performance

kevmccor wrote on Wednesday, April 11, 2012:

Yeah, it is never so simple …  If you want, point me to example queries and I will try to come up with more in the regular expression area.  If the tokenizing approach is superior, then I would have to learn that to be any help.

yehster wrote on Wednesday, April 11, 2012:

For select statements, assuming that a table name to log about follows "FROM "  is likely to work reasonably well.

bradymiller wrote on Wednesday, April 11, 2012:

Hi,

One of the clever things about the current logging engine (created by Visolve) is that by logging all queries, one can always argue that everything is logged (and note it requires no maintenance). Also, the table does not need to be known to exist to get logged; that mechanism is only there to “categorize” the log. For example, an INSERT into dated-reminders table does get logged (as all INSERTS do), but gets categorized as a “other-insert”.

A critical issue that I have noticed (could effect certification) is that binded variables are not logged (there is a question mark there rather than data), which needs to be dealt with. The logging feature needs to dump these binded variables into the comments code somehow (my suggestion is to send the binded array as another parameter to the audit engine and then dump these into a variables sql column which can be displayed to right of comments in the log viewer; this will then allow forensic methods to recreate/analyze the queries).

-brady
OpenEMR

bradymiller wrote on Wednesday, April 11, 2012:

Also,

If the query has the typical patient id in it(such as pid or patientid) which is a real value, then these should be recorded as the pid, rather than using the session variable. (This also seems critical since it appears to be required that the logs need to be sortable by patient id).

-brady
OpenEMR

kevmccor wrote on Wednesday, April 11, 2012:

The regular expression I’ve come up with is:  ‘/(?:from|into|update)\s{0,1}(\w+)/i’

I copied example statements from a couple of websites, and it does seem to get the table name.

I’ve got to get back to work.

$stmts = array();
$stmts[] = "SELECT 'COL1', 'COL2', 'COL4' FROM mY_Table2 WHERE MYVAL='MYNUM";
$stmts[] = "SELECT 'COL1', 'COL2', 'COL4' FROM \"mY_Table2\" WHERE MYVAL='MYNUM";
$stmts[] = "SELECT 'COL1', 'COL2', 'COL4' FROM 'mY_Table2' WHERE MYVAL='MYNUM";
$stmts[] = "INSERT 'COL1', 'COL2', 'COL4' INTO mY_Table2 WHERE MYVAL='MYNUM"; 
$stmts[] = "INSERT INTO 'mY_Table' ('column1', 'column2') SELECT 'column3', 'column4' FROM 'table2'";
$stmts[] = "INSERT INTO Store_Information (store_name, Sales, Date) SELECT store_name, Sales, Date FROM Sales_Information";
$stmts[] = "UPDATE Store_Information SET Sales = 500 WHERE store_name = \"Los Angeles\" AND Date = \"Jan-08-1999\" WHERE Year(Date) = 1998";
$stmts[] = "DELETE FROM Store_Information WHERE store_name = 'Los Angeles'";
$stmts[] = "SELECT REPLACE(region_name, 'ast', 'astern') FROM Geography;";
$stmts[] = "UPDATE customers SET salesperson = 'Mike' WHERE state = 'NH';";
$stmts[] = "SELECT DISTINCT customers.customer_id, customers.customer_name FROM customers INNER JOIN orders ON customers.customer_id = orders.customer_id;";
$stmts[] = "SELECT customers.* FROM customers LEFT JOIN orders ON customers.customer_id = orders.customer_id WHERE orders.customer_id IS NULL;"; 
$stmts[] = "SELECT customer_state, COUNT(customer_id) As total FROM customers GROUP BY customer_state;";
foreach($stmts as $stm) {
	$tm2 = preg_match('/(?:from|into|update)\s[\\\'"]{0,1}(\w+)/i', $stm, $match2);
	//var_dump($tm2) .PHP_EOL;
	//var_dump($match2) .PHP_EOL;
	if (isset($match2[0]) ){  echo 'capture: ' . trim($match2[0], " :':\"").PHP_EOL; } else { echo "Missed " . substr($stm, 0, 20).PHP_EOL; }
	if (isset($match2[1]) ){  
		echo 'Table after: ' . trim($match2[1], " :':\"").PHP_EOL; 
	} else {
		echo "Missed table".PHP_EOL;
	}
}

yehster wrote on Wednesday, April 11, 2012:

I like it… This RegEx should be more robust than my tokenizing approach.

yehster wrote on Thursday, April 12, 2012:

Some preliminary profiling, AuditSQLEvent execution time goes from about 7ms to 3ms. Doesn’t seem like much, but multiplied by the number of times it gets called  the impact on calendar load time is ~2 seconds to less than 1. 

Here are a couple of queries which the regex misses. because of ` (backticks) surrounding the table name and lots of parentheses for joins.

Missed table SELECT pnotes.id, pnotes.user, pnotes.pid, pnotes.title, pnotes.date, pnotes.message_status, IF(pnotes.user != pnotes.pid,users.fname,patient_data.fname), IF(pnotes.user != pnotes.pid,users.lname,patient_data.lname), patient_data.fname, patient_data.lname FROM ((pnotes LEFT JOIN users ON pnotes.user = users.username) JOIN patient_data ON pnotes.pid = patient_data.pid)
Missed table SELECT pnotes.id, pnotes.user, pnotes.pid, pnotes.title, pnotes.date, pnotes.message_status, IF(pnotes.user != pnotes.pid,users.fname,patient_data.fname) as users_fname, IF(pnotes.user != pnotes.pid,users.lname,patient_data.lname) as users_lname, patient_data.fname as patient_data_fname, patient_data.lname as patient_data_lname FROM ((pnotes LEFT JOIN users ON pnotes.user = users.username) JOIN patient_data ON pnotes.pid = patient_data.pid)
	 From Sort Up 	 Patie

SELECT dr.pid, dr.dr_id, dr.dr_message_text,dr.dr_message_due_date, u.fname ffname, u.mname fmname, u.lname flname FROM `dated_reminders` dr JOIN `users` u ON dr.dr_from_ID = u.id JOIN `dated_reminders_link` drl ON dr.dr_id = drl.dr_id Missed table SELECT pd.title ptitle, pd.fname pfname, pd.mname pmname, pd.lname plname FROM `patient_data` pd

mcaloon wrote on Thursday, April 12, 2012:

I think the regex needs to accommodate the JOIN with something like

preg_match('/(?:from|into|update|join)\s[\\\'"]{0,1}(\w+)/i', $stm, $match2)

perhaps as table names follow join in the ANSI compliant syntax (i think)

bradymiller wrote on Thursday, April 12, 2012:

Hi,

This should also help to speed things up:
http://github.com/bradymiller/openemr/commit/de2e46885e49f710765e8f2ccbd9f0dad4f1410a

With above mod, all translation functions (xl(), xl_list(),etc.) will now completely skip the audit/logging engine.

-brady
OpenEMR

kevmccor wrote on Thursday, April 12, 2012:

For the backtick issue, try adding the backtick to the character class  part
So now the regex looks like:

'/(?:from|into|update|join)\s[\\\'"`]{0,1}(\w+)/i'

If that is legible.  The three slashes are needed to escape the ’ single quote. It has to do with PCRE delimiters and other mysterious syntax.  You could also try this:

"#(?:from|into|update)\s['\"`]{0,1}(\w+)#i"

Notice now the delimiters are changed, regex in double quotes, and double quote escaped in the character class.

kevmccor wrote on Thursday, April 12, 2012:

This one got the pnotes and the dated_reminders table.  I added the \({0,2} which escapes opening parentheses and allows 0, 1, or 2 parentheses before the capturing (\w+).  RegEx syntax is tricky, so whenever I use them I try to keep everything as simple as possible.  This one looks pretty maxxed-out to me.

"#(?:from|into|update|view)\s['\"`]{0,1}\({0,2}(\w+)#i"

bradymiller wrote on Friday, April 13, 2012:

Hi,

Two things:

1. Committed my above commit to sourceforge (with a couple minor mods).

2. Here’s a bug fix to ensure that the binded values are also logged. Please review it:
http://github.com/bradymiller/openemr/commits/log-fix_1

-brady
OpenEMR Project

kevmccor wrote on Friday, April 13, 2012:

Another mod to the regex.  I moved the opening parentheses into the character class and allowed 0 to 3 occurrences of those characters.  All tables following from|into|update|view in my my sql examples were obtained.  I suppose the exhaustive test would require listing all forms of queries that OpenEMR generates.

"#(?:from|into|update|view)\s['\"`(]{0,3}(\w+)#i"

bradymiller wrote on Saturday, April 14, 2012:

Hi,
The commit I posted above to place binded variables in the log is testing well, so I committed it to sourceforge.
-brady
OpenEMR

kevmccor wrote on Saturday, April 14, 2012:

In /library/adodb/adodb.inc.php line 92, there is this constant defined:

	// allow [ ] @ ` " and . in table names
		define('ADODB_TABLE_REGEX','([]0-9a-z_\:\"\`\.\@\[-]*)');

This might be useful, because some of these characters will not match to the perl word characters (\w+)
Now the regex is made like this, for example:

	$dbtableregex = '([]0-9a-z_\:\"\`\.\@\[-]*)';  // ADODB_TABLE_REGEX
	$tm2 = preg_match("#(?:from|into|update|view)\s['\"`(]{0,3}$dbtableregex#i", $stm, $match2);