Documents saving in DB

zhhealthcare wrote on Monday, October 11, 2010:

Question for the experts:   Why is it that we don’t save the PDF documents into the database?  Is it a bad idea? Is it not better to save it into the DB for security?

Thanks
Sam

sunsetsystems wrote on Monday, October 11, 2010:

I would say mainly because the documents store can be hundreds or thousands times larger than the rest of the database.  For many sites this would make it difficult to do frequent backups.  With documents stored as files there are good tools for making differential backups, i.e. transferring only the newly added or changed files.

Rod
www.sunsetsystems.com

tmccormi wrote on Monday, October 11, 2010:

First documents aren’t just PDFS, some are JPG, GIF, TIFF, Text, PNG etc …  Dr Bowen kept all his docs in the Database for years and recently stopped.  Mostly due to backup performance and data conversion upgrades being too long and complex, if I recall.  OpenEMR use performance was fine.

This is kind of a religious topic… :-)   I personally favor storing documents on  a fast file system, it allows outside tools to access the content normally for other kinds of operations like fax and email.   If it’s in the database you have to write it to disk before you can send it out….

zhhealthcare wrote on Tuesday, October 12, 2010:

I have been told that PHP has functions to process zip files.  IS it a good idea to have files zipped up periodically , say files in monthly zip files and store it.  Then when needed it can be processed and pulled up.  This is in the interest of practices who have a lot of records and wants to organize files more efficiently.

Thanks
Sam

sunsetsystems wrote on Tuesday, October 12, 2010:

I don’t think zipping will accomplish much, as most image files are pretty well compressed already.

Rod
www.sunsetsystems.com

zhhealthcare wrote on Tuesday, October 12, 2010:

It may not reduce the size but will organize the files more effectively.

visolveemr wrote on Wednesday, October 13, 2010:

Hi,

Most of the commercial and open source databases supports the data type what is called ‘BLOB’ which is used for storing images, non text documents like PDF and other ‘binary’ data.Several industry application uses this data type to store images in the data base. However, as discussed earlier in the forum, it is the trade off between performance, functionality and storage. Again, there is no hard and fast rule on one should store PDF files inside or outside the Data base.

Here is some of the pros and cons of storing pdf files inside the data base:
(let us assume the practice have a 3000 patient and each patient has 10 different ‘blob’ objects like scans, pdf etc)

Pros:
1. All associated data will be stored under one roof - all DB files. Maintenance of the storage is easy.
2. More secured because there is no files just has the individual documents. That is the data can be accessed only thru DB.
3. Accident removal of pdf files wont happen as the data will be inside the DB.
4. Maintenance is easy as one can use the DB tools like DB back up etc for the entire practice of data.

Cons:
1. More storage - Insert of ‘blob’ will take at least twice the size of the data (one for data and one for logging)
       So, if you have to insert 50 GB of image data then you need at least 100 GB of disk space.
2. Slower performance -
      As discussed in the forum, backing up the data base will take significantly longer time.

Thanks
ViCarePlus Team
www.vicareplus.com
services@vicareplus.com

verbus wrote on Wednesday, October 13, 2010:

Greetings,

It has been my experience that the Pros should win as disk space is cheap.

Verbus

ytiddo wrote on Wednesday, October 13, 2010:

One of the cons visolve failed to mention, and that was mentioned earlier by tony:
Storing files as files on a file system let you use native processing programs on them. Think search functions.

Its my opinion that filesystems are the best tool for storing files, especially since it makes it convenient to use external search tools on documents.

zhhealthcare wrote on Wednesday, October 13, 2010:

My question is : how secure is it to have files outside of the DB, and does it comply with Hipaa or other security standards?  If it does then in order to organize the files more efficiently outside of the DB is it better to organize files by zipping.

Sam

penguin8r wrote on Wednesday, October 13, 2010:

Since we’re having this discussion, let me throw in my $0.02.  From my 3+ years of ongoing experience installing & maintaining multiple instances of OpenEMR in the real world, it’s much better not to have you documents buried in the database.  If you look at other open source document management or data storage systems, almost all of them store the actual files outside of the database.
If you system is properly set up, security is not an issue.  Maybe that’s a problem on Windows but I have not & never will setup an OpenEMR install on a Windows backend, so it’s a non-issue for me at least.
The advantages for incremental backups & snapshot backups of the database have already been mentioned, as well as having the ability to access the files with other tools.
Everyone likes to say disk space is cheap, that’s true to some extent.
However, fast, redundant, disk space (think SAS RAID1 or RAID6) is still not exactly cheap.
When a user does something that corrupts the database & you have to figure out what’s wrong, you’re not sorting through a half terabyte or larger DB.

sunsetsystems wrote on Wednesday, October 13, 2010:

My question is : how secure is it to have files outside of the DB, and does it comply with Hipaa or other security standards? If it does then in order to organize the files more efficiently outside of the DB is it better to organize files by zipping.

Security is a matter of details, not of the storage medium.  I would not use zipping as a method of organizing files.  Again, having a sensible backup strategy is a primary consideration and tools like rsync are very useful for that.

penguin8r wrote on Wednesday, October 13, 2010:

^^^ Exactly what he said.
Zipping a PDF, tif, or jpeg file gains you nothing, & adds another layer of complexity when you need to view the file.
The files are already organized per patient, what more would be necessary?
Once a system has been running for a while & you’ve got 100+GB of documents, why back up the same thing over & over every night when less than 1% of the actual volume of stored data actually changes on a daily basis?

drbowen wrote on Monday, October 18, 2010:

Dear Sam:

I have substantial experience in my office with storing all documents in MySQL InnoDB tables.  In fact I have 80 gigabytes in a single InnoDB table.  The following are things that I learned the hard way.

MyISAM:

1)MyISAM tables are easy to corrupt.  I had a small temporary server running my office without battery backup when a car hit the transformer up the street at 10 AM on a Monday.  MyISAM are usually very easy to fix.

2)MyISAM tables have a 4 gigabyte hard limit.  This crashed the Apache web server at 10 AM on a Saturday.  It was creepy to see a file that had exactly 4,000,000,000 bytes but that was the clue I needed tio solve the problem.

3)There is no referential integrity.  I have had one patient in my database that had a null pid. Don’t ask me how.  And why the system worked with a null pid I still don’t understand. (pubpid=450).

4)We have lots of patients in our data base with identical pubpids.

InnoDB:

1)Referential integrity can be enforced

2)It is necessary to modify the my.cnf to auto increment.  The usual defaults are way too small. Auto increment allows your table
to get quite large.

3)Using this feature allows real time back up by mirroring your data base changes using the built in Master Slave Replication.

4)Real time backups can be done over the internet to remote servers.

5)InnoDB tables can span across multiple hard drives and or servers allowing really gargantuan tables.

6)Single point of defense.  An attacker can get to your hard drive remotely easier than he can crack an appropriately defended MySQL database.  This can be ameliorated by encrypting the hard drive.

7)Backing up an 80 gigabyte table, or moving it for some purpose takes about 4 hours with our hardware.  Rsync doesn’t help much on an 80 gigabyte table that is always changing.

8)Backing up the table in this fashion means shutting the database engine.

9)Performance in real time is terrific.  My files in the InnoDB table at 80 gigabytes open faster than the current documents directory.

I already have the code written and will be glad to share this with you.  Tony keeps urging me to turn in the code to the project which I will be glad to do.  It was written 6 years ago and may need to be cleaned up to match current project standards.

Sam Bowen, MD
http://oemr.org

mdsupport wrote on Friday, October 22, 2010:

In our experience, appropriate solution for handling ‘documents’ has to be tailored to individual situations.  Current OpenEMR capabilities are are minimal - appropriate for a small practice with relatively low nbr of documents.  Looking ahead as the requirements to share information with various interested parties becomes pervasive, this will become a problem. 

Has anyone experimented with using a document management system with OpenEMR?  Our time can be better leveraged in establishing integration points to generic information exchange standards and let a real Content Management System do the management work.  Check out www.alfresco.com for one such solution.

anonymous wrote on Friday, October 22, 2010:

I have more than 10 years of experience in content management systems like Documentum, Filenet, SharePoint…

Storing documents in database is NOT a good idea.  Store metadata in database, and documents in filesystem.

SharePoint stores documents in database, but can’t handle large content management systems.

My startup practice has more 10GB documents in 1 year.  If we choose documents to be stored in database, can’t imagine daily full database backup, and trasmit to online backup centers…

hrivera787 wrote on Saturday, October 23, 2010:

Has been a wile, I have been very busy lately.

Anyway, storing IMAGES, DOCUMENTS, into the database can be done.

Just make sure to make two tables one for the INDEX and the other one for the DOCUMENT’s DATA. This way you wont suffer from SLOW performance. I had this problem before for one of my Web Application.

Remember it’s does not matter how many space you have. With the data and the index is stored together, the hard drive heads have to move  big time to scan the INDEX and then give you the BLOB.

If you separate the INDEX in one table and the BLOB data into another table, the hard drive will seek the INDEX and give the BLOB quite fast.

Do the test, I had this problem before.

- TEST 1: BLOB + INDEX = SLOW
- TEST 2: BLOB ONE TABLE, INDEX ONE TABLE = FAST

In the INDEX table:
* ID
* FILENAME
* DESCRIPTION
* FILESIZE
* METATAGS
* KEYWORDS
* MIME TYPE

In the BLOB Table:
* INDEXID <- Relate to INDEX
* BLOB

With this configuration you can do the JOB, and the hard drive stays happy.
The key benefits of DATABASING the DOCUMENTS:

1. You can SORT, SEEK PATIENT DOCUMENTS, FILTER, ect. More faster, intead of scanning the FS.
2. You don’t have to mess around with FS.
3. You don’t have to work with certain FS like Windows (FATXX), Linux (EXT*,REISER,ect) , Unix(EXT,REISER,ect), Apple(EXT,REISER,ect), ect.

But, on the BACKUP issue, all the above comments are RIGHT!, it will be a BIG BIG BIG UGLY backup file. But either way you have to BACKUP the documents too!. Is part of the patient documents right?

But:
* For those who make backups via email this will be a BIG ISSUE
* For those who make backups via FTP prepare to wait big time, sit down and prepare a cup of COFFEE.
* For those who have money and have a NAS (Network Attached Storage) or SAN (Storage area network), no problem.

How to resolve the problem?:
Make a option on the configuration, that let you choose to backup the DOCUMENTS or not.
* Backup Documents Data YES/NO

In my opinion:
I think is no issue to keep the documents live on the FS (File System), but on the SECURITY point of view, is more secure in the database.