A couple of months ago, I decided to properly enter the twenty-first century by abolishing all paper administration in my home and going completely digital. I have about ten folders full of banking, insurance, pension and healthcare papers that I'd like to get rid of. Additionally, trying to find an invoice or receipt for some broken thingy to check whether it's still under warranty is a nightmare with just a shoebox full of them.
The requirements of the setup can be summarized as follows:
- A solution that turns paper documents into searchable PDF files
- Storage somewhere they can be accessed from both a computer and mobile devices
- An index supporting full text search and preferably also search on metadata (creation date, file name, etc.)
- Proper backup and restore options
The idea for the solution was simple: get a document scanner and upload the documents into a premium Evernote account. As an extra measure, I'd keep a local copy of the scans on my server to ensure that I wouldn't lose my information even if my Evernote account was compromised.
But... I forgot to take one thing into account: the WAF (Wife Acceptance Factor). My wife wasn't so pleased to hear about my ambitions. She didn't like the idea of not having paper copies anymore, and the thought of having all our privacy-sensitive information in the cloud did not appeal to her either. What if the NSA, what if Google, what if we lose everything? On top of that, she didn't want to use Evernote, because she really wanted to be able to put the scanned documents into different folders, as she has always done physically. The concept of Notebooks didn't appeal to her either.
So what to do? Storing everything in Google Drive would definitely have solved the folder issue and still kept everything searchable (Dropbox, for instance, does not support full text searching within PDF files at the time of writing), but as said, Google was out of the question.
I already have too much IT at home (I manage my own IMAP server and web site), so I wanted to avoid adding more applications to support to the mix and also practice what I preach in my day job: don't do everything yourself, but use SaaS/PaaS/IaaS whenever possible. But alas, I was out of options in this case.
I started exploring Open Source options that would fit the requirements and could run on a Linux VM on my server at home. Because of the folder requirement I added WebDAV support to the list of mandatory features.
The first try was ownCloud. It's a feature-rich suite of programs that on paper (no pun intended) offered everything I needed. It didn't offer thumbnails for the PDF files in the web interface, but I could live with that. More problematic was the fact that it did not properly index the PDF documents, and therefore full text search did not work as expected. I spent a couple of hours trying to find out why it failed, without success, and that did not improve my confidence in the solution. Eventually, I discarded it.
The next try was Pydio. It has a really nice responsive web interface with beautiful large background images, but the search function is seriously impaired: by default it searches only on file name, and you have to explicitly select the advanced option to be able to search the full text. A definite big minus. When I tried to use the mobile website, somehow neither of my Android phones let me select the password field in the login form, and therefore I was unable to log in. Exit Pydio.
Later, a colleague at work, after hearing my complaint (and making fun of my crazy efforts), suggested that what I really needed was a proper document management system. I thought I'd give that a try and started looking for Open Source options. It turns out that most of the available packages lack serious features in the open version and only let you use the full product for a truly corporate license fee. I took a look at OpenKM and SeedDMS, but I wasn't thrilled.
Then I discovered Alfresco. The Community Edition is quite complete. It doesn't scale as well as the Enterprise Edition, but the whole solution is already complete overkill for my use case. Getting it up and running took some effort, but more on that later.
First, let me elaborate a bit on the input part: the scanner.