This post is the second in a series about going paperless.
Going paperless would of course be rather straightforward if all communication were already electronic and all you had to do was store it somewhere where it can be indexed and organized in your preferred way.
Unfortunately, quite a few important artefacts are still provided only on paper. Therefore, the first step towards going paperless is to find a way to convert these paper artefacts as smoothly as possible into searchable electronic documents.
This conversion starts with selecting a document scanner fit for the job. In my case, that means a device capable of scanning two-sided multi-page documents at sufficient speed (more than 10 pages per minute). Colour scanning would be a nice plus. Since I am aiming to implement a fully open source solution, the availability of Linux drivers, and ideally SANE support, is required as well.
These requirements put you mostly in the range of small business devices, which are offered by several manufacturers. Epson, for instance, offers the DS 510, which according to its specifications should be a good fit.
I could continue listing additional options before concluding what almost any source on the internet concludes as well: the Fujitsu ScanSnap iX500 is probably the most suitable scanner on the market for medium workloads. It has support for Windows and Mac, it scans directly into several cloud storage solutions and it has built-in WiFi to integrate into your home network, but it also has SANE support, so it is well usable in a Linux environment. With a price around €450 it doesn't come cheap, but for a reliable scanner it is definitely worth it.
Since the Epson was much less available in the Netherlands, I settled for the ScanSnap.
Getting the ScanSnap to work in SANE for Debian/Ubuntu based Linux distributions is really simple. Just install the command line utilities:
apt-get install sane-utils
and scan your first document with the scanimage command line tool:
scanimage --batch='out%04d.pnm' -y 298 --source 'ADF Duplex'
The arguments I used are --batch, -y and --source. The first tells scanimage to scan all pages in the feeder and the second indicates the height of the document to be scanned in millimetres. The value is slightly larger than the height of an A4 page (297 mm), which allows the additional carrier sheet (used to scan several small items like receipts or photos at once) to be fully processed. The third option (--source 'ADF Duplex') tells scanimage to scan both the front and the back of each page. By default, scanimage only scans the front. If you have multiple scanners connected, also add the device with the -d option. In my case: -d 'fujitsu:ScanSnap iX500:85330'.
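Because the device name contains a space, it has to be quoted when it is passed to -d. A minimal sketch of a wrapper that adds the device only when one is given; the function name scan_pages is my own invention for illustration, and the device string is simply the one from my setup:

```shell
#!/bin/sh
# scan_pages is a hypothetical helper, not part of SANE.
# Usage: scan_pages ['device name']
scan_pages() {
    device=${1:-}   # empty means: let scanimage pick the default device
    if [ -n "$device" ]; then
        # Quote the device: names such as 'fujitsu:ScanSnap iX500:85330' contain a space.
        scanimage -d "$device" --batch='out%04d.pnm' -y 298 --source 'ADF Duplex'
    else
        scanimage --batch='out%04d.pnm' -y 298 --source 'ADF Duplex'
    fi
}
```

Running scanimage -L lists the devices SANE can see, which is where the name to pass in comes from.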
After some trial and error, I added two additional arguments to the command:
scanimage --batch='out%04d.pnm' -y 298 --source 'ADF Duplex' --swskip 15 --swdespeck 1
The first argument (--swskip) ensures that empty pages are skipped; the value of 15 is an empirically derived threshold that works best for a variety of documents. The --swdespeck 1 argument removes single isolated pixels from the scans, which results in cleaner images that are also processed better by the optical character recognition software. More about that later. I also tried removing sets of 2 or 3 pixels, but that tended to make the results worse.
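Finding a threshold like this is a matter of scanning the same stack with different values and comparing how many pages survive. A rough sketch of how that could be done, assuming each run writes into its own directory; the helper names scan_with_swskip and count_pages are made up for this example:

```shell
#!/bin/sh
# Hypothetical helpers for comparing --swskip thresholds.

# Scan the stack currently in the feeder into its own directory.
scan_with_swskip() {
    threshold=$1
    outdir=$2
    mkdir -p "$outdir"
    scanimage --batch="$outdir/out%04d.pnm" -y 298 --source 'ADF Duplex' \
        --swskip "$threshold" --swdespeck 1
}

# Print how many page images a run produced.
count_pages() {
    dir=$1
    set -- "$dir"/out*.pnm
    # An unmatched glob stays literal, so check that the first match exists.
    if [ -e "$1" ]; then
        echo "$#"
    else
        echo 0
    fi
}
```

For example, scan_with_swskip 10 run10 followed by count_pages run10 shows how many pages were kept at a threshold of 10. The same stack has to be put back into the feeder before each run.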
The next step in the process of creating a searchable document is to process the images with optical character recognition (OCR) software to extract the text. This is the most tricky part when setting up an open source solution. Not many open source OCR solutions exist and the quality of the results is not on par with commercial products.
I selected three solutions that are still actively developed: tesseract-ocr, gocr and cuneiform. Both tesseract and cuneiform proved to give the most accurate results. In addition, the PDF option of tesseract produced the best quality PDFs, containing both the original (compressed) image and the resulting text as an overlay, so it's possible to select text when viewing the resulting PDF file in your favourite PDF viewer. With cuneiform, I did not succeed in producing properly aligned text overlays in the PDFs. That's why I settled for tesseract. Since I commonly process documents in Dutch and in English, I use the following command to convert each scanned image into a PDF with text overlay:
tesseract out0001.pnm out0001 -l nld+eng pdf
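A batch scan produces one image per page, so the conversion has to be repeated for every file. A small sketch of such a loop, assuming the out%04d.pnm naming used above; ocr_pages is a name I made up. Note that tesseract appends the .pdf extension to the output base itself, so the base is passed without an extension:

```shell
#!/bin/sh
# ocr_pages is a hypothetical helper: convert each given page image to a PDF.
# Usage: ocr_pages out*.pnm
ocr_pages() {
    for page in "$@"; do
        # Strip the .pnm suffix; tesseract adds .pdf to the output base.
        tesseract "$page" "${page%.pnm}" -l nld+eng pdf || return 1
    done
}
```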
To concatenate the individual PDF files for each page, I use pdfunite, one of the tools from the poppler-utils package.
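pdfunite expects the source files first and the destination file last. A minimal sketch, assuming the per-page PDFs are named out0001.pdf, out0002.pdf and so on; merge_pages and the file name document.pdf are placeholders of my own:

```shell
#!/bin/sh
# merge_pages is a hypothetical helper around pdfunite.
# Usage: merge_pages document.pdf out*.pdf
merge_pages() {
    result=$1
    shift
    # Sources first, destination last.
    pdfunite "$@" "$result"
}
```

Because the page numbers are zero-padded, the shell expands out*.pdf in page order. The result should get a name that does not match that pattern, otherwise a second run would pick it up as a page.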
The image below shows a sample of a scanned document.
Tesseract was able to extract the following text:
TEARS IN HEAVEN Would you know my name if I saw you in heaven W0u|d it be the same if I saw you in heaven I must be strong and carry on ï ’Cause I know I don’t belong here in heaven Would you hold my hand if I saw you in heaven Would you help me stand if I saw you in heaven I’ll find my way Through night and day 'Cause I know I just can't stay here in heaven Time can bring you down, time can bend your knees Time can break your heart, have you beggin’ please, beggin’ please TUSSENSPEL Beyond the door There’s peace, I'm sure And I know there’ll be no more Tears in heaven Would you know my name if l saw you in heme? Would it be the same if I saw you in i must be strong and carry on ë â here in heaven No l know i don’t belong here in heaven
As you can see, the mediocre quality of the print already causes faults in the recognition, although in practical cases a sufficient amount of text is recognized correctly, so that the document shows up in a search.
Now we're almost there. I'd like to put a document in the sheet feeder and push the scan button on the front of the scanner to automatically start the scanning process. Debian/Ubuntu based distributions have a standard package with a small daemon to do exactly that: scanbuttond. Unfortunately, scanbuttond does not support the Fujitsu ScanSnap. Luckily, there's an alternative: scanbd (Ubuntu packages). Installing from source is easy, just follow the instructions in
Finally, I wrapped everything above together in a script, scan.sh, that is called by scanbd when the button is pressed. The OCR analysis of the images is a CPU-intensive task and tesseract runs single-threaded, leaving most of your computing power unused. Therefore, I added some logic to the script to launch one tesseract instance for each CPU core found in the system, speeding up the OCR process significantly. The script sends the resulting PDF by email to an address of your choice (both Evernote and Alfresco support adding documents via email) using the mime-construct tool.
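The idea of running one tesseract instance per core can be sketched with standard tools. This is not the code from scan.sh, just one possible approach, assuming GNU nproc and an xargs that supports -0 and -P; the function name ocr_pages_parallel is hypothetical:

```shell
#!/bin/sh
# Hypothetical sketch: OCR the given page images with one tesseract per CPU core.
# Usage: ocr_pages_parallel out*.pnm
ocr_pages_parallel() {
    [ "$#" -gt 0 ] || return 0      # nothing to do without pages
    cores=$(nproc)                  # number of available CPU cores
    # Feed the file names NUL-separated to xargs, which keeps up to
    # $cores tesseract processes running, one page per process.
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$cores" \
            sh -c 'tesseract "$1" "${1%.pnm}" -l nld+eng pdf' ocr
}
```

Each page still takes as long as before, but with several pages in flight at once the whole document finishes sooner on a multi-core machine.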
Summarizing, to get your ScanSnap up and running in Debian/Ubuntu Linux, perform the following steps:
# Install all dependencies
sudo apt-get install sane-utils poppler-utils tesseract-ocr tesseract-ocr-nld tesseract-ocr-eng tesseract-ocr-anyOtherLanguageOfYourChoice mime-construct
# Download and install scanbd
tar xf scanbd-1.4.3.tgz
./configure --prefix=/usr --sysconfdir=/etc
sudo make install
# For Ubuntu, download the init script; for other distributions, see the integration folder in the scanbd source
sudo wget 'http://fokke.org/site/sites/default/files/scanbd.conf'
# Download the scan.sh script
sudo wget 'http://fokke.org/site/sites/default/files/scan.sh'
# Update the configuration file to point the 'action scan' to the scan.sh script