The right document scanner

This post is the second in a series about going paperless.

Going paperless would of course be rather straightforward if all communication were already electronic and all you had to do was store it somewhere where it can be indexed and organized in your preferred way.

Unfortunately, quite a few important documents are still provided only on paper. Therefore, the first step towards going paperless is to find a way to convert those paper documents as smoothly as possible into searchable electronic documents.

This conversion starts with selecting a document scanner fit for the job. In my case, that meant a device capable of scanning two-sided, multi-page documents at sufficient speed (>10 pages per minute). Colour scanning would be a nice plus. Since I am aiming for a fully open source solution, Linux drivers are required as well, ideally with SANE support.

These requirements put you mostly in the range of small business devices, which are offered by several manufacturers. Epson, for instance, offers the DS-510, which judging by its specifications should be a good fit.

Epson Workforce DS-510

I could continue listing additional options before concluding what almost any source on the internet concludes as well: the Fujitsu ScanSnap iX500 is probably the most suitable scanner on the market for medium workloads. It supports Windows and Mac, it scans directly into several cloud storage solutions, and it has built-in WiFi to integrate into your home network. It also has SANE support, so it is perfectly usable in a Linux environment. At a price of around €450 it doesn't come cheap, but for a reliable scanner it is definitely worth it.

Fujitsu ScanSnap iX500

Since the Epson was much harder to find in the Netherlands, I settled on the ScanSnap.

Getting the ScanSnap to work in SANE for Debian/Ubuntu based Linux distributions is really simple. Just install the command line utilities:

sudo apt-get install sane-utils

and scan your first document with the scanimage command line tool:
scanimage --batch='out%04d.pnm' -y 298 --source 'ADF Duplex'

The arguments I used are --batch, -y and --source. The first tells scanimage to scan all pages in the feeder, and the second sets the height of the scan area in millimetres. The value is slightly larger than the height of an A4 page (297 mm), which allows the additional carrier sheet (used to scan several small items like receipts or photos at once) to be fully processed. The third option (--source 'ADF Duplex') tells scanimage to scan both the front and the back of each page. By default, scanimage only scans the front. If you have multiple scanners connected, also specify the device with the -d option. In my case: -d 'fujitsu:ScanSnap iX500:85330' (the quotes are needed because the device name contains a space).
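The device name for -d can be looked up with scanimage -L, which lists the scanners SANE can see. The options above can also be wrapped in a small shell function. This is only a minimal sketch: the function name scan_batch and its argument order are made up for illustration and are not part of sane-utils.

```sh
# List the scanners SANE can see; the quoted name in the output is what -d expects.
#   scanimage -L

# Hypothetical helper (name and arguments are illustrative):
#   scan_batch OUTPUT_DIR [DEVICE]
scan_batch() {
    out_dir=$1
    device=${2:-}

    mkdir -p "$out_dir" || return 1

    # Same options as the scanimage command above, writing into OUTPUT_DIR.
    set -- --batch="$out_dir/out%04d.pnm" -y 298 --source 'ADF Duplex'

    # Only add -d when a device was given; quoting keeps the space in the name intact.
    if [ -n "$device" ]; then
        set -- -d "$device" "$@"
    fi

    scanimage "$@"
}
```

Called as scan_batch ~/scans 'fujitsu:ScanSnap iX500:85330', it scans everything in the feeder into ~/scans; without the second argument it uses the default device.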

After some trial and error, I added two additional arguments to the command:

scanimage --batch='out%04d.pnm' -y 298 --source 'ADF Duplex' --swskip 15 --swdespeck 1

The first argument (--swskip) ensures that empty pages are skipped; the value of 15 is an empirically derived threshold that works best for a variety of documents. The --swdespeck 1 argument removes single isolated pixels from the scans, which results in cleaner images that are also processed better by the optical character recognition software. More about that later. I also tried removing clusters of 2 or 3 pixels, but that tended to make the results worse.

The next step in creating a searchable document is to process the images with optical character recognition (OCR) software to extract the text. This is the trickiest part of setting up an open source solution: not many open source OCR solutions exist, and the quality of their results is not on par with commercial products.

I selected three solutions that are still actively developed: tesseract-ocr, gocr and cuneiform. Of these, tesseract-ocr and cuneiform gave the most accurate results. In addition, the PDF option of tesseract produced the best-quality PDFs, containing both the original (compressed) image and the recognized text as an overlay, so it's possible to select text when viewing the PDF file in your favourite PDF viewer. With cuneiform, I did not succeed in producing properly aligned text overlays in the PDFs. That's why I settled on tesseract. Since I commonly process documents in Dutch and in English, I use the following command to convert each scanned image into a PDF with text overlay:
tesseract out0001.pnm out0001 -l nld+eng pdf
To concatenate the individual PDF files for each page, I use pdfunite, one of the tools from the poppler-utils package.
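The two steps can be combined in a short loop. The following is a minimal sketch, not the code from my script: the function name ocr_and_merge and its arguments are chosen for illustration only. It relies on tesseract appending ".pdf" to the output base name, and on pdfunite taking the input files first and the output file last.

```sh
# Hypothetical helper: OCR every out*.pnm in a directory and merge the pages.
#   ocr_and_merge SCAN_DIR RESULT_PDF
ocr_and_merge() {
    scan_dir=$1
    result=$2

    # One searchable PDF per page; tesseract appends ".pdf" to the output base.
    for page in "$scan_dir"/out*.pnm; do
        [ -e "$page" ] || { echo "no pages in $scan_dir" >&2; return 1; }
        tesseract "$page" "${page%.pnm}" -l nld+eng pdf || return 1
    done

    # The zero-padded names from --batch sort in page order, so the glob is enough.
    pdfunite "$scan_dir"/out*.pdf "$result"
}
```

For example, ocr_and_merge . scan.pdf processes the pages in the current directory and writes the combined document to scan.pdf.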

The image below shows a sample of a scanned document.

Sample scan of a document

Tesseract was able to extract the following text:

TEARS IN HEAVEN

Would you know my name if I saw you in heaven
W0u|d it be the same if I saw you in heaven
I must be strong and carry on ï
’Cause I know I don’t belong here in heaven

Would you hold my hand if I saw you in heaven
Would you help me stand if I saw you in heaven
I’ll find my way
Through night and day
'Cause I know I just can't stay here in heaven

Time can bring you down, time can bend your knees
Time can break your heart, have you beggin’ please, beggin’
please

TUSSENSPEL

Beyond the door
There’s peace, I'm sure
And I know there’ll be no more
Tears in heaven

Would you know my name if l saw you in heme?
Would it be the same if I saw you in
i must be strong and carry on
ë
â
here in heaven
No l know i don’t belong here in heaven

As you can see, the mediocre print quality already causes recognition errors. In practice, though, enough text is recognized correctly for the document to show up in a search.

Now we're almost there. I'd like to put a document in the sheet feeder and push the scan button on the front of the scanner to start the scanning process automatically. Debian/Ubuntu-based distributions have a standard package with a small daemon to do exactly that: scanbuttond. Unfortunately, scanbuttond does not support the Fujitsu ScanSnap. Luckily, there's an alternative: scanbd (Ubuntu packages). Installing from source is easy: just follow the instructions in doc/README.txt.

Finally, I wrapped everything above into a script, scan.sh, that scanbd calls when the button is pressed. OCR is a CPU-intensive task and tesseract runs single-threaded, leaving most of your computing power unused. Therefore, I added some logic to the script to launch one tesseract instance per CPU core found in the system, speeding up the OCR process significantly. The script sends the resulting PDF to an email address of your choice (both Evernote and Alfresco support adding documents via email) using the mime-construct tool.
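The idea of one tesseract instance per core can be sketched with nproc and xargs. This is not the code from scan.sh, just a minimal illustration under a few assumptions: the function name ocr_parallel is made up, nproc comes from GNU coreutils, and xargs supports the -0 and -P options (as the GNU version on Debian/Ubuntu does).

```sh
# Hypothetical helper: OCR all pages in SCAN_DIR, one tesseract process per CPU core.
#   ocr_parallel SCAN_DIR
ocr_parallel() {
    scan_dir=$1

    # Number of CPU cores; fall back to a single job if nproc is missing.
    n_jobs=$(nproc 2>/dev/null || echo 1)

    # Bail out when the glob matched nothing.
    set -- "$scan_dir"/out*.pnm
    [ -e "$1" ] || { echo "no pages in $scan_dir" >&2; return 1; }

    # xargs keeps up to $n_jobs tesseract processes running; each gets one page as $1.
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$n_jobs" sh -c 'tesseract "$1" "${1%.pnm}" -l nld+eng pdf' _
}
```

Because every page is written to its own PDF, the pages can be processed in any order and concatenated afterwards.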

Summarizing, to get your ScanSnap up and running in Debian/Ubuntu Linux, perform the following steps:

# Install all dependencies
sudo apt-get install sane-utils poppler-utils tesseract-ocr tesseract-ocr-nld tesseract-ocr-eng tesseract-ocr-anyOtherLanguageOfYourChoice mime-construct

# Download and install scanbd
mkdir ~/scanbd
cd ~/scanbd
wget 'http://downloads.sourceforge.net/project/scanbd/releases/scanbd-1.4.3.tgz'
tar xf scanbd-1.4.3.tgz
cd 1.4.3
./configure --prefix=/usr --sysconfdir=/etc
make
sudo make install

# For Ubuntu, download the init script; for other distributions, see the integration folder in the scanbd source
cd /etc/init/
sudo wget 'http://fokke.org/site/sites/default/files/scanbd.conf'

# Download the scan.sh script
cd /etc/scanbd/scripts
sudo wget 'http://fokke.org/site/sites/default/files/scan.sh'

# Update the configuration file to point the 'action scan' to the scan.sh script
sudoedit /etc/scanbd/scanbd.conf
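
After these steps, a quick sanity check can catch the most common slip-ups. The sketch below is an illustration only: the function name check_scan_setup and its messages are made up, and it assumes that scanbd executes the script directly, so the file needs the executable bit, which wget does not set. The grep is a rough check that the configuration mentions the script name at all; it does not parse the action block.

```sh
# Hypothetical check (function name and messages are illustrative):
#   check_scan_setup [SCRIPT] [CONFIG]
check_scan_setup() {
    script=${1:-/etc/scanbd/scripts/scan.sh}
    config=${2:-/etc/scanbd/scanbd.conf}

    [ -f "$script" ] || { echo "missing: $script" >&2; return 1; }

    # A freshly downloaded file is not executable; fix with: sudo chmod +x SCRIPT
    [ -x "$script" ] || { echo "not executable: $script" >&2; return 1; }

    # Rough check: the config should mention the script name somewhere.
    name=$(basename "$script")
    grep -qF "$name" "$config" ||
        { echo "$config does not reference $name" >&2; return 1; }

    echo "ok"
}
```

Running check_scan_setup without arguments checks the paths used in the steps above.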

Attachment                      Size
scan.sh                         3.78 KB
scan-2015-08-16t21.46.28.pdf    67.02 KB
scanbd.conf                     1.26 KB