Nextcloud OCR (optical character recognition) processing for images with tesseract-js

OCR

License: AGPL v3

Nextcloud OCR brings optical character recognition (OCR) for images to your Nextcloud, using tesseract-js.
The app runs tesseract-js by @jeromewu in the browser to process images (png, jpeg, tiff, bmp) and saves the resulting PDF file to the source folder in Nextcloud. This makes the recognized text searchable, for example.
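As a rough illustration of the flow described above (check whether a file is one of the listed image types, then place the PDF next to the source), here is a minimal sketch. The names `SUPPORTED_MIME_TYPES`, `isSupportedImage` and `pdfPathFor` are hypothetical, and the naming rule for the output file (same base name with a `.pdf` extension) is an assumption, not necessarily what the app does.

```javascript
// MIME types for the image formats listed above (png, jpeg, tiff, bmp).
// The set itself is illustrative; the app may detect file types differently.
const SUPPORTED_MIME_TYPES = new Set([
  "image/png",
  "image/jpeg",
  "image/tiff",
  "image/bmp",
]);

// Returns true if the given MIME type is one of the listed image formats.
function isSupportedImage(mimeType) {
  return SUPPORTED_MIME_TYPES.has(String(mimeType).toLowerCase());
}

// Assumed naming rule: same folder, same base name, ".pdf" extension.
function pdfPathFor(sourcePath) {
  const slash = sourcePath.lastIndexOf("/");
  const folder = sourcePath.slice(0, slash + 1); // "" when there is no folder part
  const name = sourcePath.slice(slash + 1);
  const dot = name.lastIndexOf(".");
  const base = dot > 0 ? name.slice(0, dot) : name; // keep dotfiles and extensionless names whole
  return folder + base + ".pdf";
}
```

With these helpers, `pdfPathFor("/Photos/scan.png")` yields a path in the same `/Photos/` folder, which matches the "saves to the source folder" behaviour described above.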

Prerequisites, Requirements and Dependencies

The OCR app has some prerequisites:

  • Nextcloud 16 and up
  • A recent version of a modern web browser (Chrome, Edge, Firefox, Opera, Safari*)
  • About 200 MB of disk space on your server for the Tesseract traineddata (installed automatically)

* On Safari, there is currently a problem with the Content-Security-Policy: an administrator has to add ‘unsafe-eval’ to the ‘script-src’ directive for the app to work properly. Because this is quite insecure, the app does not set it itself; whether to enable it is your decision and at your own risk (please make sure you know what CSP is and what e.g. ‘unsafe-eval’ allows).
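To make the Safari note concrete, the sketch below shows what adding ‘unsafe-eval’ to the ‘script-src’ directive means at the level of the policy string. The helper `addUnsafeEval` is hypothetical and only manipulates text; it is not how Nextcloud or this app configures its CSP, and the fallback used when no ‘script-src’ directive exists is an assumption. Treat it as an illustration of the resulting header value, not as a recommended setting.

```javascript
// Takes a Content-Security-Policy string and returns a copy in which
// the script-src directive contains 'unsafe-eval'.
function addUnsafeEval(csp) {
  // Split the policy into directives such as "script-src 'self'".
  const directives = csp
    .split(";")
    .map((d) => d.trim())
    .filter((d) => d.length > 0);

  let found = false;
  const updated = directives.map((directive) => {
    const parts = directive.split(/\s+/);
    if (parts[0].toLowerCase() !== "script-src") {
      return directive;
    }
    found = true;
    // Only add the keyword if it is not already present.
    if (!parts.includes("'unsafe-eval'")) {
      parts.push("'unsafe-eval'");
    }
    return parts.join(" ");
  });

  // Assumption: if the policy has no script-src, add one limited to 'self'.
  if (!found) {
    updated.push("script-src 'self' 'unsafe-eval'");
  }
  return updated.join("; ");
}

console.log(addUnsafeEval("default-src 'none'; script-src 'self'"));
```

Note that ‘unsafe-eval’ permits `eval()` and similar string-to-code mechanisms for every script the policy allows, which is why the decision is left to the administrator.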

Installation

Install the app from the Nextcloud AppStore, or download the release package from GitHub (NOT the sources) and place its contents in nextcloud/apps/ocr/.

Disclaimer

The software is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied.

Note

Version 3 and earlier are no longer supported or maintained by the author. If you need asynchronous background processing, please fork the repository and use the “not-maintained” branch to work on improvements. The author could not keep supporting it because it required too much effort.
Moreover, this project is based on a WebAssembly port of Tesseract. The maintainer has stopped working on PDF processing in this app and will start working on a separate app for PDF handling.