A powerful command-line OCR tool built with Apple's Vision framework, supporting single image and batch processing with detailed positional information output.
macOS 13 or later is recommended for the best OCR recognition.
Ensure Xcode and Command Line Tools are installed
Clone the repository:
git clone https://github.com/your-username/macos-vision-ocr.git
cd macos-vision-ocr
For Apple Silicon (arm64):
swift build -c release --arch arm64
For Intel (x86_64):
swift build -c release --arch x86_64
Process a single image and output to console:
./macos-vision-ocr --img ./images/handwriting.webp
Process with custom output directory:
./macos-vision-ocr --img ./images/handwriting.webp --output ./images
Recognition languages can be specified using the --rec-langs option. For example:
./macos-vision-ocr --img ./images/handwriting.webp --rec-langs "zh-Hans, zh-Hant, en-US"
Process multiple images in a directory:
./macos-vision-ocr --img-dir ./images --output-dir ./output
Merge all results into a single file:
./macos-vision-ocr --img-dir ./images --output-dir ./output --merge
Enable debug mode to visualize text detection:
./macos-vision-ocr --img ./images/handwriting.webp --debug
Options:
--img <path> Path to a single image file
--output <path> Output directory for single image mode
--img-dir <path> Directory containing images for batch mode
--output-dir <path> Output directory for batch mode
--merge Merge all text outputs into a single file in batch mode
--debug Debug mode: Draw bounding boxes on the image
--lang Show supported recognition languages
--help Show help information
The tool outputs JSON with the following structure:
{
"texts": "The Llama 3.2-Vision Collection of multimodal large langyage model5 (LLMS) is a\ncollection of instruction-tuned image reasoning generative models in l1B and 90B\nsizes (text + images in / text ovt). The Llama 3.2-Vision instruction-tuned models\nare optimized for visval recognittion, iage reasoning, captioning, and answering\ngeneral qvestions about an iage. The models outperform many of the available\nopen Source and Closed multimodal models on common industry benchmarKs.",
"info": {
"filepath": "./images/handwriting.webp",
"width": 1600,
"filename": "handwriting.webp",
"height": 720
},
"observations": [
{
"text": "The Llama 3.2-Vision Collection of multimodal large langyage model5 (LLMS) is a",
"confidence": 0.5,
"quad": {
"topLeft": {
"y": 0.28333333395755611,
"x": 0.09011629800287288
},
"topRight": {
"x": 0.87936045388666206,
"y": 0.28333333395755611
},
"bottomLeft": {
"x": 0.09011629800287288,
"y": 0.35483871098527953
},
"bottomRight": {
"x": 0.87936045388666206,
"y": 0.35483871098527953
}
}
}
]
}
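The quad coordinates in each observation are normalized to the image size (values between 0 and 1), so pixel positions can be recovered by scaling with the width and height from the info block. The helper below is a minimal sketch of that conversion; the quadToPixels name is illustrative, and the assumption that the origin is the image's top-left corner (y increasing downward, as the sample output above suggests) is mine, not something the tool documents.

// Sketch: convert a normalized quad to pixel coordinates.
// Assumption: quad values are fractions of the image size with the origin at
// the top-left corner, consistent with the sample output above.
function quadToPixels(observation, info) {
  const toPixels = ({ x, y }) => ({
    x: Math.round(x * info.width),
    y: Math.round(y * info.height),
  });
  const { topLeft, topRight, bottomLeft, bottomRight } = observation.quad;
  return {
    topLeft: toPixels(topLeft),
    topRight: toPixels(topRight),
    bottomLeft: toPixels(bottomLeft),
    bottomRight: toPixels(bottomRight),
  };
}

// Example: pixel position of the first detected line
// const box = quadToPixels(result.observations[0], result.info);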
When using --debug, the tool draws bounding boxes around the detected text on the image.
Here’s an example of how to use macos-vision-ocr in a Node.js application:
const { exec } = require("child_process");
const util = require("util");
const execPromise = util.promisify(exec);
async function performOCR(imagePath, outputDir = null) {
try {
// Construct the command
let command = `./macos-vision-ocr --img "${imagePath}"`;
if (outputDir) {
command += ` --output "${outputDir}"`;
}
// Execute the OCR command
const { stdout, stderr } = await execPromise(command);
if (stderr) {
console.error("Error:", stderr);
return null;
}
// Parse the JSON output
console.log("stdout:", stdout);
const result = JSON.parse(stdout);
return result;
} catch (error) {
console.error("OCR processing failed:", error);
return null;
}
}
// Example usage
async function example() {
const result = await performOCR("./images/handwriting.webp");
if (result) {
console.log("Extracted text:", result.texts);
console.log("Text positions:", result.observations);
}
}
example();
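For batch workflows you can either call --img-dir directly (as shown earlier) or loop over a directory with the single-image helper above. The sketch below takes the second approach; the performBatchOCR name and the extension filter are illustrative assumptions, not part of the tool.

const fs = require("fs/promises");
const path = require("path");

// Sketch: run performOCR (defined above) over every image in a directory.
// Assumption: the extension filter matches the image formats you actually use.
async function performBatchOCR(dirPath) {
  const entries = await fs.readdir(dirPath);
  const images = entries.filter((name) => /\.(png|jpe?g|webp|tiff?)$/i.test(name));
  const results = [];
  for (const name of images) {
    const result = await performOCR(path.join(dirPath, name));
    if (result) {
      results.push({ file: name, ...result });
    }
  }
  return results;
}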
Image Loading Fails
No Text Detected
This project is licensed under the MIT License - see the LICENSE file for details.
Built with Swift and Apple's Vision framework.