An open-source OCR API that leverages OpenAI's powerful language models with optimized performance techniques like parallel processing and batching to deliver high-quality text extraction from complex PDF documents. Ideal for businesses seeking efficient document digitization and data extraction solutions.
https://github.com/user-attachments/assets/6b39f3ea-248e-4c29-ac2e-b57de64d5d65
Demo video showcasing the conversion of NASA’s Apollo 17 flight documents, which include unorganized, horizontally and vertically oriented pages, into well-structured Markdown format without any issues.
Our solution offers an optimal balance of affordability, accuracy, and advanced features:
For 1000 documents:
This solution is significantly more affordable than alternatives:
While cost-effectiveness is a major advantage, our solution also provides:
This combination of affordability and advanced features makes this solution stand out in the document processing market. It’s not just about being cheaper; it’s about providing excellent value through reliability, flexibility, and high-quality output.
Clone the Repository
git clone https://github.com/yigitkonur/llm-openai-ocr.git
cd llm-openai-ocr
Create a Virtual Environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install Dependencies
pip install -r requirements.txt
Configure Environment Variables
Create a .env file in the root directory and add the following variables:
OPENAI_API_KEY=your_openai_api_key
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
OPENAI_DEPLOYMENT_ID=your_openai_deployment_id
OPENAI_API_VERSION=your_openai_api_version # Default is "gpt-4o"
BATCH_SIZE=10 # Optional: Default is 1
MAX_CONCURRENT_OCR_REQUESTS=5 # Optional: Default is 5
MAX_CONCURRENT_PDF_CONVERSION=4 # Optional: Default is 4
Note: Replace your_openai_api_key, your_azure_openai_endpoint, and your_openai_deployment_id with your actual OpenAI credentials.
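As a rough illustration of how the variables above could be read at startup, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical names, not taken from this repository, and treating the three credential variables as required is an assumption. The defaults mirror the comments in the .env example. Note that Python does not read a .env file by itself; a loader such as python-dotenv is commonly used to populate the environment first.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names mirror the .env keys above.
    openai_api_key: str
    azure_openai_endpoint: str
    openai_deployment_id: str
    openai_api_version: str
    batch_size: int
    max_concurrent_ocr_requests: int
    max_concurrent_pdf_conversion: int


def load_settings(env=os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    # Assumption: these three have no default and must be present.
    required = ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_DEPLOYMENT_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        azure_openai_endpoint=env["AZURE_OPENAI_ENDPOINT"],
        openai_deployment_id=env["OPENAI_DEPLOYMENT_ID"],
        # Default value as stated in the .env example above.
        openai_api_version=env.get("OPENAI_API_VERSION", "gpt-4o"),
        batch_size=int(env.get("BATCH_SIZE", "1")),
        max_concurrent_ocr_requests=int(env.get("MAX_CONCURRENT_OCR_REQUESTS", "5")),
        max_concurrent_pdf_conversion=int(env.get("MAX_CONCURRENT_PDF_CONVERSION", "4")),
    )
```

Failing fast on missing credentials at startup tends to be easier to debug than a failed request later on.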
Run the Application
uvicorn main:app --reload
The API will be available at http://127.0.0.1:8000.
POST /ocr
You must provide either a file or a URL, not both.
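The "either a file or a URL, not both" rule can be expressed as a small check. This is only a sketch of the rule as stated; `validate_ocr_input` is a hypothetical helper, and the server's actual validation code may look different.

```python
from typing import Optional


def validate_ocr_input(file_bytes: Optional[bytes], url: Optional[str]) -> str:
    """Return "file" or "url"; raise ValueError unless exactly one is given."""
    has_file = file_bytes is not None
    has_url = url is not None and url.strip() != ""
    # Both present or both absent: the request is ambiguous or empty.
    if has_file == has_url:
        raise ValueError("Provide exactly one of: a file or a URL.")
    return "file" if has_file else "url"
```

A client can run a check like this before sending a request, so that an invalid combination is caught locally rather than as an error response.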
curl
Uploading a PDF File:
curl -X POST "http://127.0.0.1:8000/ocr" -F "file=@/path/to/your/document.pdf"
Providing a PDF URL:
curl -X POST "http://127.0.0.1:8000/ocr" -F "ocr_request={\"url\": \"https://example.com/document.pdf\"}" -H "Content-Type: application/json"
200 OK
{
"text": "Extracted and formatted text from the PDF."
}
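On the client side, the response can be handled with a few lines of Python. The sketch below assumes only what this README states: a 200 response carries a JSON object with a "text" field, and any other status indicates a failure. `OCRError` and `extract_text` are hypothetical names introduced for illustration.

```python
import json


class OCRError(Exception):
    """Hypothetical exception for any non-200 response from POST /ocr."""


def extract_text(status_code: int, body: str) -> str:
    """Return the extracted text from a response, or raise OCRError."""
    if status_code != 200:
        raise OCRError(f"OCR request failed with HTTP {status_code}: {body}")
    payload = json.loads(body)
    # The "text" key is the one shown in the 200 OK example above.
    return payload["text"]
```

The function takes the status code and raw body as plain values, so it works with whichever HTTP client library is used to make the request.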
Error Responses
400 Bad Request: Invalid input parameters.
422 Unprocessable Entity: Validation errors.
500 Internal Server Error: Processing errors.
All configurations are managed via environment variables. Ensure you have a .env file set up with the necessary variables as described in the Installation section.
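To give a sense of what BATCH_SIZE and MAX_CONCURRENT_OCR_REQUESTS might control, here is a minimal sketch of batching combined with a concurrency limit, using asyncio. It is not this project's implementation: `make_batches`, `ocr_all`, and `fake_ocr_batch` are hypothetical names, and the fake function stands in for a real model call.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


def make_batches(pages: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split pages into consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(pages[i:i + batch_size]) for i in range(0, len(pages), batch_size)]


async def ocr_all(
    pages: Sequence[str],
    ocr_batch: Callable[[List[str]], Awaitable[str]],
    batch_size: int = 1,
    max_concurrent: int = 5,
) -> List[str]:
    """Run ocr_batch over every batch, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(batch: List[str]) -> str:
        async with semaphore:  # Wait here when the limit is reached.
            return await ocr_batch(batch)

    # gather returns results in the same order as the batches were passed in.
    results = await asyncio.gather(*(run(b) for b in make_batches(pages, batch_size)))
    return list(results)


async def fake_ocr_batch(batch: List[str]) -> str:
    # Stand-in for a real OCR request; it just joins the page labels.
    await asyncio.sleep(0)
    return " ".join(batch)
```

For example, `asyncio.run(ocr_all(["p1", "p2", "p3"], fake_ocr_batch, batch_size=2))` processes two batches, `["p1", "p2"]` and `["p3"]`. Larger batches mean fewer requests, while the semaphore keeps the number of in-flight requests bounded.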
Please note that this project depends on PyMuPDF, whose licensing terms require this project to be released under GNU AGPL v3.0. You can fork this project, replace PyMuPDF with pdf2image, and use it freely. While I don’t have any particular interest in licensing, I am legally obligated to add this information.
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright © 2024 Yiğit Konur
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see https://www.gnu.org/licenses/.