Data Processor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.
Features • What is a Schema? • Use Cases • Getting Started • Documentation
open-extract simplifies the ingestion and processing of unstructured data for those building AI Agents/Agentic Workflows using frameworks such as LangGraph, AG2, and CrewAI.
Extract Relevant Information Seamlessly: Give your applications the ability to identify and extract relevant data from one or many large documents and websites with a single API call. The content comes back as JSON or Markdown, making it easy to integrate into your workflows.
Multi-Schema/Multi-Document Support: Extract data based on one or many predefined schemas from a variety of document types, without needing a vector database or specifying page numbers.
Built-in Caching: Results for previously extracted schemas are served from the cache, enabling rapid repeat extractions without reprocessing the original documents.
No Vendor Lock-In: Use the model provider of your choice. Whether you use open-source or closed-source models, you're never tied to a specific vendor.
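To illustrate the idea behind repeat extractions, here is a minimal sketch of a cache keyed on document content plus schema. It is an assumption about how such caching could work, not open-extract's actual implementation; `ExtractionCache`, `make_key`, `get_or_extract`, and `fake_extract` are hypothetical names introduced only for this example.

```python
import hashlib
import json


class ExtractionCache:
    """Illustrative in-memory cache; not open-extract's real implementation."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(document_bytes, schema):
        # Hash the document content together with a canonical form of the
        # schema, so the same document and schema always map to the same key.
        canonical_schema = json.dumps(schema, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256()
        digest.update(document_bytes)
        digest.update(b"\x00")  # separator between document and schema
        digest.update(canonical_schema)
        return digest.hexdigest()

    def get_or_extract(self, document_bytes, schema, extract_fn):
        key = self.make_key(document_bytes, schema)
        if key not in self._store:
            # Cache miss: run the expensive extraction once and remember it.
            self._store[key] = extract_fn(document_bytes, schema)
        return self._store[key]


# Stand-in for a real extraction call; it only records that it was invoked.
calls = []


def fake_extract(document_bytes, schema):
    calls.append(1)
    return {field: None for field in schema}


cache = ExtractionCache()
doc = b"quarterly report text"
schema = {"Firm": "The name of the firm"}
first = cache.get_or_extract(doc, schema, fake_extract)
second = cache.get_or_extract(doc, schema, fake_extract)  # served from cache
```

In this sketch the second call never reaches `fake_extract`, while a different document or a different schema produces a new key and triggers a fresh extraction.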
A schema is a set of key-value pairs describing what needs to be extracted from a particular document.
{
"Firm": "The name of the firm",
"Number of Funds": "The number of funds managed by the firm",
"Commitment": "The commitment amount in millions of dollars",
"% of Total Comm": "The percentage of total commitment",
"Exposure (FMV + Unfunded)": "The exposure including fair market value and unfunded commitments in millions of dollars",
"% of Total Exposure": "The percentage of total exposure",
"TVPI": "Total Value to Paid-In multiple",
"Net IRR": "Net Internal Rate of Return as a percentage"
}
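As a minimal sketch, a schema like the one above can be built and checked in Python before it is used. The helper `validate_schema` is hypothetical and not part of open-extract; it only enforces the shape described here, namely non-empty field names mapped to non-empty text descriptions.

```python
import json

# Field names mapped to plain-language descriptions, as in the example above
# (shortened to three fields).
schema = {
    "Firm": "The name of the firm",
    "TVPI": "Total Value to Paid-In multiple",
    "Net IRR": "Net Internal Rate of Return as a percentage",
}


def validate_schema(candidate):
    """Hypothetical helper: return a list of problems found in a schema."""
    if not isinstance(candidate, dict):
        return ["schema must be a JSON object (dict)"]
    problems = []
    if not candidate:
        problems.append("schema has no fields")
    for key, description in candidate.items():
        if not isinstance(key, str) or not key.strip():
            problems.append(f"invalid field name: {key!r}")
        if not isinstance(description, str) or not description.strip():
            problems.append(f"field {key!r} needs a non-empty description")
    return problems


issues = validate_schema(schema)
if issues:
    raise SystemExit("\n".join(issues))

# Serialize to the JSON form shown above.
print(json.dumps(schema, indent=2))
```

Descriptive values matter because they are the only guidance about what each field means, so an empty description is treated as an error in this sketch.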
| Financial Report Analysis | Customer Feedback Processing | Research Assistant | Legal Contract Parsing |
|---|---|---|---|
| Extract key financial metrics from quarterly PDF reports | Categorize feedback from various document types | Process research papers, extracting methodologies and findings | Extract key legal terms and conditions from contracts |
To build the platform from source, run the following command:
./start-oe.sh
Once the platform is running, you can test it by trying one of our examples.
Navigate to the examples folder:
cd examples
Navigate to the scripts or notebooks folder:
cd scripts
or
cd notebooks/autogen_example
Run one of our example scripts:
python azure_example.py