File Parser optimised for LLM Ingestion with no loss. Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on having no information loss during parsing.
https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3
Required Python version: >= 3.11
pip install megaparse
Add your OpenAI or Anthropic API key to the .env file
Install poppler on your computer (images and PDFs)
Install tesseract on your computer (images and PDFs)
If you are on a Mac, you also need to install libmagic: brew install libmagic
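Before running the examples, it can help to confirm that the prerequisites listed above are actually in place. The following is a minimal sketch, not part of MegaParse: the helper name `missing_prerequisites` is hypothetical, it assumes poppler exposes the `pdftoppm` executable and tesseract exposes `tesseract` on your PATH, and it assumes the Anthropic key is named `ANTHROPIC_API_KEY`. It only reads the process environment, so a key stored in .env must already have been loaded.

```python
import os
import shutil

# Assumed executable names: poppler usually ships `pdftoppm`,
# tesseract usually ships `tesseract`. Adjust for your platform.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}

# OPENAI_API_KEY is used in the README examples; ANTHROPIC_API_KEY is an assumed name.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_prerequisites(which=shutil.which, environ=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    environ = os.environ if environ is None else environ
    problems = []
    for package, executable in REQUIRED_TOOLS.items():
        # shutil.which returns None when the executable is not on PATH.
        if which(executable) is None:
            problems.append(f"{package}: `{executable}` not found on PATH")
    if not any(environ.get(name) for name in API_KEY_VARS):
        problems.append("no API key set: expected one of " + ", ".join(API_KEY_VARS))
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing ->", problem)
```

The `which` and `environ` parameters exist only so the check can be exercised without touching the real system; in normal use, call `missing_prerequisites()` with no arguments.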
from megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser
parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")
import os
from megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.parser.megaparse_vision import MegaParseVision
# The API key is read from the environment (see the .env step above).
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY")) # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")
Note: The models supported by MegaParse Vision are the multimodal ones, such as Claude 3.5, Claude 4, GPT-4o and GPT-4.
There is a Makefile for you. Simply run:
make dev
at the root of the project and you are good to go.
See localhost:8000/docs for more info on the different endpoints.
Parser | similarity_ratio |
---|---|
megaparse_vision | 0.87 |
unstructured_with_check_table | 0.77 |
unstructured | 0.59 |
llama_parser | 0.33 |
Higher is better.
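To give a feel for what a score like this measures, here is a minimal sketch. It assumes `similarity_ratio` is a character-level similarity between a parser's output and a hand-written reference Markdown file, computed here with Python's standard `difflib`. That is an assumption for illustration only; the README does not state the exact metric, and the helper name and sample strings below are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_ratio(reference: str, parsed: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, reference, parsed).ratio()


# Hypothetical sample: a small reference table and a lossy parse of it.
reference = "| Name | Qty |\n|---|---|\n| Bolt | 4 |"
lossy = "Name Qty Bolt 4"

print(round(similarity_ratio(reference, reference), 2))  # identical texts score 1.0
print(round(similarity_ratio(reference, lossy), 2))      # lost table structure lowers the score
```

Under this assumption, a parser that drops table structure or reorders text is penalised even when all the words survive, which matches the "no information loss" goal stated at the top.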
Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py and then run python evaluations/script.py. If it scores better, open a PR. Let's go higher together.