Web scraping framework built for AI applications. Extract clean, structured content from any website with dynamic content handling, markdown conversion, and intelligent crawling capabilities. Perfect for RAG applications and AI training data pipelines. Features async processing, browser management, and Prometheus monitoring.
eGet is a high-performance, production-grade web scraping framework built with Python.
It provides a robust API for extracting content from web pages with features like dynamic content handling, structured data extraction, and extensive customization options.
eGet transforms complex websites into AI-ready content with a single API call, handling everything from JavaScript-rendered pages to dynamic content while delivering clean, structured markdown that's ready for RAG applications.
Its crawling capabilities and intelligent content extraction let developers build comprehensive knowledge bases by turning any website into high-quality training data, making it a practical tool for teams that need to process web content at scale.
Dynamic Content Handling:
Content Extraction:
Performance & Reliability:
Additional Features:
eGet/
├── api/
│   ├── __init__.py
│   └── v1/
│       └── endpoints/
│           ├── crawler.py              # Crawler endpoint
│           ├── scraper.py              # Scraper endpoint
│           ├── chunker.py              # Semantic chunking endpoint
│           └── converter.py            # File conversion endpoint
├── core/
│   ├── __init__.py
│   ├── config.py                       # Enhanced settings and configuration
│   ├── exceptions.py                   # Extended custom exception classes
│   └── logging.py                      # Logging configuration
├── models/
│   ├── __init__.py
│   ├── crawler_request.py              # Crawler request models
│   ├── crawler_response.py             # Crawler response models
│   ├── request.py                      # Scraper request models
│   ├── response.py                     # Scraper response models
│   ├── chunk_request.py                # Chunk request models
│   ├── chunk_response.py               # Chunk response models
│   └── file_conversion_models.py       # File conversion models
├── services/
│   ├── cache/
│   │   ├── __init__.py
│   │   └── cache_service.py            # Enhanced cache implementation
│   ├── crawler/
│   │   ├── __init__.py
│   │   ├── crawler_service.py          # Main crawler implementation
│   │   ├── link_extractor.py           # Enhanced URL extraction
│   │   └── queue_manager.py            # Advanced queue management
│   ├── chunker/
│   │   ├── __init__.py
│   │   ├── chunk_service.py            # Chunk service implementation
│   │   ├── semantic_chunker.py         # Enhanced chunking implementation
│   │   └── markdown_parser.py          # Advanced markdown parsing
│   ├── converters/                     # Document conversion services
│   │   ├── __init__.py
│   │   ├── base_converter.py           # Base converter abstract class
│   │   ├── document_structure.py       # Document structure management
│   │   ├── file_utils.py               # File handling utilities
│   │   ├── converter_factory.py        # Converter instantiation factory
│   │   ├── conversion_service.py       # Main conversion orchestrator
│   │   └── converters/                 # Individual converter implementations
│   │       ├── __init__.py
│   │       ├── pdf_converter.py        # PDF conversion implementation
│   │       ├── docx_converter.py       # DOCX conversion implementation
│   │       └── xlsx_converter.py       # XLSX conversion implementation
│   ├── extractors/
│   │   ├── structured_data.py          # Enhanced structured data extraction
│   │   └── validators.py               # Extended data validation
│   └── scraper/
│       ├── __init__.py
│       └── scraper.py                  # Enhanced scraper implementation
├── .env.template                       # Extended environment template
├── docker-compose.yml                  # Base Docker composition
├── docker-compose.dev.yml              # Development Docker composition
├── docker-compose.prod.yml             # Production Docker composition
├── Dockerfile                          # Enhanced Docker build
├── main.py                             # Enhanced application entry
├── prometheus.yml                      # Prometheus monitoring config
└── requirements.txt                    # Updated Python dependencies
git clone https://github.com/yourusername/eget.git
cd eget
# Create virtual environment
python -m venv venv
# Activate on Windows
.\venv\Scripts\activate
# Activate on Unix or MacOS
source venv/bin/activate
pip install -r requirements.txt
playwright install chromium
Create a .env file:
DEBUG=True
LOG_LEVEL=INFO
PORT=8000
WORKERS=1
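With the environment in place, the API can be started locally. A minimal launcher sketch, assuming main.py exposes a FastAPI instance named app; the host, port, and reload flag shown here are illustrative choices, not project defaults:
import uvicorn

if __name__ == "__main__":
    # "main:app" assumes the FastAPI instance in main.py is named "app";
    # adjust host and port to match your .env settings.
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)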
We provide two environments for running eGet:
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up --build -d
This will start:
- eGet API service on port 8000 (with hot-reload)
- Redis cache on port 6379
- Prometheus monitoring on port 9090
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
This starts production services with:
- Optimized resource limits
- Proper restart policies
- Security configurations
- Redis cache
- Prometheus monitoring
Configure the service through environment variables:
environment:
  # API Settings
  - DEBUG=false
  - LOG_LEVEL=INFO
  - WORKERS=4
  - MAX_CONCURRENT_SCRAPES=5
  - TIMEOUT=30
  # Cache Settings
  - CACHE_ENABLED=true
  - CACHE_TTL=86400  # Cache duration in seconds (24 hours)
  - REDIS_URL=redis://redis:6379
  # Chrome Settings
  - PYTHONUNBUFFERED=1
  - CHROME_BIN=/usr/bin/google-chrome
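These variables are read at service startup. As an illustration only (the real core/config.py may use a different mechanism, such as pydantic settings), here is a sketch of how they might be consumed in application code:
# Illustrative only; the actual core/config.py may differ.
import os

class Settings:
    DEBUG = os.getenv("DEBUG", "false").lower() == "true"
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
    WORKERS = int(os.getenv("WORKERS", "4"))
    MAX_CONCURRENT_SCRAPES = int(os.getenv("MAX_CONCURRENT_SCRAPES", "5"))
    TIMEOUT = int(os.getenv("TIMEOUT", "30"))
    CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
    CACHE_TTL = int(os.getenv("CACHE_TTL", "86400"))
    REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379")

settings = Settings()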
import requests

def scrape_page():
    url = "http://localhost:8000/api/v1/scrape"

    # Configure scraping options
    payload = {
        "url": "https://example.com",
        "formats": ["markdown", "html"],
        "onlyMainContent": True,
        "includeScreenshot": False,
        "includeRawHtml": False,
        "waitFor": 2000,  # Wait for dynamic content
        "extract": {
            "custom_config": {
                "remove_ads": True,
                "extract_tables": True
            }
        }
    }

    response = requests.post(url, json=payload)
    result = response.json()

    if result["success"]:
        # Access extracted content
        markdown_content = result["data"]["markdown"]
        html_content = result["data"]["html"]
        metadata = result["data"]["metadata"]
        structured_data = result["data"]["structured_data"]

        print(f"Title: {metadata.get('title')}")
        print(f"Language: {metadata.get('language')}")
        print("\nContent Preview:")
        print(markdown_content[:500])

        # The extracted content is clean and ready for:
        # 1. Creating embeddings for vector search
        # 2. Feeding into LLMs as context
        # 3. Knowledge graph construction
        # 4. Document indexing

    return result
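In production you may want to guard the call against transport errors and unsuccessful scrapes. A sketch of a simple retry wrapper around the same endpoint; the retry count, backoff, and the longer second waitFor are arbitrary choices for illustration, not part of the API:
import time
import requests

def scrape_with_retry(target_url: str, retries: int = 3) -> dict:
    api = "http://localhost:8000/api/v1/scrape"
    payload = {"url": target_url, "formats": ["markdown"], "onlyMainContent": True, "waitFor": 2000}

    for attempt in range(1, retries + 1):
        try:
            response = requests.post(api, json=payload, timeout=60)
            response.raise_for_status()
            result = response.json()
            if result.get("success"):
                return result
            # Give slow, JavaScript-heavy pages more time on the next attempt
            payload["waitFor"] = 5000
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
        time.sleep(2 * attempt)  # simple linear backoff

    raise RuntimeError(f"Failed to scrape {target_url} after {retries} attempts")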
import requests
from typing import List, Dict

def crawl_site_for_rag() -> List[Dict]:
    url = "http://localhost:8000/api/v1/crawl"

    # Configure crawling parameters
    payload = {
        "url": "https://example.com",
        "max_depth": 2,   # How deep to crawl
        "max_pages": 50,  # Maximum pages to process
        "exclude_patterns": [
            r"\/api\/.*",                # Skip API endpoints
            r".*\.(jpg|jpeg|png|gif)$",  # Skip image files
            r"\/tag\/.*",                # Skip tag pages
            r"\/author\/.*"              # Skip author pages
        ],
        "include_patterns": [
            r"\/blog\/.*",  # Focus on blog content
            r"\/docs\/.*"   # And documentation
        ],
        "respect_robots_txt": True
    }

    response = requests.post(url, json=payload)
    pages = response.json()

    # Process crawled pages for RAG
    processed_documents = []
    for page in pages:
        doc = {
            "url": page["url"],
            "content": page["markdown"],
            "metadata": {
                "title": page.get("structured_data", {}).get("metaData", {}).get("title"),
                "description": page.get("structured_data", {}).get("metaData", {}).get("description"),
                "language": page.get("structured_data", {}).get("metaData", {}).get("language")
            }
        }
        processed_documents.append(doc)

    return processed_documents

# Usage in RAG pipeline
documents = crawl_site_for_rag()

# Now you can:
# 1. Create embeddings for each document
# 2. Store in vector database
# 3. Use for retrieval in RAG applications
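From here, the processed documents can feed any embedding and vector-store pipeline. A minimal sketch of the chunk-and-index step, where embed_text and vector_store are placeholders for whatever embedding model and vector database client you use:
def chunk_text(text: str, max_chars: int = 1500) -> list:
    # Naive paragraph-based chunking for illustration; eGet also exposes a
    # semantic chunking endpoint that may be a better fit in production.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def index_documents(documents, embed_text, vector_store):
    # embed_text(str) -> list[float] and vector_store.add(...) are
    # placeholders for your embedding model and vector database client.
    for doc in documents:
        for i, chunk in enumerate(chunk_text(doc["content"])):
            vector_store.add(
                id=f'{doc["url"]}#{i}',
                vector=embed_text(chunk),
                metadata={**doc["metadata"], "url": doc["url"]},
            )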
The scraper returns clean, structured data ready for AI processing:
{
    "success": True,
    "data": {
        "markdown": "# Main Title\n\nClean, processed content...",
        "html": "<div>Clean HTML content...</div>",
        "metadata": {
            "title": "Page Title",
            "description": "Page description",
            "language": "en",
            "sourceURL": "https://example.com",
            "statusCode": 200
        },
        "structured_data": {
            "jsonLd": [...],       # JSON-LD data
            "openGraph": {...},    # OpenGraph metadata
            "twitterCard": {...},  # Twitter Card data
            "metaData": {...}      # Additional metadata
        }
    }
}
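The structured_data block can be used to enrich document metadata before indexing. A small sketch that prefers OpenGraph values and falls back to the page metadata; it assumes the openGraph map uses plain keys like title and description, so adjust to the actual key names returned by your deployment:
def extract_metadata(result: dict) -> dict:
    data = result.get("data", {})
    meta = data.get("metadata", {})
    # Key names inside openGraph are an assumption; inspect a real response.
    og = data.get("structured_data", {}).get("openGraph", {}) or {}

    return {
        "title": og.get("title") or meta.get("title"),
        "description": og.get("description") or meta.get("description"),
        "language": meta.get("language"),
        "source_url": meta.get("sourceURL"),
    }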
The API exposes Prometheus metrics at /metrics:
- scraper_requests_total: Total scrape requests
- scraper_errors_total: Error count
- scraper_duration_seconds: Scraping duration
Access the Prometheus dashboard at http://localhost:9090
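For a quick check outside Prometheus, the raw metrics endpoint can be read directly. A small sketch that prints the counters listed above (assumes the API is running on localhost:8000):
import requests

def print_scraper_metrics(base_url: str = "http://localhost:8000") -> None:
    text = requests.get(f"{base_url}/metrics", timeout=10).text
    wanted = ("scraper_requests_total", "scraper_errors_total", "scraper_duration_seconds")
    for line in text.splitlines():
        if line.startswith(wanted):
            print(line)

print_scraper_metrics()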
We welcome contributions! Here's how you can help:
git checkout -b feature/AmazingFeature
python -m venv venv
source venv/bin/activate # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt
pre-commit install
Code Style:
Testing:
Error Handling:
Documentation:
feat: add new feature
fix: resolve bug
docs: update documentation
test: add tests
refactor: code improvements
This project is licensed under the Apache License - see the LICENSE file for details.