A powerful, recursive, URL-aware web scraping tool designed to efficiently collect and organize content from websites. It is ideal for developers, researchers, and data teams who need to extract large amounts of structured textual data from web pages.
All settings live in `config.py`. Links that use excluded protocols such as `mailto:`, `tel:`, `whatsapp:`, etc. are skipped automatically. Use the `target_divs` setting to extract only specific HTML components, such as a title and article body:
"target_divs": {
"title": {
"selector": "#main-content > section > div > div.relative... > header",
"title": "Article Title"
},
"description": {
"selector": "#main-content > section > div > div.relative... > div",
"title": "Article Description"
}
}
Each entry defines:

- `selector`: a CSS selector for the element to capture
- `title`: the label used for that field in the output CSV

The scraper matches and extracts only those components from the page.
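As a rough illustration of how such selectors can be applied, here is a minimal sketch using BeautifulSoup; the helper name `extract_target_divs` and the use of BeautifulSoup are assumptions for illustration, not the scraper's actual internals:

```python
# Illustrative sketch only: applying target_divs-style CSS selectors with
# BeautifulSoup. Names here are hypothetical, not the tool's real API.
from bs4 import BeautifulSoup

def extract_target_divs(html: str, target_divs: dict) -> dict:
    """Return {label: text} for every configured selector that matches the page."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for spec in target_divs.values():
        element = soup.select_one(spec["selector"])  # first matching element only
        if element is not None:
            results[spec["title"]] = element.get_text(strip=True)
    return results
```

Only components whose selector actually matches end up in the result, which mirrors the "extract only those components" behaviour described above.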
```bash
git clone https://github.com/Royofficely/Web-Scraper.git
cd Web-Scraper
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
python agentim.py install
python agentim.py run
```
A full example configuration:

```python
config = {
    "domain": "https://www.example.com",               # root domain to crawl
    "include_keywords": ["blog"],                      # only follow URLs containing these keywords
    "exclude_keywords": ["signup", "login"],           # skip URLs containing these keywords
    "max_depth": 2,                                    # maximum recursion depth for links
    "target_divs": {...},                              # optional CSS-selector targets (see above)
    "start_with": ["https://www.example.com/docs"],    # seed URLs for the crawl
    "split_length": 2000,                              # maximum characters per text chunk
    "excluded_protocols": ["mailto:", "tel:", "whatsapp:"],  # link protocols to ignore
    "max_retries": 5,                                  # retries per failed request
    "base_delay": 1,                                   # base delay in seconds when retrying
    "concurrent_requests": 10,                         # overall concurrency limit
    "connections_per_host": 5,                         # simultaneous connections per host
    "delay_between_requests": 0.5                      # pause in seconds between requests
}
```
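To make the URL filters concrete, here is a minimal sketch of how a link could be tested against this config. The function `should_crawl` is hypothetical and only illustrates the intent of the settings, not the scraper's actual logic:

```python
# Hypothetical helper: shows what the filtering settings mean, not how the
# scraper actually implements them.
from urllib.parse import urlparse

def should_crawl(url: str, config: dict) -> bool:
    """Return True if a URL passes the config's filters (illustrative only)."""
    # Skip non-HTTP links such as mailto:, tel:, whatsapp:
    if any(url.startswith(proto) for proto in config["excluded_protocols"]):
        return False
    # Stay on the configured domain
    if urlparse(url).netloc != urlparse(config["domain"]).netloc:
        return False
    # Keyword filters on the URL itself
    if config["include_keywords"] and not any(k in url for k in config["include_keywords"]):
        return False
    if any(k in url for k in config["exclude_keywords"]):
        return False
    return True
```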
Scraped results are saved as a CSV with the following columns:

- `URL`
- `Chunk`
- `Text`
- `Tag` (if `target_divs` is used)

A few tips:

- Increase `delay_between_requests` if you get rate-limited.
- Use `start_with` to limit the initial crawl scope.
- Adjust the settings in `officely_web_scraper/config.py`, then run `python agentim.py run` to test locally.
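For context, the `Chunk` and `Text` columns come from splitting each page's text into pieces of at most `split_length` characters. Below is a minimal sketch of that idea; the helper names and the use of a numeric chunk index are assumptions, not the tool's actual CSV writer:

```python
import csv

def split_text(text: str, split_length: int = 2000) -> list[str]:
    """Split extracted text into chunks of at most split_length characters."""
    return [text[i:i + split_length] for i in range(0, len(text), split_length)]

def write_rows(path: str, url: str, text: str, tag: str = "") -> None:
    """Append one CSV row per chunk, mirroring the URL/Chunk/Text/Tag columns."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for index, chunk in enumerate(split_text(text)):
            # Assumption: "Chunk" is a running index and "Text" holds the chunk itself.
            writer.writerow([url, index, chunk, tag])
```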
The project is organized as follows:

```
├── agentim.py
├── officely_web_scraper/
│   ├── config.py
│   ├── scan.py
│   └── __init__.py
├── install.sh
├── requirements.txt
└── README.md
```
Pull requests are welcome! Please open an issue for any bugs or suggestions.
MIT License • See LICENSE for details.
Made with ❤️ by Roy Nativ @ Officely AI