πŸ•·οΈ Officely AI Web Scraper

A powerful, recursive, URL-aware web scraping tool designed to efficiently extract structured content from websites. Ideal for developers, researchers, and data teams needing high-volume, high-quality data collection.


πŸš€ Features

  • 🌐 Recursive URL Crawling – Traverse and extract content from linked pages.
  • 🎯 Configurable Depth – Set max depth for recursion to control scope.
  • πŸ” Smart URL Filtering – Include/exclude pages by keyword or prefix.
  • πŸ“ Organized Output – Saves to structured folders by domain.
  • πŸ›‘οΈ Respectful Crawling – Includes retry logic, backoff, and pacing.
  • βš™οΈ Highly Configurable – All logic controlled via config.py.
  • βœ‚οΈ Text Splitting – Splits long texts for better chunking.
  • 🚫 Protocol Exclusion – Skip mailto:, tel:, whatsapp:, and similar link schemes.
  • πŸ” Robust Retry System – With backoff and configurable retries.
  • πŸ”€ Concurrency Control – Define max requests and per-host limits.
  • πŸ•’ Request Pacing – Optional delays to avoid server overload.
  • 🎯 Targeted Extraction – Focus only on specific divs per page.
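
For instance, the keyword and protocol filtering described above can be sketched like this (a minimal illustration; should_crawl and the exact matching rules are assumptions, not the scraper's actual code):

def should_crawl(url, include_keywords, exclude_keywords,
                 excluded_protocols=("mailto:", "tel:", "whatsapp:")):
    # Skip non-web link schemes such as mailto: and tel:
    if url.startswith(tuple(excluded_protocols)):
        return False
    # Skip any URL containing an excluded keyword
    if any(kw in url for kw in exclude_keywords):
        return False
    # When include keywords are set, require at least one match
    if include_keywords and not any(kw in url for kw in include_keywords):
        return False
    return True

# should_crawl("https://www.example.com/blog/post", ["blog"], ["signup"])  -> True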

πŸ§ͺ Example: Targeting Specific Page Sections

Use the target_divs setting to extract only specific HTML components, such as an article's title and description:

"target_divs": {
    "title": {
        "selector": "#main-content > section > div > div.relative... > header",
        "title": "Article Title"
    },
    "description": {
        "selector": "#main-content > section > div > div.relative... > div",
        "title": "Article Description"
    }
}

Each entry defines:

  • selector: a CSS selector identifying the element to extract
  • title: the column label used for that component in the output CSV

The scraper will match and extract only those components from the page.
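
Internally, matching those selectors could be done with BeautifulSoup's CSS-selector support; the extract_targets helper below is a sketch under that assumption, not the project's exact implementation:

from bs4 import BeautifulSoup

def extract_targets(html, target_divs):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for spec in target_divs.values():
        # select_one returns the first element matching the CSS selector, or None
        element = soup.select_one(spec["selector"])
        if element is not None:
            rows.append({"Tag": spec["title"], "Text": element.get_text(strip=True)})
    return rows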


πŸ›  Installation

git clone https://github.com/Royofficely/Web-Scraper.git
cd Web-Scraper
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
python agentim.py install

▢️ Usage

python agentim.py run

βš™οΈ Configuration (config.py)

config = {
    "domain": "https://www.example.com",             # root domain to crawl
    "include_keywords": ["blog"],                    # only follow URLs containing these
    "exclude_keywords": ["signup", "login"],         # skip URLs containing these
    "max_depth": 2,                                  # maximum recursion depth
    "target_divs": {...},                            # CSS selectors to extract (see above)
    "start_with": ["https://www.example.com/docs"],  # seed URLs for the crawl
    "split_length": 2000,                            # maximum length of each text chunk
    "excluded_protocols": ["mailto:", "tel:", "whatsapp:"],  # link schemes to skip
    "max_retries": 5,                                # retry attempts per failed request
    "base_delay": 1,                                 # initial backoff delay in seconds
    "concurrent_requests": 10,                       # total simultaneous requests
    "connections_per_host": 5,                       # simultaneous connections per host
    "delay_between_requests": 0.5                    # pause between requests in seconds
}
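
To see how the retry and concurrency settings might fit together, here is a sketch assuming an aiohttp-based crawler (fetch_with_retries is illustrative; the real scan.py may differ):

import asyncio
import aiohttp

async def fetch_with_retries(session, url, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except aiohttp.ClientError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)

async def main():
    # concurrent_requests and connections_per_host map onto aiohttp's
    # total and per-host connection limits
    connector = aiohttp.TCPConnector(limit=10, limit_per_host=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        print(await fetch_with_retries(session, "https://www.example.com"))

asyncio.run(main())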

πŸ“¦ Output

Scraped results are saved as CSV with columns:

  • URL
  • Chunk
  • Text
  • Tag (present when target_divs is configured)
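
The CSVs can be consumed with the standard library; the path below is hypothetical, since files are organized into per-domain folders:

import csv

# Hypothetical path: actual output is grouped into folders by domain
with open("www.example.com/output.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["URL"], row["Chunk"], row["Text"][:80])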

🧩 Troubleshooting

  • Make sure you’re in the project root directory when running commands.
  • Increase delay_between_requests if rate-limited.
  • Check log output for retries/errors.
  • Use start_with to limit initial crawl scope.

πŸ§‘β€πŸ’» Dev Setup

  1. Install the project as described above.
  2. Make your edits in officely_web_scraper/.
  3. Run python agentim.py run to test your changes locally.

🧱 Project Structure

.
β”œβ”€β”€ agentim.py
β”œβ”€β”€ officely_web_scraper/
β”‚   β”œβ”€β”€ config.py
β”‚   β”œβ”€β”€ scan.py
β”‚   └── __init__.py
β”œβ”€β”€ install.sh
β”œβ”€β”€ requirements.txt
└── README.md

🀝 Contributing

Pull requests are welcome! Please open an issue for any bugs or suggestions.


πŸ“„ License

MIT License β€’ See LICENSE for details


Made with ❀️ by Roy Nativ @ Officely AI