A powerful, recursive URL-smart web scraping tool designed to efficiently collect and organize content from websites. This tool is perfect for developers, researchers, and data enthusiasts who need to extract large amounts of textual data from web pages.
git clone https://github.com/Royofficely/Web-Scraper.git
cd Web-Scraper
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
python agentim.py install
This command installs the package and its dependencies, and creates the initial configuration. After installation, you can run the scraper from the project directory:
python agentim.py run
The scraper's behavior can be customized by editing the config.py file in the officely_web_scraper directory:
config = {
    "domain": "https://www.example.com",  # The main domain URL for scraping
    "include_keywords": None,  # List of keywords to include in URLs
    "exclude_keywords": None,  # List of keywords to exclude from URLs
    "max_depth": 1,  # Maximum recursion depth (None for unlimited)
    "target_div": None,  # Specific div class to target (None for whole page)
    "start_with": None,  # Only follow URLs that start with these prefixes, e.g. ["https://example.com/blog"]
    "split_length": 2000,  # Maximum length of text chunks for CSV rows
    "excluded_protocols": ['whatsapp:', 'tel:', 'mailto:'],  # Protocols to exclude from scraping
    "max_retries": 5,  # Maximum number of retry attempts for failed requests
    "base_delay": 1,  # Base delay (in seconds) for exponential backoff
    "concurrent_requests": 10,  # Maximum number of concurrent requests
    "connections_per_host": 5,  # Maximum number of connections per host
    "delay_between_requests": 0.5,  # Delay (in seconds) between individual requests
}
Adjust these settings according to your scraping needs.
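For example, to crawl only a site's blog section two levels deep and pull text from a specific div, the configuration might look like the sketch below; the domain, keyword lists, and div class are illustrative placeholders, not values shipped with the project.

# Example config.py customization (all values are illustrative placeholders)
config = {
    "domain": "https://www.example.com",
    "include_keywords": ["blog", "docs"],            # keep only URLs containing these words
    "exclude_keywords": ["login", "signup"],         # drop URLs containing these words
    "max_depth": 2,                                  # follow links up to two levels deep
    "target_div": "article-body",                    # scrape only this div class
    "start_with": ["https://www.example.com/blog"],  # restrict crawling to the blog section
    "split_length": 2000,
    "excluded_protocols": ['whatsapp:', 'tel:', 'mailto:'],
    "max_retries": 5,
    "base_delay": 1,
    "concurrent_requests": 5,                        # fewer parallel requests to be polite
    "connections_per_host": 5,
    "delay_between_requests": 1.0,                   # longer pause between requests
}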
The scraped content will be saved in a CSV file within a directory named after the domain you're scraping. The CSV file will contain columns for the URL, scraped text, and chunk number (for split texts).
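To post-process the output, you can load the CSV and stitch the split chunks back together. The snippet below is a sketch only; the file path and exact column order (URL, text, chunk number) are assumptions based on the description above, so adjust them to match the file the scraper actually produces.

import csv
from collections import defaultdict

# Reassemble split text chunks per URL.
# NOTE: the path and column order (URL, text, chunk number) are assumptions;
# check the CSV created in the directory named after your domain.
pages = defaultdict(list)
with open("www.example.com/output.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        url, text, chunk = row[0], row[1], row[2]
        if not chunk.isdigit():  # skip a header row, if present
            continue
        pages[url].append((int(chunk), text))

for url, chunks in pages.items():
    full_text = "".join(text for _, text in sorted(chunks))
    print(f"{url}: {len(full_text)} characters")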
If you encounter any issues: adjust the configuration settings (particularly concurrent_requests and delay_between_requests) if you're experiencing rate limiting or other access issues.
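The max_retries and base_delay settings control retries with exponential backoff. The following is a minimal sketch of how such a retry loop typically looks, assuming an aiohttp-style session; it illustrates the general pattern rather than the exact code in scan.py.

import asyncio
import random

# Illustrative exponential-backoff retry loop.
# `session` is assumed to be an aiohttp.ClientSession; the project's actual
# implementation in scan.py may differ in detail.
async def fetch_with_retries(session, url, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait base_delay * 2^attempt seconds, plus a little jitter.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())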
To set up the project for development, use the same installation command:
python agentim.py install
├── LICENSE
├── README.md
├── agentim.py
├── install.sh
├── officely-scraper
├── officely_web_scraper
│   ├── __init__.py
│   ├── config.py
│   └── scan.py
├── requirements.txt
└── setup.py
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Created with ❤️ by Roy Nativ/Officely AI
For any questions or support, please open an issue on the GitHub repository.