The Web Metadata Extraction Toolkit is designed to streamline the process of extracting, cleaning, and analyzing metadata from websites. Utilizing advanced AI models and custom extraction strategies, this toolkit helps users efficiently gather data like titles, descriptions, and keywords, which are crucial for SEO and content strategy.
This project provides tools for web scraping, data cleaning, and metadata analysis, using Ollama with the qwen2:1.5b model and Crawl4AI for efficient content extraction.
Clone the repository:

```bash
git clone https://github.com/varunsaagar/crawlwithagents.git
cd crawlwithagents
```
Create and activate a virtual environment:

```bash
python -m venv venv

# On macOS/Linux:
source venv/bin/activate

# On Windows:
venv\Scripts\activate.bat
```
Install the dependencies:

```bash
pip install -r requirements.txt
```
Run the main crawler script:

```bash
python crawler.py
```

Alternatively, the individual tools can be executed from the command line:
The Web Scraper Tool scrapes the title, description, and keywords from a specified URL:
```bash
python -m praisonai_tools.tools.web_scraper_tool.web_scraper_tool --url "https://example.com"
```
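Conceptually, that command fetches the page and reads its `<title>` and `<meta>` tags. Below is a minimal standard-library sketch of that idea; the actual tool routes the page through Crawl4AI, so this is only an illustration:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class MetaExtractor(HTMLParser):
    """Collect the <title> text plus description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.meta = {"title": "", "description": "", "keywords": ""}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.meta["title"] += data

html = urlopen("https://example.com").read().decode("utf-8", "replace")
parser = MetaExtractor()
parser.feed(html)
print(parser.meta)
```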
The Data Cleaner Tool cleans and formats the raw metadata extracted by the Web Scraper Tool:
```bash
python -m praisonai_tools.tools.data_cleaner_tool --input "path/to/raw_data.json"
```
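The cleaning logic is internal to the tool, but it amounts to normalizing whitespace and turning the comma-separated keyword string into a list. A sketch of that transformation, assuming a hypothetical `raw_data.json` shape with `title`, `description`, and `keywords` string fields (the file names here are illustrative):

```python
import json

def clean_metadata(raw: dict) -> dict:
    """Trim whitespace, collapse internal spaces, and split keywords into a list."""
    return {
        "title": raw.get("title", "").strip(),
        "description": " ".join(raw.get("description", "").split()),
        "keywords": [k.strip().lower()
                     for k in raw.get("keywords", "").split(",")
                     if k.strip()],
    }

with open("raw_data.json") as f:            # path is illustrative
    cleaned = clean_metadata(json.load(f))

with open("cleaned_data.json", "w") as f:
    json.dump(cleaned, f, indent=2)
```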
The Metadata Analyzer Tool analyzes the cleaned metadata to provide insights such as title length, description length, and keyword analysis:
```bash
python -m praisonai_tools.tools.metadata_analyzer_tool --input "path/to/cleaned_data.json"
```
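The exact metrics are defined inside the tool; the sketch below shows the kind of analysis the description implies, assuming the cleaned JSON shape from the previous example:

```python
import json
from collections import Counter

with open("cleaned_data.json") as f:        # path is illustrative
    meta = json.load(f)

report = {
    # Common SEO guidance: titles around 50-60 chars, descriptions around 150-160.
    "title_length": len(meta.get("title", "")),
    "description_length": len(meta.get("description", "")),
    "keyword_count": len(meta.get("keywords", [])),
    "repeated_keywords": [k for k, n in Counter(meta.get("keywords", [])).items()
                          if n > 1],
}
print(json.dumps(report, indent=2))
```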
The tools are configured to work with the crawl4ai library and the Ollama qwen2:1.5b model. You can modify the extraction strategy and other parameters in the script.
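As an illustration of what such a configuration can look like, here is a minimal sketch using Crawl4AI's `LLMExtractionStrategy` with a local Ollama model. It assumes the older synchronous `WebCrawler` API (newer Crawl4AI releases use `AsyncWebCrawler` with a run config), and the provider string and instruction text are assumptions, not values taken from this repository's scripts:

```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Route extraction through the local Ollama server. The provider id is an
# assumption and should match the model pulled via `ollama pull qwen2:1.5b`.
strategy = LLMExtractionStrategy(
    provider="ollama/qwen2:1.5b",
    api_token="not-needed-for-ollama",  # Ollama runs locally; no API key required
    instruction="Return the page title, meta description, and keywords as JSON.",
)

crawler = WebCrawler()
crawler.warmup()  # initialize the crawling pipeline once
result = crawler.run(url="https://example.com", extraction_strategy=strategy)
print(result.extracted_content)
```

Swapping in a different Ollama model or rewriting the `instruction` prompt is the usual way to tune what the extraction returns.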
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.
Special thanks to the Crawl4AI, Ollama, and PraisonAI projects and their maintainers.
For any queries or technical support, please contact varunsaagar.