A powerful and flexible web crawler that uses Groq's LLM API to intelligently extract structured data from any website. Whether you're scraping products, articles, business listings, or any other content, the crawler can be configured to handle it, making it a good fit for data scientists, researchers, and developers who need to gather and analyze web data.
Key features:
Intelligent Extraction
Flexible Targeting
Easy Configuration
Data Management
Safe & Respectful
To create a new crawler configuration:
Use the generator script:
python create_config.py
This will guide you through creating a new configuration by asking for the details of your target site (base URL, CSS selector, required fields, and so on).
Or create one manually using the templates shown below.
First, install the required dependencies:
pip install -r requirements.txt
Create a .env file with your Groq API key:
GROQ_API_KEY=your_api_key_here
Get your API key from Groq Console
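The crawler reads this key from the environment at runtime. Here is a minimal sketch of how that loading typically looks, assuming the python-dotenv package is available (check requirements.txt for the actual dependencies):

# Sketch: load the Groq API key from .env (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the working directory
api_key = os.getenv("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")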
config.py: Contains crawler configurations for different use cases
main.py: The entry point of the crawler
First, try the test configuration to verify your setup:
test: Local testing configuration
python main.py --config test
Then use one of the following templates for your scraping scenario:
"ecommerce": {
**DEFAULT_CONFIG,
"BASE_URL": "https://example-store.com/products",
"CSS_SELECTOR": "div.product-card",
"REQUIRED_KEYS": [
"name",
"price",
"description"
],
"OPTIONAL_KEYS": [
"sku",
"category",
"stock_status",
"rating",
"reviews_count",
"image_url"
],
"LLM_CONFIG": {
**DEFAULT_CONFIG["LLM_CONFIG"],
"INSTRUCTION": """
Extract product information from each product card:
- Name: Product title
- Price: Current price (remove currency symbol)
- Description: Product description
- SKU: Product identifier
- Category: Product category
- Stock Status: In stock/Out of stock
- Rating: Numerical rating
- Reviews Count: Number of reviews
- Image URL: Product image URL
"""
}
}
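A record extracted with the ecommerce template above carries the keys it declares; a purely hypothetical example of one scraped item (all values are illustrative):

# Illustrative only: one product record with the keys defined by the template
{
    "name": "Example Wireless Mouse",
    "price": "24.99",
    "description": "Compact wireless mouse with USB receiver",
    "sku": "EX-12345",
    "category": "Accessories",
    "stock_status": "In stock",
    "rating": "4.5",
    "reviews_count": "132",
    "image_url": "https://example-store.com/images/mouse.jpg"
}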
"news": {
**DEFAULT_CONFIG,
"BASE_URL": "https://example-news.com",
"CSS_SELECTOR": "article.news-item",
"REQUIRED_KEYS": [
"title",
"content",
"date_published"
],
"OPTIONAL_KEYS": [
"author",
"category",
"tags",
"image_url",
"comments_count"
],
"CRAWLER_CONFIG": {
**DEFAULT_CONFIG["CRAWLER_CONFIG"],
"MULTI_PAGE": True,
"MAX_PAGES": 3,
"DELAY_BETWEEN_PAGES": 5
}
}
"jobs": {
**DEFAULT_CONFIG,
"BASE_URL": "https://example-jobs.com/listings",
"CSS_SELECTOR": "div.job-posting",
"REQUIRED_KEYS": [
"title",
"company",
"location",
"description"
],
"OPTIONAL_KEYS": [
"salary",
"job_type",
"experience_level",
"posted_date",
"benefits",
"skills_required"
],
"CRAWLER_CONFIG": {
**DEFAULT_CONFIG["CRAWLER_CONFIG"],
"MULTI_PAGE": True,
"MAX_PAGES": 10
}
}
"real_estate": {
**DEFAULT_CONFIG,
"BASE_URL": "https://example-realty.com/listings",
"CSS_SELECTOR": "div.property-listing",
"REQUIRED_KEYS": [
"address",
"price",
"bedrooms",
"bathrooms",
"square_feet"
],
"OPTIONAL_KEYS": [
"property_type",
"year_built",
"lot_size",
"amenities",
"agent_info",
"images",
"description"
],
"LLM_CONFIG": {
**DEFAULT_CONFIG["LLM_CONFIG"],
"INSTRUCTION": """
Extract property information from each listing:
- Address: Full property address
- Price: Listing price (numbers only)
- Bedrooms: Number of bedrooms
- Bathrooms: Number of bathrooms
- Square Feet: Property size
- Property Type: House/Condo/etc.
- Year Built: Construction year
- Lot Size: Land area
- Amenities: List of features
- Agent Info: Contact information
- Images: Property image URLs
- Description: Property description
"""
}
}
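Every template above spreads DEFAULT_CONFIG and overrides only the fields it needs; the real defaults live in config.py. As a purely hypothetical sketch of the shape that dictionary takes (field names follow the templates above, values are placeholders):

# Hypothetical sketch of DEFAULT_CONFIG; see config.py for the actual values
DEFAULT_CONFIG = {
    "BASE_URL": "",
    "CSS_SELECTOR": "",
    "REQUIRED_KEYS": [],
    "OPTIONAL_KEYS": [],
    "LLM_CONFIG": {
        "INSTRUCTION": ""
    },
    "CRAWLER_CONFIG": {
        "MULTI_PAGE": False,
        "MAX_PAGES": 1,
        "DELAY_BETWEEN_PAGES": 2
    }
}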
List available configurations:
python main.py --list
Run with your configuration:
python main.py --config test # Start with test configuration
python main.py --config ecommerce # Run e-commerce scraping
python main.py --config news # Run news article scraping
python main.py --config jobs # Run job listing scraping
python main.py --config real_estate # Run real estate scraping
Note: Make sure to add the configuration to config.py first using the templates above.
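Internally, the --config name is resolved against the CONFIGS dictionary in config.py. A hypothetical sketch of that lookup, not the actual implementation in main.py (the import path and dictionary access are assumptions):

# Hypothetical sketch of resolving --config / --list; see main.py for the real logic
import argparse
from config import CONFIGS  # assumption: config.py exposes the CONFIGS dict

parser = argparse.ArgumentParser()
parser.add_argument("--config", help="name of a configuration defined in CONFIGS")
parser.add_argument("--list", action="store_true", help="list available configurations")
args = parser.parse_args()

if args.list:
    print("\n".join(sorted(CONFIGS)))
elif args.config in CONFIGS:
    cfg = CONFIGS[args.config]
    print(f"Crawling {cfg['BASE_URL']} with selector {cfg['CSS_SELECTOR']}")
else:
    parser.error(f"unknown configuration: {args.config}")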
The crawler generates two CSV files:
items.csv: Contains all scraped items
complete_items.csv: Contains only items with all required fields
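Because the output is plain CSV, it can be loaded straight into your analysis tools. A minimal sketch, assuming pandas is installed (file names match the output described above):

# Sketch: load the crawler output for analysis (assumes pandas is installed)
import pandas as pd

items = pd.read_csv("items.csv")              # every scraped item
complete = pd.read_csv("complete_items.csv")  # only items with all required fields
print(f"{len(complete)} of {len(items)} items have every required field")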
To create a new configuration, add an entry to the CONFIGS dictionary in config.py. Example:
"my_config": {
**DEFAULT_CONFIG, # Inherit from default
"BASE_URL": "https://example.com",
"CSS_SELECTOR": "div.item",
"REQUIRED_KEYS": ["name", "price"],
"CRAWLER_CONFIG": {
**DEFAULT_CONFIG["CRAWLER_CONFIG"],
"MULTI_PAGE": False
}
}
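After adding the entry to the CONFIGS dictionary, run it by name just like the built-in configurations:
python main.py --config my_config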
If you encounter issues, start by running the test configuration to verify your setup, and check that GROQ_API_KEY is set correctly in your .env file.
See the LICENSE file for details.