groqcrawl

GroqCrawl is a user-friendly web crawling and scraping application built with Streamlit and powered by PocketGroq. It provides an intuitive interface for extracting LLM-friendly content from websites, with support for single-page scraping, multi-page crawling, and site mapping.

jgravelle

Python

GroqCrawl: Advanced Web Crawling and Scraping with Streamlit and PocketGroq

Introduction
Features
Installation
Usage
Advanced Options
Output Formats
Examples
Troubleshooting
Contributing
License

Introduction

Whether you’re a data scientist, researcher, or web developer, GroqCrawl offers a seamless experience for gathering web data in various formats, including Markdown, HTML, and structured data.

Features

Single URL Scraping: Extract content from individual web pages.
Website Crawling: Traverse multiple pages of a website, respecting depth and page limits.
Site Mapping: Generate a list of all accessible URLs within a website.
Multiple Output Formats: Choose from Markdown, HTML, and structured data representations.
Advanced Crawling Options: Customize your crawl with exclude/include paths, depth limits, and more.
Interactive Results Display: View scraped content directly in the Streamlit interface.
Download Options: Save your results as JSON files for further processing.

Installation

Ensure you have Python 3.7 or later installed on your system.

Clone the GroqCrawl repository:

```
git clone https://github.com/yourusername/groqcrawl.git
cd groqcrawl
```

Install the required dependencies:
```
pip install -r requirements.txt
```
Set up the Groq API key that PocketGroq uses:
- Create a .env file in the project root directory.
- Add your API key to the file: GROQ_API_KEY=your_api_key_here
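
To check the key before launching the app, the .env format above can be parsed with a few lines of standard-library Python. This is a minimal sketch, not GroqCrawl's own loading code (the app may rely on a helper library instead); `parse_env` and `require_api_key` are hypothetical names used only for illustration.

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines (the .env format) into a dict."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env):
    """Return GROQ_API_KEY from the mapping, or raise a clear error."""
    key = env.get("GROQ_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GROQ_API_KEY is missing or still the placeholder")
    return key


def load_env_file(path=".env"):
    """Read a .env file from disk and parse it."""
    return parse_env(Path(path).read_text(encoding="utf-8"))
```

Calling `require_api_key(load_env_file())` from the project root raises an error if the key is absent or was left as the placeholder value.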

Usage

To run GroqCrawl:

Navigate to the project directory:
```
cd path/to/groqcrawl
```
Launch the Streamlit app:
```
streamlit run groqcrawl.py
```
Open your web browser and go to the URL displayed in the terminal (usually http://localhost:8501).
Use the interface to select your scraping type, enter a URL, and configure options.
Click “Run” to start the scraping/crawling process.

Advanced Options

Max Depth: Set the maximum depth for crawling (Crawl mode only).
Max Pages: Limit the total number of pages to crawl (Crawl mode only).
Exclude Paths: Specify URL patterns to exclude from crawling.
Include Only Paths: Limit crawling to specific URL patterns.
Ignore Sitemap: Do not use the site’s sitemap.xml to discover pages.
Allow Backwards Links: Enable crawling of links that point to previously visited pages.
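
To make the interaction between these options concrete, here is a sketch of a breadth-first crawl bounded by Max Depth and Max Pages, with include/exclude filtering. It is an illustration under stated assumptions, not GroqCrawl's actual implementation: `crawl`, `path_allowed`, and the `get_links` callback are hypothetical names, and paths are matched as simple prefixes, whereas the app may interpret patterns differently.

```python
from collections import deque
from urllib.parse import urlparse


def path_allowed(url, include_paths=(), exclude_paths=()):
    """Apply exclude, then include-only, prefix filters to the URL path."""
    path = urlparse(url).path or "/"
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return False
    if include_paths and not any(path.startswith(p) for p in include_paths):
        return False
    return True


def crawl(start_url, get_links, max_depth=2, max_pages=10,
          include_paths=(), exclude_paths=()):
    """Breadth-first crawl bounded by depth and total page count.

    get_links(url) is supplied by the caller and returns the URLs
    linked from that page. The start URL itself is never filtered.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in get_links(url):
            if link not in seen and path_allowed(link, include_paths, exclude_paths):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory "site" so the sketch runs without network access.
SITE = {
    "https://example.com/": [
        "https://example.com/docs",
        "https://example.com/blog",
        "https://example.com/admin",
    ],
    "https://example.com/docs": ["https://example.com/docs/intro", "https://example.com/"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
}


def fake_links(url):
    return SITE.get(url, [])


print(crawl("https://example.com/", fake_links, max_depth=1, exclude_paths=("/admin",)))
```

Lowering `max_depth` or `max_pages` shrinks the result in the way the Slow Performance tip suggests: depth limits how far from the start page links are followed, while the page limit caps the total regardless of depth.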

Output Formats

Markdown:
- Human-readable text format.
- Ideal for content analysis and easy viewing.
HTML:
- Raw HTML content of the page.
- Useful for detailed structure analysis or further processing.
Structured Data:
- JSON format containing:
  - Full text content
  - Headings (h1 to h6)
  - Links (text and href)
  - Images (src and alt attributes)
  - JSON-LD data (if available)
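
As an illustration of what the structured representation contains, the following sketch extracts the same five kinds of data from an HTML string using only the standard library. It is not GroqCrawl's extraction code, and the key names (`text`, `headings`, `links`, `images`, `json_ld`) and the `extract_structured` function are assumptions; the app's actual JSON keys may differ.

```python
import json
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructuredExtractor(HTMLParser):
    """Collect text, headings, links, images, and JSON-LD from HTML."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.headings = []
        self.links = []
        self.images = []
        self.json_ld = []
        self._heading = None   # (tag, [text parts]) while inside a heading
        self._link = None      # (href, [text parts]) while inside a link
        self._raw_tag = None   # "script" or "style" while inside one
        self._ld_parts = None  # chunks collected inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADING_TAGS:
            self._heading = (tag, [])
        elif tag == "a" and attrs.get("href"):
            self._link = (attrs["href"], [])
        elif tag == "img":
            self.images.append({"src": attrs.get("src") or "",
                                "alt": attrs.get("alt") or ""})
        elif tag in ("script", "style"):
            self._raw_tag = tag
            if tag == "script" and attrs.get("type") == "application/ld+json":
                self._ld_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading:
            level, parts = self._heading
            self.headings.append({"level": level, "text": " ".join(parts)})
            self._heading = None
        elif tag == "a" and self._link:
            href, parts = self._link
            self.links.append({"text": " ".join(parts), "href": href})
            self._link = None
        elif tag == self._raw_tag:
            if self._ld_parts is not None:
                try:
                    self.json_ld.append(json.loads("".join(self._ld_parts)))
                except json.JSONDecodeError:
                    pass  # ignore malformed JSON-LD
                self._ld_parts = None
            self._raw_tag = None

    def handle_data(self, data):
        if self._raw_tag:
            # Script/style bodies are not page text; keep only JSON-LD.
            if self._ld_parts is not None:
                self._ld_parts.append(data)
            return
        text = data.strip()
        if not text:
            return
        self.text_parts.append(text)
        if self._heading:
            self._heading[1].append(text)
        if self._link:
            self._link[1].append(text)


def extract_structured(html):
    """Return a JSON-serializable dict describing the page."""
    parser = StructuredExtractor()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "headings": parser.headings,
        "links": parser.links,
        "images": parser.images,
        "json_ld": parser.json_ld,
    }
```

The returned dict can be passed straight to `json.dumps` for saving or further processing.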

Examples

Single URL Scraping

Select “Single URL (/scrape)” from the radio buttons.
Enter a URL, e.g., https://example.com.
Choose desired output formats.
Click “Run”.

Website Crawling

Select “Crawl (/crawl)” from the radio buttons.
Enter the starting URL, e.g., https://example.com.
Set Max Depth and Max Pages in the Options section.
Choose desired output formats.
Click “Run”.

Site Mapping

Select “Map (/map)” from the radio buttons.
Enter the website URL, e.g., https://example.com.
Click “Run”.
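
For intuition about what a site map contains, the sketch below collects the same-site URLs linked from a single page: it resolves relative links, drops fragments, and discards links to other hosts. This is an assumption about how mapping could work, not a description of GroqCrawl's internals; `LinkCollector` and `same_site_urls` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Record the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_urls(page_url, html):
    """Return the sorted, de-duplicated same-host URLs linked from a page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    host = urlparse(page_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links, then strip any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            found.add(absolute)
    return sorted(found)
```

Applying this to each newly discovered URL in turn, until no new URLs appear, yields a list of the pages reachable by links from the starting URL.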

Troubleshooting

API Key Issues: Ensure your API key is correctly set as GROQ_API_KEY in the .env file.
Connection Errors: Check your internet connection and verify the URL is accessible.
Slow Performance: For large websites, try reducing Max Depth or Max Pages.
Missing Content: Some websites block scraping. Check the site’s robots.txt file and respect its scraping policies.
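
One way to check a robots.txt policy by hand is Python's standard `urllib.robotparser` module. The sketch below evaluates rules supplied as a string, so it runs without network access; the `is_allowed` helper and the sample rules are illustrative assumptions, and nothing here implies that GroqCrawl performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
RULES = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(RULES, "https://example.com/public"))        # permitted path
print(is_allowed(RULES, "https://example.com/private/data"))  # disallowed path
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()` to download the file before calling `can_fetch()`.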

Contributing

We welcome contributions to GroqCrawl! Please follow these steps:

Fork the repository.
Create a new branch for your feature or bug fix.
Make your changes and commit them with clear, descriptive messages.
Push your changes to your fork.
Submit a pull request with a detailed description of your changes.

License

GroqCrawl is released under the MIT License. See the LICENSE file for details.

For more information or support, please open an issue on the GitHub repository.

Happy crawling!