Top Python Frameworks & Libraries for Web Crawling

scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

57414

10944

Python

gaojiuli/gain

Web crawling framework based on asyncio.

2007

213

Python

ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

1495

146

Python

fcavallarin/htcap

htcap is a web application scanner able to crawl single-page applications (SPAs) recursively by intercepting AJAX calls and DOM changes....

618

112

Python

essandess/isp-data-pollution

ISP Data Pollution to Protect Private Browsing History with Obfuscation

606

Python

kootenpv/sky

🌅 Next-generation web crawling using machine intelligence

331

Python

rivermont/spidy

The simple, easy-to-use command-line web crawler.

346

Python

alephdata/memorious

Lightweight web scraping toolkit for documents and structured data.

311

Python

alex-miller-0/Tor_Crawler

Web crawling with IP rotation via Tor

194

Python

TeamHG-Memex/arachnado

Web Crawling UI and HTTP API, based on Scrapy and Tornado

148

Python

voliveirajr/seleniumcrawler

An example using Selenium WebDriver for Python and the Scrapy framework to create a web scraper to crawl an ASP site...

123

Python

my8100/scrapyd-cluster-on-heroku

Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks. DEMO 👉...

115

Python

cloudviz/agentless-system-crawler

A tool to crawl systems like crawlers for the web

106

Python

cpatrickalves/scraping-ebay

Scraping eBay's products using the Scrapy web crawling framework

Python

amitupreti/Hands-on-WebScraping

This repo is part of a blog series on several web scraping projects where we will explore scraping techniques to crawl data from simple websites to websites using...

Python

mentatpsi/OSGenome

An open source web application for genetic data (SNPs) using 23andMe and data crawling technologies

Python

datawizard1337/ARGUS

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the web...

Python

amineHorseman/images-web-crawler

This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). It can crawl the web,...

Python

dongweiming/daenerys

Scraping and Web Crawling Framework For Zhihu Live

Python

estin/pomp

Screen scraping and web crawling framework

Python

intohole/xspider

Easily crawl web resources and extract web information; a simple crawler framework

Python

LeiShi1313/serverless-web-differ

A serverless web browser which crawls websites and compares pages by schedule.

Python

internetarchive/umbra

A queue-controlled browser automation tool for improving web crawl quality

Python

CrawlScript/WebCollector-Python

WebCollector-Python is an open source web crawler framework based on Python. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded...

Python

grantwilliams/wg-gesucht-crawler-cli

Python web crawler / scraper for WG-Gesucht. Crawls the WG-Gesucht site for new apartment listings and sends a message to the poster, based on your saved filters a...

Python

guptachetan1997/crawling-projects

Web scraping and automation using Python

Python

fugary/calibre-douban

New Douban metadata source plugin for Calibre. Douban no longer provides book APIs to the public, so web crawling is the only way to obtain data. This is a calibre Dou...

1155

Python

seleniumbase/SeleniumBase

Python APIs for web automation, testing, and bypassing bot-detection.

11299

1380

Python

apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG,...

5774

394

Python

adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML...

4436

298

Python

lorien/grab

Web Scraping Framework

2404

275

Python

elliotgao2/gain

Web crawling framework based on asyncio.

2033

206

Python

SkywalkerDarren/chatWeb

ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points....

905

135

Python

Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

398

Python

paulpierre/markdown-crawler

A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page, designed for LLM RAG...

374

Python

scrapfly/scrapfly-scrapers

Scalable Python web scraping scripts for 40+ popular domains

471

119

Python

D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!...

5471

302

Python

cxcscmu/Crawl4LLM

Official repository for "Crawl4LLM: Efficient Web Crawling for LLM Pretraining"

460

Python

cxcscmu/Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

612

Python

langmanus/langmanus

A community-driven AI automation framework that builds upon the incredible work of the open source community. Our goal is to combine language models with specializ...

5365

569

Python

coleam00/mcp-crawl4ai-rag

Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants

1167

389

Python