flink crawler

Continuous scalable web crawler built on top of Flink and crawler-commons

Java

flink-crawler

A continuous scalable web crawler built on top of Flink and crawler-commons, with bits of code borrowed from bixo.

The primary goals of flink-crawler are:

Continuous, meaning pages are always being fetched. This avoids the inefficiencies of a batch-oriented crawler such as Bixo or Nutch, where the time spent processing the “crawl frontier” (aka CrawlDB) in each loop grows to where it winds up dominating the total time.
Scalable, meaning the crawler should work for small crawls of a 100K pages up to big crawls which fetch billions of pages and track 100B+ links.
Focused, meaning the crawler can be tuned to focus on pages and domains with the highest value, thus improving the efficiency of the crawl.
Simple, meaning operationally it should be easy to set up and run a crawl, without requiring additional infrastructure beyond what’s needed for Flink.

See the Key Design Decisions page for more details.