flink-crawler

A continuous, scalable web crawler built on top of Flink and crawler-commons, with bits of code borrowed from Bixo.

The primary goals of flink-crawler are:

  • Continuous, meaning pages are always being fetched. This avoids the inefficiencies of a batch-oriented crawler such as Bixo or Nutch, where the time spent processing the “crawl frontier” (aka CrawlDB) in each loop grows until it dominates the total crawl time.
  • Scalable, meaning the crawler should work for small crawls of 100K pages up to big crawls that fetch billions of pages and track 100B+ links.
  • Focused, meaning the crawler can be tuned to focus on pages and domains with the highest value, thus improving the efficiency of the crawl.
  • Simple, meaning operationally it should be easy to set up and run a crawl, without requiring additional infrastructure beyond what’s needed for Flink.
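
To make the "focused" goal more concrete, here is a minimal sketch of a score-ordered URL frontier: URLs with a higher estimated value are handed out for fetching first, and duplicates are dropped. This is only an illustration of the general idea, not how flink-crawler implements it. The class `FocusedFrontier`, its methods, the example URLs, and the scores are all hypothetical.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical illustration only; none of these names come from flink-crawler.
class FocusedFrontier {

    // A URL paired with an estimated value score (higher means more valuable).
    static final class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Max-heap on score, so the most valuable URL is polled first.
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
            Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());

    // URLs already added, so each URL is queued at most once.
    private final Set<String> seen = new HashSet<>();

    // Queues the URL unless it was added before; returns true if it was queued.
    boolean add(String url, double score) {
        if (!seen.add(url)) {
            return false;
        }
        queue.add(new ScoredUrl(url, score));
        return true;
    }

    // Returns the highest-scoring queued URL, or null if the frontier is empty.
    String next() {
        ScoredUrl top = queue.poll();
        return top == null ? null : top.url;
    }

    public static void main(String[] args) {
        FocusedFrontier frontier = new FocusedFrontier();
        frontier.add("http://example.com/low", 0.2);
        frontier.add("http://example.com/high", 0.9);
        frontier.add("http://example.com/mid", 0.5);

        // URLs come out in descending score order.
        String url;
        while ((url = frontier.next()) != null) {
            System.out.println(url);
        }
    }
}
```

An in-memory queue like this would not scale to billions of pages; it only shows the ordering idea behind a focused crawl.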

See the Key Design Decisions page for more details.