anemone

Anemone web-spider framework

1612

323

Ruby

= Anemone

Anemone is a web spider framework that can spider a domain and collect useful
information about the pages it visits. It is versatile, allowing you to
write your own specialized spider tasks quickly and easily.

See http://anemone.rubyforge.org for more information.

== Features

Multi-threaded design for high performance
Tracks 301 HTTP redirects
Built-in BFS algorithm for determining page depth
Allows exclusion of URLs based on regular expressions
Choose the links to follow on each page with focus_crawl()
HTTPS support
Records response time for each page
CLI program can list all pages in a domain, calculate page depths, and more
Obey robots.txt
In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis

== Examples
See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.

== Requirements

nokogiri
robots

== Development
To test and develop this gem, additional requirements are:

rspec
fakeweb
tokyocabinet
kyotocabinet-ruby
mongo
redis
sqlite3

You will need to have KyotoCabinet, {Tokyo Cabinet}[http://fallabs.com/tokyocabinet/], {MongoDB}[http://www.mongodb.org/], and {Redis}[http://code.google.com/p/redis/] installed on your system and running.