A simple Ruby web spider that uses Anemone to crawl every page of a site looking for email addresses. It stores the results in SQLite3 using DataMapper.
This web spider harvests any email addresses that it can find on the target web site. It stores the
harvested addresses in a SQLite database file. Each address is stored along with the site and the page
where it was harvested, and the time at which it was discovered.
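As a rough illustration of the harvesting step, here is a minimal sketch that pulls addresses out of a page body and pairs each one with its site, page, and discovery time. The EMAIL_PATTERN constant, the harvest_addresses helper, and the hash keys are hypothetical names chosen for this example; the pattern actually used by crawl.rb may differ.

```ruby
require 'uri'

# Deliberately simple pattern (an assumption for this sketch): it will miss
# some valid addresses and accept some invalid ones.
EMAIL_PATTERN = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/i

# Hypothetical helper: returns one hash per unique address found in the body.
def harvest_addresses(page_url, body, found_at = Time.now)
  host = URI.parse(page_url).host
  # Lowercase before de-duplicating so Bob@Example.com and bob@example.com
  # count as the same address.
  body.scan(EMAIL_PATTERN).map(&:downcase).uniq.map do |email|
    { :email => email, :site => host, :page => page_url, :found_at => found_at }
  end
end
```

Each returned hash carries the same four pieces of information described above, so it could be handed to whatever storage layer is in use.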
I recommend using RVM to set up the Ruby 1.9.2 environment and a gemset. An .rvmrc file is included.
Then bundle the gems with: bundle install
Invoke the spider with:
ruby crawl.rb http://target.com
The spider will display the URL of each page as it crawls the web site. It will write out a pages.pstore
file for keeping track of the pages that it has crawled, and a data.db file for storing harvested
addresses.
To export the addresses from the database, use the “export” Rake task:
rake export
You should see output like this:
[~/projects/email_spider] rake export
31 addresses exported to addresses.csv
Each row in the exported data contains the email address, the date/time that it was collected, the host
of the site where it was collected, and the URL for the specific page where it was found.
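The row layout described above can be sketched with Ruby's standard CSV library. This is only an illustration of the column order, not the Rake task's actual code: the addresses_to_csv helper, the hash keys, and the sample record are assumptions made for this example.

```ruby
require 'csv'

# Hypothetical helper: turns address records into CSV text, one row per
# address, in the order email, collected-at, host, page URL.
def addresses_to_csv(records)
  CSV.generate do |csv|
    records.each do |r|
      csv << [r[:email], r[:found_at], r[:site], r[:page]]
    end
  end
end

# Made-up sample record, for illustration only.
sample = [
  { :email => 'bob@example.com', :found_at => '2011-05-01T12:00:00Z',
    :site => 'target.com', :page => 'http://target.com/contact' }
]
puts addresses_to_csv(sample)
```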