A Ruby DSL for structured web crawling, with a robust caching system.
Sinew is a Ruby library for collecting data from web sites (scraping). Though small, this project is the culmination of years of effort based on crawling systems built at several different companies. Sinew has been used to crawl millions of websites.
```sh
# install gem
$ gem install sinew

# or add to your Gemfile:
gem 'sinew'
```
Breaking change

We are pleased to announce the release of Sinew 4. The Sinew DSL exposes a single `sinew` method in lieu of the many methods exposed in Sinew 3. Because of this single entry point, Sinew is now much easier to embed in other applications. Also, each Sinew 4 request returns a full Response object to facilitate parallelism.
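Here is a hypothetical sketch of what that might look like. `Sinew::Base.new` is an assumption, not a documented constructor, and the thread safety of a shared instance is not specified here; check the Sinew docs before relying on this:

```ruby
require "sinew"

# hypothetical entry point - Sinew::Base.new is an assumption, not
# a documented constructor
sinew = Sinew::Base.new

# because each request returns a full Response object, requests can be
# issued from separate threads and joined afterwards
urls = ["https://httpbingo.org/get", "https://httpbingo.org/html"]
threads = urls.map { |url| Thread.new { sinew.get(url) } }
threads.map(&:value).each { |response| puts response.url }
```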
Sinew uses the Faraday HTTP client with the httpdisk middleware for aggressive caching of responses.
Here’s an example for collecting the links from httpbingo.org. Paste this into a file called `sample.sinew` and run `sinew sample.sinew`. It will create a `sample.csv` file containing the href and text for each link:
```ruby
# get the url
response = sinew.get "https://httpbingo.org"

# use nokogiri to collect links
response.noko.css("ul li a").each do |a|
  row = {}
  row[:url] = a[:href]
  row[:title] = a.text

  # append a row to the csv
  sinew.csv_emit(row)
end
```
Sinew provides three main features: recipes, CSV output, and aggressive response caching.
Sinew uses recipe files to crawl web sites. Recipes have the `.sinew` extension, but they are plain old Ruby. Here’s a trivial example that calls `get` to make an HTTP GET request:

```ruby
response = sinew.get "https://www.google.com/search?q=darwin"

# query params can also be passed separately
response = sinew.get "https://www.google.com/search", q: "charles darwin"
```
Once you’ve done a `get`, you can access the document in a few different formats. In general, it’s easiest to use `noko` to automatically parse and interact with HTML results. If Nokogiri isn’t appropriate, fall back to regular expressions run against `body` or `html`. Use `json` if you are expecting a JSON response.
```ruby
response = sinew.get "https://www.google.com/search?q=darwin"

# pull out the links with nokogiri
links = response.noko.css("a").map { _1[:href] }
puts links.inspect

# or, use a regex to grab the first link
link = response.html[/<a[^>]+href="([^"]+)/, 1]
puts link.inspect
```
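If you are expecting JSON, `json` parses the body with symbolized keys. A small sketch against httpbingo.org, whose `/json` endpoint serves a sample document (the `:slideshow` key is part of that sample payload, not of Sinew):

```ruby
response = sinew.get "https://httpbingo.org/json"

# response.json is a Hash with symbolized keys
puts response.json[:slideshow][:title]
```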
Recipes output CSV files. To continue the example above:

```ruby
response = sinew.get "https://www.google.com/search?q=darwin"
response.noko.css("a").each do |a|
  row = {}
  row[:href] = a[:href]
  row[:text] = a.text
  sinew.csv_emit row
end
```
Sinew creates a CSV file with the same name as the recipe, and `csv_emit(hash)` appends a row. The values of your hash are cleaned up and converted to strings.
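To pin the column order up front, `csv_header` can be called before the first emit. A minimal sketch; whether it takes the column names directly or as an array is an assumption here:

```ruby
# declare the CSV columns explicitly (otherwise Sinew infers them
# from the keys of the first csv_emit call)
sinew.csv_header(:href, :text)

response = sinew.get "https://httpbingo.org"
response.noko.css("a").each do |a|
  sinew.csv_emit(href: a[:href], text: a.text)
end
```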
Sinew uses httpdisk to aggressively cache all HTTP responses to disk in `~/.sinew`. Error responses are cached as well. Each URL will be hit exactly once, and requests are rate limited to one per second. Sinew tries to be polite.

Sinew never deletes files from the cache - that’s up to you! Sinew has various command line options to refresh the cache. See `--expires`, `--force` and `--force-errors`.
Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you work on your recipe.
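You can also inspect or prune the cache from inside a recipe using the helpers documented in the reference below. A short sketch, assuming the HTTP method is passed as a symbol:

```ruby
url = "https://httpbingo.org"

# drop the cached response, if any, to force a fresh fetch
sinew.uncache(:get, url) if sinew.cached?(:get, url)

response = sinew.get url
```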
The `sinew` command line has many useful options. You will be using this command many times as you iterate on your recipe:
```
$ sinew --help
Usage: sinew [options] [recipe]
    -l, --limit         quit after emitting this many rows
        --proxy         use host[:port] as HTTP proxy
        --timeout       maximum time allowed for the transfer
    -s, --silent        suppress some output
    -v, --verbose       dump emitted rows while running

From httpdisk:
        --dir           set custom cache directory
        --expires       when to expire cached requests (ex: 1h, 2d, 3w)
        --force         don't read anything from cache (but still write)
        --force-errors  don't read errors from cache (but still write)
```
Sinew also has many runtime options that can be set in your recipe. For example:

```ruby
sinew.options[:headers] = { 'User-Agent' => 'xyz' }
...
```
Here is a reference for the rest of the `sinew` API, starting with the request methods:
- `sinew.get(url, params = nil, headers = nil)` - fetch a url with GET
- `sinew.post(url, body = nil, headers = nil)` - fetch a url with POST, using `form` as the URL encoded POST body
- `sinew.post_json(url, body = nil, headers = nil)` - fetch a url with POST, using `json` as the POST body

Each request method returns a `Sinew::Response`. The response has several helpers to make parsing easier:

- `body` - the raw body
- `html` - like `body`, but with a handful of HTML-specific whitespace cleanups
- `noko` - parse as HTML and return a Nokogiri document
- `xml` - parse as XML and return a Nokogiri document
- `json` - parse as JSON, with symbolized keys
- `mash` - parse as JSON and return a Hashie::Mash
- `url` - the url of the request. If the request goes through a redirect, `url` will reflect the final url.

The CSV helpers:

- `sinew.csv_header(columns)` - specify the columns for CSV output. If you don’t call this, Sinew will use the keys from the first call to `sinew.csv_emit`.
- `sinew.csv_emit(hash)` - append a row to the CSV file

Sinew has some advanced helpers for checking the httpdisk cache. For the following methods, `body` hashes default to form body type.
- `sinew.cached?(method, url, params = nil, body = nil)` - check if a request is cached
- `sinew.uncache(method, url, params = nil, body = nil)` - remove the cache file, if any
- `sinew.status(method, url, params = nil, body = nil)` - get the httpdisk status

Plus some caching helpers in `Sinew::Response`:

- `diskpath` - the location on disk for the cached httpdisk response
- `uncache` - remove the cache file for this response

Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won’t have to re-fetch the data. Here are some hints for writing idiomatic recipes:
- Running `sinew` in the console is your friend.
- When Nokogiri isn’t a good fit, fall back to regular expressions against `body` or `html`. `html` is probably your best bet. `body` is good for crawling Javascript, but it’s fragile if the site changes.
- Use `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew (see the sketch after this list).
- Nokogiri calls chain nicely for quick extraction:

  ```ruby
  noko.css("table")[4].css("td").select do
    _1[:width].to_i > 80
  end.map(&:text)
  ```

- Debug with `puts`, or better yet use `ap` from amazing_print.
- Run `sinew -v` to get a report on every `csv_emit`. Very handy.
- Sinew can rotate between multiple HTTP proxies (`--proxy host1,host2,...`).
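As a tiny illustration of the `String#[regexp]` hint above: with a capture group and an index, the operator returns just the captured text, or nil when nothing matches.

```ruby
html = '<a href="https://example.com">Example</a>'

# String#[regexp, 1] returns the first capture group, or nil on no match
href = html[/href="([^"]+)"/, 1]
puts href # => https://example.com
```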
Highlights from earlier releases:

- Added `code`, a peer to `uri`, `raw`, etc.
- Added `--limit`, `--proxy` and the `xml` variable
- Removed `head` files from Sinew 1…
Sinew is licensed under the MIT License.