Web-crawling framework for Ruby
Makuri is a Web-crawling framework for Ruby.
Add this line to your application’s Gemfile:
gem 'makuri'
And then execute:
$ bundle
Or install it yourself:
$ gem install makuri
This example crawls the humor tag on quotes.toscrape.com, follows the pagination links, and prints each quote as JSON:
# quotes_spider.rb
require 'json' # for Hash#to_json
require 'makuri'

class QuotesSpider
  include Makuri::Spider

  start_urls ['https://quotes.toscrape.com/tag/humor/']

  # Extract every quote on the page, then follow the "next" link if present.
  def parse
    response.css('div.quote').each { |quote| extract(quote) }

    next_page = response.at_css('li.next>a')
    request_to :parse, url: next_page[:href] unless next_page.nil?
  end

  # Print one quote as a single line of JSON.
  def extract(quote)
    item = {
      author: quote.at_css('span>small').text,
      text: quote.at_css('span.text').text
    }
    puts item.to_json
  end
end

QuotesSpider.run
Save the file as quotes_spider.rb and run it:
$ ruby quotes_spider.rb > quotes.json
When it’s done, you will find all the quotes saved to quotes.json. It’s that easy.
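Because the spider calls puts item.to_json once per quote, quotes.json holds one JSON object per line rather than a single JSON array (assuming nothing else is written to standard output). A minimal sketch of reading such a file back with Ruby's standard library follows. The parse_json_lines helper is hypothetical and not part of Makuri, and the sample rows are invented stand-ins for real output.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper (not part of Makuri): parses one JSON object per line
# and skips blank lines.
def parse_json_lines(io)
  io.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

# Stand-in for File.open('quotes.json'); these rows are made up.
sample = StringIO.new(<<~JSONL)
  {"author":"Author A","text":"First quote"}
  {"author":"Author B","text":"Second quote"}
JSONL

quotes = parse_json_lines(sample)
quotes.each { |quote| puts "#{quote['author']}: #{quote['text']}" }
```

To read the real file, pass File.open('quotes.json') in place of the StringIO.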
Next, let’s scrape a JavaScript-rendered site that loads its content dynamically through infinite scroll:
# infinite_scroll_spider.rb
require 'makuri'

class InfiniteScrollSpider
  include Makuri::Spider

  spider_options engine: :ferrum, headless: true
  start_urls ['https://infinite-scroll.com/demo/full-page/']

  def parse
    post_title_xpath = '//article/h2'
    count = response.xpath(post_title_xpath).count
    current_response = nil

    loop do
      # Scroll down, then give the page time to load more posts.
      browser.page.execute('window.scrollBy(0,10000)')
      sleep 2

      current_response = browser.current_response
      new_count = current_response.xpath(post_title_xpath).count

      # Stop once scrolling no longer adds posts.
      if count == new_count
        logger.info '> Pagination is done'
        break
      end

      count = new_count
      logger.info "> Continue scrolling, current count is #{count}..."
    end

    posts_headers = current_response.xpath(post_title_xpath).map(&:text)
    logger.info "> All post titles: #{posts_headers.join('; ')}"
  end
end
Save the file as infinite_scroll_spider.rb and run it:
$ ruby infinite_scroll_spider.rb
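The stop condition in the spider above (keep scrolling until the post count stops growing) can be isolated and tried without a browser. The sketch below is plain Ruby under stated assumptions: scroll_until_stable and its max_rounds guard are hypothetical names that are not part of Makuri, and the two lambdas simulate a page that reveals 10 more posts per scroll, up to 35.

```ruby
# Hypothetical helper (not part of Makuri): calls `scroll` until `count`
# returns the same value twice in a row, or until max_rounds is reached.
def scroll_until_stable(count:, scroll:, max_rounds: 50)
  current = count.call
  max_rounds.times do
    scroll.call
    latest = count.call
    return latest if latest == current

    current = latest
  end
  current
end

# Simulated page: starts with 10 posts, each scroll adds 10, capped at 35.
loaded = 10
total = scroll_until_stable(
  count: -> { loaded },
  scroll: -> { loaded = [loaded + 10, 35].min }
)
puts total
```

In the spider, the count lambda would wrap browser.current_response.xpath(post_title_xpath).count and the scroll lambda would wrap the browser.page.execute call plus the sleep. The max_rounds cap is an added safeguard for pages that never stop loading; the original loop has no such limit.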
View the changelog
Everyone is encouraged to help improve this project.
To get started with development:
$ git clone https://github.com/lalusaud/makuri.git
$ cd makuri
$ bundle install
$ bundle exec rake test