📰 Build RSS 2.0 feeds from websites (and JSON APIs) with a few CSS selectors.
This Ruby gem builds RSS 2.0 feeds from a feed config.
The feed config contains the URL to scrape and the CSS selectors used to extract information (such as title, URL, …).
Extractors and chainable post processors make information extraction, processing and sanitizing a breeze.
Scraping JSON responses and setting HTTP request headers are supported, too.
Searching for a ready-to-use app which serves generated feeds via HTTP? Head over to html2rss-web!
To support the development, feel free to sponsor this project on GitHub. Thank you!
Install | gem install html2rss |
---|---|
Usage | html2rss help |
You can also install it as a dependency in your Ruby project:
🤩 Like it? | Star it! ⭐️ |
---|---|
Add this line to your Gemfile: | gem 'html2rss' |
Then execute: | bundle |
In your code: | require 'html2rss' |
Create a file called my_config_file.yml
with this example content:
channel:
  url: https://stackoverflow.com/questions
selectors:
  items:
    selector: "#hot-network-questions > ul > li"
  title:
    selector: a
  link:
    selector: a
    extractor: href
Build the RSS with: html2rss feed ./my_config_file.yml
Here's a minimal working example within Ruby:
require 'html2rss'
rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )
puts rss
A feed config consists of a channel and a selectors Hash.
The contents of both hashes are explained in the chapters below.
Good to know:
- Example feed configs are available at spec/*.test.yml.
- See html2rss-configs for ready-made feed configs!
- Contribute to html2rss-configs to make your config available to the general public.
Alright, let's move on.
channel
attribute | required | type | default | remark |
---|---|---|---|---|
url | required | String | | |
title | optional | String | auto-generated | |
description | optional | String | auto-generated | |
ttl | optional | Integer | 360 | TTL in minutes |
time_zone | optional | String | 'UTC' | TimeZone name |
language | optional | String | 'en' | Language code |
author | optional | String | | Format: email (Name) |
headers | optional | Hash | {} | Set HTTP request headers. See notes below. |
json | optional | Boolean | false | Handle JSON response. See notes below. |
Dynamic parameters in channel attributes
Sometimes there are structurally equal pages with different URLs. In such a case, you can add dynamic parameters to the channel's attributes.
Example of a dynamic id parameter in the channel URL:
channel:
  url: "http://domainname.tld/whatever/%<id>s.html"
Command line usage example:
bundle exec html2rss feed the_feed_config.yml id=42
Programmatic usage example:
config = Html2rss::Config.new(
  { channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } },
  {},
  { id: 42 }
)
Html2rss.feed(config)
See the more complex formatting of the sprintf method for formatting options.
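
To get a feel for the `%<id>s` placeholder syntax, it can be tried out with Ruby's own `Kernel#format`. This is only a sketch of the formatting rules, not of how html2rss applies them internally; the second URL and its `page` parameter are made up for illustration.

```ruby
# Template taken from the channel example above.
url_template = 'http://domainname.tld/whatever/%<id>s.html'

# Kernel#format fills %<name>s references from a Hash with Symbol keys.
url = format(url_template, id: 42)
puts url

# Hypothetical second template: %<name>d formats an Integer, and a
# zero-padding flag plus a width may follow the name.
padded = format('http://domainname.tld/page/%<page>03d.html', page: 7)
puts padded
```

A parameter that is referenced in the template but missing from the Hash makes `format` raise a `KeyError`.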
selectors
First, you must give an items selector hash, which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built.
Except for the items selector, all other keys are scoped to each item of the collection.
Then, to build a valid RSS 2.0 item, you need at least a title or a description. You can have both.
Having an items and a title selector is already enough to build a simple feed.
Your selectors Hash can contain arbitrarily named selectors, but only a few will make it into the RSS feed (this is due to the RSS 2.0 specification):
RSS 2.0 tag | name in html2rss | remark |
---|---|---|
title | title | |
description | description | Supports HTML. |
link | link | A URL. |
author | author | |
category | categories | See notes below. |
guid | guid | Default title/description. See notes below. |
enclosure | enclosure | See notes below. |
pubDate | updated | An instance of Time. |
comments | comments | A URL. |
source | | Not yet supported. |
selector hash
Every named selector in your selectors hash can have these attributes:
name | value |
---|---|
selector | The CSS selector to select the tag with the information. |
extractor | Name of the extractor. See notes below. |
post_process | A hash or array of hashes. See notes below. |
Extractors help with extracting the information from the selected HTML tag.
- The default extractor is text, which returns the tag's inner text.
- The html extractor returns the tag's outer HTML.
- The href extractor returns a URL from the tag's href attribute and corrects relative ones to absolute ones.
- The attribute extractor returns the value of that tag's attribute.
- The static extractor returns the configured static value (it doesn't extract anything).
Extractors might need extra attributes on the selector hash.
Read their docs for usage examples.
Html2rss.feed(
  channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
)

channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: 'a'
    extractor: 'href'
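
The relative-to-absolute correction that the href extractor performs can be pictured with Ruby's standard library. This is a minimal sketch under assumptions: `absolute_url` is a hypothetical helper, not part of html2rss, and the gem's actual implementation may differ.

```ruby
require 'uri'

# Hypothetical helper: resolves an href value against the channel URL.
# Already-absolute URLs pass through unchanged.
def absolute_url(channel_url, href)
  URI.join(channel_url, href).to_s
end

channel_url = 'https://stackoverflow.com/questions'

puts absolute_url(channel_url, '/questions/123')
puts absolute_url(channel_url, 'https://example.com/a')
```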
Extracted information can be further manipulated with post processors.
name | description |
---|---|
gsub | Allows global substitution operations on Strings (Regexp or simple pattern). |
html_to_markdown | Converts HTML to Markdown, using reverse_markdown. |
markdown_to_html | Converts Markdown to HTML, using kramdown. |
parse_time | Parses a String containing a time in a time zone. |
parse_uri | Parses a String as URL. |
sanitize_html | Strips unsafe and unneeded HTML and adds security-related attributes. |
substring | Cuts a part off of a String, starting at a position. |
template | Based on a template, it creates a new String filled with other selectors' values. |
⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️
Read their docs for usage examples.
Html2rss.feed(
  channel: {},
  selectors: {
    description: {
      selector: '.content', post_process: { name: 'sanitize_html' }
    }
  }
)

channel:
  # ... omitted
selectors:
  # ... omitted
  description:
    selector: '.content'
    post_process:
      - name: sanitize_html
Pass an array to post_process to chain the post processors.
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}
          Price: %{price}
      - name: markdown_to_html
Note the use of | for a multi-line String in YAML.
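
The `%{self}` and `%{price}` placeholders follow Ruby's format-string syntax, so the substitution step of the template post processor can be approximated with `Kernel#format`. The sketch below makes two assumptions: that `self` stands for the value of the selector being processed, and that the sample values are what the selectors extracted. It does not run the subsequent markdown_to_html step.

```ruby
# Made-up sample values, as they might be extracted for one item.
item = { self: 'Section heading', price: '12.34 USD' }

# The template string from the YAML example above.
template = "# %{self}\nPrice: %{price}\n"

# %{name} references are replaced with the matching Hash values.
markdown = format(template, item)
puts markdown
```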
Adding <category> tags to an item
The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.
Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)

channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
By default, html2rss generates a GUID from the title or description.
If this does not work well, you can choose other attributes from which the GUID is built.
The principle is the same as for the categories: pass an array of selector names.
In all cases, the GUID is a SHA1-encoded string.
Html2rss.feed(
  channel: {},
  selectors: {
    title: {
      # ... omitted
      selector: 'h1'
    },
    link: { selector: 'a', extractor: 'href' },
    guid: %i[link]
  }
)

channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "h1"
  link:
    selector: "a"
    extractor: "href"
  guid:
    - link
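
To illustrate why a GUID built from the link stays stable while the title changes, here is a rough sketch using Ruby's `Digest::SHA1`. The helper `guid_for` is hypothetical, and how html2rss actually combines the selector values before hashing is an assumption (they are simply concatenated here).

```ruby
require 'digest'

# Hypothetical helper: concatenates the values of the chosen selectors
# and returns their SHA1 hex digest.
def guid_for(item, selector_names)
  Digest::SHA1.hexdigest(selector_names.map { |name| item.fetch(name) }.join)
end

original = { title: 'Headline', link: 'https://example.com/a' }
retitled = { title: 'Headline (updated)', link: 'https://example.com/a' }

# With guid: %i[link], a changed title does not change the GUID.
puts guid_for(original, %i[link])
puts guid_for(retitled, %i[link])
```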
Adding an <enclosure> tag to an item
An enclosure can be any file, e.g. an image, audio or video.
The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:
- The content type may default to application/octet-stream.
- The content length is stated as 0 bytes.
Read the RSS 2.0 spec for further information on enclosing content.
Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: { selector: 'img', extractor: 'attribute', attribute: 'src' }
  }
)

channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "img"
    extractor: "attribute"
    attribute: "src"
Although this gem's name is html2rss, it's possible to scrape and process JSON.
Adding json: true to the channel config will convert the JSON response to XML.
Html2rss.feed(
  channel: {
    url: 'https://example.com', json: true
  },
  selectors: {} # ... omitted
)

channel:
  url: https://example.com
  json: true
selectors:
  # ... omitted
This JSON object:
{
  "data": [{ "title": "Headline", "url": "https://example.com" }]
}
converts to:
<object>
  <data>
    <array>
      <object>
        <title>Headline</title>
        <url>https://example.com</url>
      </object>
    </array>
  </data>
</object>
Your items selector would be array > object, the item's link selector would be url.
This JSON array:
[{ "title": "Headline", "url": "https://example.com" }]
converts to:
<array>
  <object>
    <title>Headline</title>
    <url>https://example.com</url>
  </object>
</array>
Your items selector would be array > object, the item's link selector would be url.
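
The mapping shown in the two examples (a Hash becomes `<object>`, an Array becomes `<array>`, each key becomes a tag) can be reproduced with a short recursive function. This is a sketch of the mapping only: `json_to_xml` is a hypothetical name, it emits no whitespace, and it does not handle keys that are invalid XML tag names. The gem's own converter may work differently.

```ruby
require 'json'
require 'cgi'

# Hypothetical converter mirroring the mapping shown above:
# Hash -> <object>, Array -> <array>, each key -> a tag named after the key.
def json_to_xml(value)
  case value
  when Hash
    inner = value.map { |key, child| "<#{key}>#{json_to_xml(child)}</#{key}>" }.join
    "<object>#{inner}</object>"
  when Array
    "<array>#{value.map { |child| json_to_xml(child) }.join}</array>"
  else
    # Scalars become escaped text nodes.
    CGI.escapeHTML(value.to_s)
  end
end

json = '{ "data": [{ "title": "Headline", "url": "https://example.com" }] }'
xml = json_to_xml(JSON.parse(json))
puts xml
```

With the resulting structure in mind, CSS selectors such as `array > object` address the converted JSON just like HTML.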
You can add any HTTP headers to the request to the channel URL.
Use this, for example, to send Cookie or Authorization information or to spoof the User-Agent.
Html2rss.feed(
  channel: {
    url: 'https://example.com',
    headers: {
      'User-Agent': 'html2rss-request',
      'X-Something': 'Foobar',
      Authorization: 'Token deadbea7',
      Cookie: 'monster=MeWantCookie'
    }
  },
  selectors: {}
)

channel:
  url: https://example.com
  headers:
    "User-Agent": "html2rss-request"
    "X-Something": "Foobar"
    "Authorization": "Token deadbea7"
    "Cookie": "monster=MeWantCookie"
selectors:
  # ...
The headers provided by the channel are merged into the global headers.
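
The merge can be pictured with a plain `Hash#merge`. Note the assumption in this sketch: when both the global config and the channel define the same header, the channel's value is taken. The header values are made up, and the gem's behaviour on collisions should be verified before relying on it.

```ruby
# Made-up global and channel headers for illustration.
global_headers  = { 'User-Agent' => 'html2rss-request', 'Accept' => 'text/html' }
channel_headers = { 'User-Agent' => 'custom-agent', 'X-Something' => 'Foobar' }

# Assumption: on a key collision the channel's value wins.
merged = global_headers.merge(channel_headers)

merged.each { |name, value| puts "#{name}: #{value}" }
```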
By default, html2rss keeps the order of the collection returned from the items selector. The items selector hash can optionally contain an order attribute.
If its value is reverse, the order of items in the RSS will reverse.
channel:
  # ... omitted
selectors:
  items:
    selector: 'ul > li'
    order: 'reverse'
  # ... omitted
Note that the order of items, according to the RSS 2.0 spec, should not matter to the feed-consuming client.
This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!
First, create a YAML file, e.g. feeds.yml. This file will contain your global config and multiple feed configs under the key feeds.
Example:
headers:
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:
Your feed configs go below feeds. Everything else is part of the global config.
Find a full example of a feeds.yml at spec/feeds.test.yml.
Now you can build your feeds like this:
require 'html2rss'
myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
$ html2rss feed feeds.yml myfeed
$ html2rss feed feeds.yml myotherfeed
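
The split between global config and feed configs described above can be inspected with Ruby's YAML library alone. The following sketch does not call html2rss; it only shows which part of a feeds.yml-style document ends up as the global config and which as a single feed config. The inline YAML content is a made-up, shortened example.

```ruby
require 'yaml'

# Shortened, made-up stand-in for the contents of feeds.yml.
yaml = <<~YAML
  headers:
    "User-Agent": "html2rss-request"
  feeds:
    myfeed:
      channel:
        url: https://example.com
      selectors:
        items:
          selector: "ul > li"
YAML

config = YAML.safe_load(yaml)

# Feed configs live below the feeds key ...
feed_config = config.fetch('feeds').fetch('myfeed')
# ... and everything else is the global config.
global_config = config.reject { |key, _value| key == 'feeds' }

puts global_config.inspect
puts feed_config.inspect
```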
To display RSS feeds nicely in a web browser, you can:
- add a CSS stylesheet, or
- use XSLT.
A web browser will apply these stylesheets and show the contents as described.
In a CSS stylesheet, you'd use element selectors to apply styles.
If you want to do more, then you need to create an XSLT. XSLT allows you to use an HTML template and to freely design the information of the RSS, including using JavaScript and external resources.
You can add as many stylesheets and types as you like. Just add them to your global configuration.
config = Html2rss::Config.new(
  { channel: {}, selectors: {} }, # omitted
  {
    stylesheets: [
      {
        href: '/relative/base/path/to/style.xls',
        media: :all,
        type: 'text/xsl'
      },
      {
        href: 'http://example.com/rss.css',
        media: :all,
        type: 'text/css'
      }
    ]
  }
)
Html2rss.feed(config)

stylesheets:
  - href: "/relative/base/path/to/style.xls"
    media: "all"
    type: "text/xsl"
  - href: "http://example.com/rss.css"
    media: "all"
    type: "text/css"
feeds:
  # ... omitted
- html2rss does not execute JavaScript.
- Using curl and pup to find the selectors seems efficient (curl URL | pup).
After checking out the repository, run bin/setup to install dependencies. Then, run bundle exec rspec to run the tests.
You can also run bin/console for an interactive prompt that will allow you to experiment.
1. git pull
2. Increase the version in lib/html2rss/version.rb
3. bundle
4. git add Gemfile.lock lib/html2rss/version.rb
5. VERSION=$(ruby -e 'require "./lib/html2rss/version.rb"; puts Html2rss::VERSION')
6. git commit -m "chore: release $VERSION"
7. git tag v$VERSION
8. standard-changelog -f
9. git add CHANGELOG.md && git commit --amend
10. git tag v$VERSION -f
11. git push && git push --tags
Bug reports and pull requests are welcome on GitHub at https://github.com/html2rss/html2rss.