Pure Ruby parser for RSS and Atom feeds, designed to be easy to understand and extend. Uses REXML or a built-in "tag soup" parser. Supports extensions including Dublin Core, iTunes podcasts, and extensions from Google and Feedburner.
= Syndication 0.6
This module provides classes for parsing web syndication feeds in RSS and
Atom formats.
To parse RSS, use Syndication::RSS::Parser.
To parse Atom, use Syndication::Atom::Parser.
If you want my advice on which to generate, my order of preference would
be:
My reasoning is simply that I hate having to sniff for HTML (see
Syndication::RSS).
== License
Syndication is Copyright 2005-2011 mathew [email protected], and is licensed
under the same terms as Ruby.
== Requirements
Built and tested using Ruby 1.9.2. Needs only the standard library.
== Rationale
Ruby already has an RSS library as part of the standard library, so you
might be wondering why I decided to write another one.
I started out trying to document the standard rss module, but found the
code rather impenetrable. It was also difficult to see how it could be made
documentable via Rdoc.
Then I tried writing code to use the standard RSS library, and discovered
that it had a number of (what I consider to be) defects:
* It didn’t support RSS 2.0 with extensions (such as iTunes podcast feeds),
  and it wasn’t clear to me how to extend it to do so.
* It didn’t support RSS 0.9.
* It didn’t support Atom.
* The API is different depending on what kind of RSS feed you are parsing.
I asked around, and discovered that I wasn’t the only person dissatisfied
with the RSS library. Since fixing the problems would have resulted in
breaking existing code that used the RSS module, I opted for an all-new
implementation.
This is the result. The first release was version 0.4, which was actually my
fourth attempt at putting together a clean, simple, universal API for RSS
and Atom parsing. (The first three never saw public release.)
== Features
Here are what I see as the key improvements over the rss module in the
Ruby standard library:
* Supports all RSS versions, including RSS 0.9, as well as Atom.
* Provides a unified API/object model for accessing the decoded data,
  with no need to know what format the feed is in.
* Allows use of extended RSS 2.0 feeds.
* Simple API, fully documented.
* Test suite with over 220 test assertions.
* Commented source code.
* Less source code than the standard library rss module.
* Faster than the standard library (at least, in my tests).
Other features:
* Optional support for RSS 1.0 Dublin Core, Syndication and Content modules,
  Apple iTunes Podcast elements, and Google Calendar.
* Content module decodes CDATA-escaped or encoded HTML content for you.
* Supports namespaces, and encoded XHTML/HTML in Atom feeds.
* Dates are decoded to Ruby DateTime objects. Note, however, that this is
  slow, so parsing is only performed if you ask for the value.
* Simple to extend to support your own RSS extensions; uses reflection.
* Uses the REXML fast stream parsing API for speed, or the built-in TagSoup
  parser for invalid feeds.
* Non-validating; tries to be as forgiving as possible of structural errors.
* Remaps namespace prefixes to standard values if it recognizes the module’s
  URL.
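The prefix remapping in the last item can be pictured roughly as follows. This is a minimal sketch, not Syndication's actual code: the STANDARD_PREFIXES table and the remap_tag helper are hypothetical names chosen for illustration. Only the namespace URLs themselves are the published ones for Dublin Core, the RSS 1.0 Content and Syndication modules, and iTunes.

```ruby
# Hypothetical table mapping well-known module namespace URLs to the
# prefixes conventionally used for them.
STANDARD_PREFIXES = {
  'http://purl.org/dc/elements/1.1/'             => 'dc',
  'http://purl.org/rss/1.0/modules/content/'     => 'content',
  'http://purl.org/rss/1.0/modules/syndication/' => 'sy',
  'http://www.itunes.com/dtds/podcast-1.0.dtd'   => 'itunes'
}.freeze

# Rewrite a tag such as "dublin:creator" to "dc:creator", given the
# prefix => URL declarations seen in the feed. Unknown prefixes and
# unprefixed tags are returned unchanged.
def remap_tag(tag, declared)
  prefix, local = tag.split(':', 2)
  return tag if local.nil?
  standard = STANDARD_PREFIXES[declared[prefix]]
  standard ? "#{standard}:#{local}" : tag
end

# A feed that declares Dublin Core under a non-standard prefix.
declared = { 'dublin' => 'http://purl.org/dc/elements/1.1/' }
puts remap_tag('dublin:creator', declared)
puts remap_tag('title', declared)
```

With a table like this, code that reads the feed only ever needs to know the standard prefix, whatever prefix the feed author happened to declare.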
In the interests of balance, here are some key disadvantages compared with
the standard library RSS support:
* No support for generating RSS feeds, only for parsing them. If
  you’re using Rails, you can use RXML; if not, you can use rss/maker.
  My feeling is that XML generation isn’t a wheel that needs reinventing.
* Different API, not a drop-in replacement.
* Incomplete support for Atom 0.3 draft. (Anyone still using it?)
* No support for base64 data in Atom feeds (yet).
* No Japanese documentation.
* No XSL output options.
* Slower if there are dates in the feed and you ask for their values.
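The date trade-off in the last item comes from parsing dates lazily. The sketch below shows one way that idea could look. It assumes a hypothetical LazyDatedItem class and is not taken from the library's source: the raw string is stored as-is, and DateTime.parse only runs the first time the value is requested.

```ruby
require 'date'

# Hypothetical item class: keeps the raw date string from the feed and
# only pays for DateTime parsing the first time the value is requested.
class LazyDatedItem
  attr_reader :raw_pubdate

  def initialize(raw_pubdate)
    @raw_pubdate = raw_pubdate
    @pubdate = nil
  end

  # Parse on first access, then reuse the cached DateTime.
  def pubdate
    @pubdate ||= DateTime.parse(@raw_pubdate)
  end

  # True once the date has actually been parsed.
  def parsed?
    !@pubdate.nil?
  end
end

item = LazyDatedItem.new('Sat, 07 Sep 2002 00:00:01 GMT')
puts item.parsed?      # nothing has been parsed yet
puts item.pubdate.year # triggers the parse
puts item.parsed?
```

A feed reader that only lists titles never calls pubdate, so it never pays the parsing cost.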
== Other options
There are, of course, other Ruby RSS/Atom libraries out there. The ones I
know about:
=== simple-rss
http://rubyforge.org/projects/simple-rss
Pros:
* Much smaller than syndication or rss.
* Completely non-validating.
* Backwards compatible with rss in standard library.
Cons:
* Doesn’t use a real XML parser.
* No support for namespaces.
* Incomplete Atom support (e.g. you can’t get the name and e-mail of
  atom:person elements as separate fields, and you still have to decode
  XHTML data yourself).
* No documentation.
For the record, I started work on my library long before simple-rss was
announced.
=== feedtools
http://rubyforge.org/projects/feedtools/
This one solves most of the same problems as Syndication; however the two
were developed in parallel, in ignorance of each other.
Feedtools builds in database caching and persistence, and HTTP fetching.
Personally, I don’t think those belong in a feed parsing library; they
are easily implemented using other standard libraries if you want them.
Pros:
* Lots of test cases.
* Used by lots of Rails people.
* Knows about many more namespaces.
* Can generate feeds.
Cons:
* Skimpy documentation.
* Uses HTree then XPath parsing, rather than a single stream parse.
* Tries to unify RSS and Atom APIs, at the expense of Atom functionality.
  (Which could also be a pro, depending on your viewpoint.)
== Design philosophy
Here’s my design philosophy for this module:
* The interface should be via standard Ruby objects and methods; e.g.
  feed.channel.item[0].title, rather than (say) a dictionary hash.
* It should be easier to parse RSS via the module than to hack something
  together using REXML, even if all you want is a list of titles and URLs.
* It should be easy to add support for new RSS extensions without needing
  to know anything about reflection or other advanced topics. Just define
  a mixin with a bunch of appropriately-named methods, and you’re done.
* The code should be simple to understand.
* Even so, good complete documentation is extremely important.
* Be lenient in what you accept.
* Be conservative in what you generate.
* Get well-formed feeds parsing reliably, then worry about broken feeds.
* Atom will hopefully be the future. Provide full support for RSS, but don’t
  hold Atom back by trying to force it into an RSS data model.
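To make the mixin idea above concrete, here is a minimal sketch of reflection-based dispatch. Everything in it is an assumption for illustration: the Geo mixin, the Item class, the store method and the tag-to-method naming rule are hypothetical and are not claimed to match Syndication's internals.

```ruby
# Hypothetical extension mixin: one accessor per element to capture.
module Geo
  attr_accessor :geo_lat, :geo_long
end

# Hypothetical stand-in for a parsed feed item.
class Item
  include Geo
  attr_accessor :title

  # Map a tag like "geo:lat" to the setter "geo_lat=" and call it if it
  # exists. Returns true if the value was stored, false if no method
  # matched, so unknown elements are simply skipped.
  def store(tag, value)
    setter = "#{tag.downcase.gsub(/[:\-]/, '_')}="
    return false unless respond_to?(setter)
    send(setter, value)
    true
  end
end

item = Item.new
item.store('title', 'Hello')
item.store('geo:lat', '42.36')
item.store('unknown:tag', 'ignored')
puts item.title
puts item.geo_lat
```

The extension author only writes the module. The lookup through respond_to? and send stays inside the container class, which is the sense in which no knowledge of reflection is needed.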
== Future plans
Here are some possible improvements:
* Feed generation: create objects, then call Syndication::FeedMaker to
  generate XML in various flavors. This probably won’t happen until an XML
  generator is picked for the Ruby standard library.
* Faster date parsing: when I asked for parsed dates in my test code, the
  profiler showed Date.parse using 25% of the total CPU time. A parser
  specific to ISO 8601 could cut that down drastically.
* Fuller Google Calendar support: I just wanted to be able to display my
  upcoming calendar dates, but clearly there is a lot more that could be
  implemented. Unfortunately, recurring events don’t seem to have a clean XML
  representation in Google’s data feeds yet.
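For the date parsing improvement mentioned above, a format-specific parser might look like the sketch below. The ISO8601_RE pattern and the parse_iso8601 helper are hypothetical and not part of the library. The sketch only handles the common YYYY-MM-DDThh:mm:ss form with Z or a numeric offset, drops fractional seconds, and falls back to DateTime.parse for anything else. Whether it is actually faster would need to be measured.

```ruby
require 'date'

# Hypothetical fast path for timestamps of the form
# YYYY-MM-DDThh:mm:ss followed by Z or a +hh:mm / -hh:mm offset.
# Fractional seconds are matched but discarded.
ISO8601_RE = /\A(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)(?:\.\d+)?(Z|[+-]\d\d:\d\d)\z/

def parse_iso8601(str)
  m = ISO8601_RE.match(str)
  # Fall back to the general-purpose parser for any other format.
  return DateTime.parse(str) unless m
  offset = m[7] == 'Z' ? '+00:00' : m[7]
  DateTime.new(m[1].to_i, m[2].to_i, m[3].to_i,
               m[4].to_i, m[5].to_i, m[6].to_i, offset)
end

puts parse_iso8601('2005-08-31T12:29:29Z')
puts parse_iso8601('2005-08-31T12:29:29-05:00')
```

Atom timestamps use this format, so a fast path like this would cover Atom feeds, while RSS 2.0 dates in the older RFC 822 style would still go through the fallback.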
== Feedback
There are doubtless things I could have done better. Comments, suggestions,
etc are welcome; e-mail [email protected].