LJ Archive

At the Forge

Aggregating with Atom

Reuven M. Lerner

Issue #127, November 2004

Want to give everyone a polite reminder when you have new content on your Web site? Give your site the latest syndication standard and you'll have a new tool to keep visitors coming back.

In the world of organized crime, a syndicate is a collection of gangsters who work together. In the world of newspapers, a syndicate distributes information to subscribers, allowing each publication to tailor the content of information it receives. Comics, news stories and opinion columns often are distributed by syndicates, providing greater exposure for the authors and more content for the readers.

In the past few years, Web developers also have begun to use the term syndicate, as both a verb and a noun. Fortunately for our safety, syndication on the Web has more in common with newspapers than with the mob. But as with organized crime, many people have been hurt in public disputes (albeit with words, not guns), leading to a split and a fair amount of acrimony in the world of Web syndication.

The result of this split is Atom, a new syndication format that has much in common with RSS (rich site summary or RDF site summary, depending on the version and whom you ask). I believe that Atom offers a number of advantages over any version of RSS, and that the simplicity with which Atom feeds can be created makes it an obvious choice over RSS. That said, the fact that most Weblog products provide RSS feeds means that the two camps happily can coexist for now. Understanding how both work also means your organization can decide to adopt one or both standards, depending on your needs.

Some History

As we saw last month, RSS really is two different formats, or more precisely, two different families of formats. RSS 0.9x and RSS 2.0 are from the same family and demonstrate the evolution, over time, of syndication on the Web. RSS 2.0 is maintained mainly by Dave Winer of UserLand, scripting.com and (most recently) Harvard University. Winer has given ownership of the standard to Harvard but also has declared that version 2.0 will be the final one. Nevertheless, the combination of RSS 0.9x and RSS 2.0 represents a widespread, stable and well-understood, if sometimes ambiguous, protocol for syndicating Web content.

A separate flavor of RSS, confusingly known as RSS 1.0, uses the Resource Description Framework (RDF) produced by the World Wide Web Consortium (W3C). RDF is designed to make it possible for computers to understand a site's contents, allowing them to make connections between sites, much as people instinctively do all the time. RSS 1.0 produces a summary that is incompatible with all other versions of RSS, using RDF to produce a standardized description of the site's contents.

The fact that RSS 1.0 used the RSS name caused a great deal of friction and animosity, with many people variously blaming Dave Winer, the vagueness of the RSS specification and the proponents of Atom's predecessor. At the end of the day, a number of prominent individuals, led by Tim Bray, Mark Pilgrim and Sam Ruby, were backed by such companies as Six Apart (which publishes Movable Type software for Weblogs) to produce a specification, initially called PIE and Echo, which attempts to address the shortcomings of RSS.

The development of Atom took some time, because it involved understanding and defining exactly what syndication means on today's World Wide Web. RSS no longer is used only for news sites, its original target, but also for Weblogs and nontextual content. The developers decided to make internationalization a top priority, meaning that it should be possible to produce a syndication feed in any language. Another priority was the development of extensions—that is, it should be possible to add new functionality to the Atom feed without having to redefine the core Atom specification.

As of this writing (mid-August 2004), the Atom specification now exists in version 0.3, along with a standard API for editing content over the network. Atom has begun the process of becoming standardized by the IETF (the Internet Engineering Task Force, which produces and publishes Internet standards), meaning it is on its way to being a universally accepted standard, much like TCP/IP, SMTP or HTTP. This undoubtedly will lead to even greater interest in Atom from organizations that wait for the IETF's stamp of approval.

Atom is still in its initial stages, lacking public specifications for a number of items, such as its extension mechanism. But its authors have, to date, produced a standard whose complexity is fairly close to that of RSS 0.9x and 2.0 and that is written in as unambiguous a fashion as possible. The effort includes many members of the Web syndication community and offers a vision of syndication that goes far beyond the Web.

Producing an Atom Feed

Although RSS was designed to summarize a news feed or Weblog, Atom was created with a more general purpose in mind. For example, factory machines could produce status reports in Atom, with an aggregator displaying those that are malfunctioning. Libraries could produce Atom feeds of the latest additions to their collections, with smart aggregators looking for books on certain subjects. Fax machines could be replaced by fax modems, using Atom to distribute fax images to appropriate groups of people.

You even could use Atom feeds to create a newspaper publishing system, where reporters send their stories not as e-mail, but instead publish drafts on an Atom feed. Each editor would aggregate Atom feeds from the reporters under his or her control, moving them onto an outgoing Atom feed when the editing was complete. The final feed would end up in the production department, where the text would be laid out and made ready for actual printing. The newspaper's content flow thus would be a flow of many Atom feeds into a single, final feed representing the newspaper itself.

Producing an Atom feed is fairly simple, if you use Perl or another high-level language for which an Atom library exists. Perl, for example, has the XML::Atom module, available from CPAN (Comprehensive Perl Archive Network). I had a bit of trouble installing XML::Atom on my machine running Fedora Core 2 and Perl 5.8.3, but I was able to work around it by ignoring the optional DateTime module during the installation process. I would not recommend doing so in a production environment.

Although XML::Atom is the overall package name, programs that create Atom feeds actually use XML::Atom::Feed and XML::Atom::Entry. Here is a short Perl program that produces a simple feed, based in part on the sample program in the perldoc on-line documentation for XML::Atom::Feed:

#!/usr/bin/perl

use strict;
use diagnostics;
use warnings;

use XML::Atom::Feed;
use XML::Atom::Entry;

# Create a new Atom feed
my $feed = XML::Atom::Feed->new;
$feed->title('My Weblog');

my $entry;
# Create a first entry for the feed
$entry = XML::Atom::Entry->new;
$entry->title('First Post');
$entry->content('First Post Body');
$feed->add_entry($entry);

# Create a second entry for the feed
$entry = XML::Atom::Entry->new;
$entry->title('Second Post');
$entry->content('Second Post Body');
$feed->add_entry($entry);

# Now produce the XML output
my $atom_feed_xml = $feed->as_xml;

# Display the XML output
print $atom_feed_xml, "\n";

The above program produces the following feed, which I have formatted with extra whitespace for easier reading:


<?xml version="1.0"?>
<feed xmlns="http://purl.org/atom/ns#">
<title>
    My Weblog
</title>
<entry >
    <title>
    First Post
    </title>
    <content mode="xml">
    <default:div xmlns="http://www.w3.org/1999/xhtml">
        First Post Body
    </default:div>
    </content>
</entry>
<entry >
    <title>
    Second Post
    </title>
    <content mode="xml">
    <default:div xmlns="http://www.w3.org/1999/xhtml">
        Second Post Body
    </default:div>
    </content>
</entry>
</feed>

As you can see, we create a single XML::Atom::Feed object, containing one or more instances of XML::Atom::Entry. Each entry object corresponds to a single <entry> tag in the Atom feed, which in turn represents a single entry in our Weblog or a single message from our factory floor.

The Atom specification indicates that the feed may contain a number of attributes and sub-elements, including a language, a description of the Weblog or site, copyright information and other general information about the originating site. Each entry, in turn, has its own set of elements, such as a title, an indication of when it was created and a summary. Each content-bearing element also carries a MIME type indicating what type of content it contains, much like HTTP responses and e-mail attachments.
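As a rough sketch of how some of these elements might be set with XML::Atom, the following program adds an identifier, an issue date, a summary and an author to an entry. The tag: identifiers, the date and the author details are made-up placeholders, and the accessor names assume the Atom 0.3 vocabulary that XML::Atom supports, so check the perldoc for your installed version before relying on them:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Atom::Feed;
use XML::Atom::Entry;
use XML::Atom::Person;

# Feed-level information (placeholder values)
my $feed = XML::Atom::Feed->new;
$feed->title('My Weblog');
$feed->id('tag:example.com,2004:weblog');

# The author of the entry (hypothetical person)
my $author = XML::Atom::Person->new;
$author->name('Jane Reporter');
$author->email('jane@example.com');

# Entry-level information
my $entry = XML::Atom::Entry->new;
$entry->title('First Post');
$entry->id('tag:example.com,2004:weblog/1');
$entry->issued('2004-08-15T12:00:00Z');
$entry->summary('A short teaser for the first post');
$entry->content('First Post Body');
$entry->author($author);
$feed->add_entry($entry);

# Produce and display the XML output
my $xml = $feed->as_xml;
print $xml, "\n";
```

An aggregator that shows only headlines could then display the title and summary, fetching the full content only when a reader asks for it.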

Of course, creating a feed, as in the above example, is necessary only if you are writing a new Atom-powered application or if you are adding Atom capabilities to a Weblog product. Most Weblog products now provide Atom feeds, either as part of their standard distribution or through a plugin or other extension mechanism. For example, an Atom feed plugin for the Blosxom Weblog product makes it easy to add such a feed to a Weblog; install the plugin (by placing it in the plugins directory), and anyone interested in receiving an Atom feed from the Weblog in question will be able to do so.

It shouldn't come as a surprise that this is so easy to accomplish, given the fact that Blosxom is written in Perl, that Perl provides excellent tools for working with XML and that the plugin simply needs to summarize and rewrite content from the most recent entries in the Weblog. Because Blosxom makes it so easy for plugins to modify the main page (so as to advertise the Atom feed) and to retrieve content (through the plugin API), it might be slightly easier to work with Atom from that product. Given that most Weblog products are written in a high-level language, such as Perl, Python or PHP, it should be easy to add an Atom feed where none currently exists.

Parsing an Atom Feed

To parse an Atom feed, either because we are writing an aggregator or because we want to create an Atom-powered application, we have several options. The easiest way is to continue to use XML::Atom::Feed to discover and retrieve feeds, for example:

#!/usr/bin/perl

use strict;
use diagnostics;
use warnings;

use XML::Atom::Feed;

# Get the Atom feeds for www.diveintomark.org
my @uris =
    XML::Atom::Feed->find_feeds(
        "http://www.diveintomark.org/");

# Print each Atom feed URI
foreach my $uri (@uris)
{
    print "uri = '$uri'\n";
}

In the above example, we see a single URI printed. Now that we know where the feed is, we can get a list of links in it, turning those links into XML:

#!/usr/bin/perl

use strict;
use diagnostics;
use warnings;

use URI;
use XML::Atom::Feed;

# Discover the Atom feeds advertised by the site
my @uris = XML::Atom::Feed->find_feeds("http://www.diveintomark.org/");

foreach my $uri (@uris)
{
    # Retrieve and parse the feed at this URI
    my $feed = XML::Atom::Feed->new(URI->new($uri));

    # In list context, link() returns every link in the feed
    my @links = $feed->link();

    foreach my $link (@links)
    {
        my $link_xml = $link->as_xml();
        print "link = '$link_xml'\n";
    }
}

Of course, we don't have to produce or display XML; we can parse the link information, sending new links to subscribers by e-mail, adding them to a database or ignoring those that fail to meet certain criteria.
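Here is a minimal sketch of that kind of filtering. To avoid depending on the network, it builds a small feed in memory rather than retrieving one; the example.com addresses and the choice of "text/html" as the criterion are assumptions made purely for illustration:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Atom::Feed;
use XML::Atom::Link;

# Build a small feed in memory (placeholder addresses)
my $feed = XML::Atom::Feed->new;
$feed->title('My Weblog');

foreach my $spec (
    [ 'alternate',    'text/html',            'http://example.com/' ],
    [ 'service.post', 'application/atom+xml', 'http://example.com/atom/post' ],
)
{
    my $link = XML::Atom::Link->new;
    $link->rel($spec->[0]);
    $link->type($spec->[1]);
    $link->href($spec->[2]);
    $feed->add_link($link);
}

# Keep only the links that point to human-readable pages
my @wanted = grep { ($_->type || '') eq 'text/html' } $feed->link;

# Print what survived the filter; a real program might send
# e-mail or insert a database row here instead
foreach my $link (@wanted)
{
    print "rel = '", $link->rel, "', href = '", $link->href, "'\n";
}
```

The same grep could test the rel or href values instead, depending on which links matter to your subscribers.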

Because Atom feeds are so regular, and because they operate using Internet standards such as XML, Unicode and MIME, we can be confident that the content our program parses can be handled in straightforward ways. We can farm out different content types to different handlers, parse them in different ways and even (as in the newspaper example above) place them onto new feeds, becoming a super-aggregator.
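One simple way to farm out content by type is a dispatch table keyed on the MIME type. The sketch below uses only core Perl; the item hashes, the handler behavior and the sample bodies are all hypothetical stand-ins for whatever your application extracts from a feed:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Map each MIME type to the code that knows how to handle it
my %handler_for = (
    'text/plain' => sub { my ($body) = @_; return "text: $body" },
    'text/html'  => sub {
        my ($body) = @_;
        $body =~ s/<[^>]+>//g;    # crude tag stripping, for display only
        return "html: $body";
    },
    'image/tiff' => sub {
        my ($body) = @_;
        return 'image: ' . length($body) . ' bytes';
    },
);

# Hand an item to the handler for its type, or report that it was skipped
sub dispatch
{
    my ($item) = @_;
    my $handler = $handler_for{ $item->{type} };
    return "skipped: no handler for $item->{type}" unless $handler;
    return $handler->($item->{body});
}

# Hypothetical items, as they might be extracted from feed entries
my @items = (
    { type => 'text/plain', body => 'Machine 7 is running normally' },
    { type => 'text/html',  body => '<p>New book: <i>Learning Perl</i></p>' },
    { type => 'audio/mpeg', body => '...' },
);

print dispatch($_), "\n" foreach @items;
```

Supporting a new content type then means adding one entry to the table, without touching the code that walks the feed.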

If you are interested in creating an aggregator or in understanding how to work with the myriad different versions of RSS and Atom, it also is worth looking at Mark Pilgrim's feed parser. Written in Python and constantly updated, this is probably the best-documented open-source engine for working with syndication feeds.

RSS or Atom?

So, should your Web site (or Weblog) provide syndication feeds in RSS, in Atom or in both? It is clear to me that Atom is the best of the two (or three) syndication format families produced to date. Dave Winer's RSS formats were groundbreaking when they were released, but they have too many problems to form the basis of full-fledged, enterprise-ready standards. We have seen the agony that results from half-baked standards, such as early versions of HTML and JavaScript, and given that syndication stands a good chance of becoming an important communication mechanism, completeness and unambiguity are important factors to consider.

It is similarly important to consider the growing international use of the Internet and that people want to syndicate media other than text. Atom's lack of ambiguity regarding special characters is another big step forward, ensuring that we can include < and > in our Weblog entries without having to worry about the implications for syndication. Most important, the planned provisions for extensions will make it possible for Atom to meet the needs of specific groups and applications without opening the entire specification anew.
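To make the point about special characters concrete, here is a small core-Perl sketch of how a program might escape an entry body before placing it in a content element by hand. The escape_xml function is my own helper, not part of any library, and the mode="escaped" attribute reflects my reading of the Atom 0.3 draft; a library such as XML::Atom normally takes care of these details for you:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Replace the characters that have special meaning in XML.
# The ampersand must be handled first, or the entities produced
# by the later substitutions would themselves be escaped again.
sub escape_xml
{
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# A Weblog entry body containing all three special characters
my $body = 'In Perl, 1 < 2 && 2 > 1 is true';

# Wrap the escaped body in a content element
my $content = '<content type="text/plain" mode="escaped">'
            . escape_xml($body)
            . '</content>';

print $content, "\n";
```

Because the feed states how the content was encoded, an aggregator knows to reverse the escaping exactly once and does not have to guess whether a given &lt; is markup or text.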

Although Atom is remarkably complete, it is also straightforward to use. A great deal of time and energy clearly has been put into making Atom as easy to use as possible. Creating a new API is not a simple task, particularly when it is meant to be as general as possible.

Finally, the mess of RSS version numbers that resulted in (and from) petty and political arguments has served no one very well. The name might literally be an issue of semantics, but because Atom is called something different, it reduces the confusion that developers and users alike face when working with RSS.

Conclusion

Atom is an attempt to solve many of the problems associated with RSS and to turn syndication into a building block for new types of high-level communication across Internet applications. Atom is slightly more complicated than Dave Winer's versions of RSS, but it is less complicated (in its initial version) than RSS 1.0, which used RDF to describe and summarize Web sites. The combination of easy-to-use software tools for working with Atom feeds, its extensibility and the authors' commitment to being a part of the Internet standards community makes it clear that Atom will play a key role in the future of Web communication.

Resources for this article: /article/7751.

Reuven M. Lerner, a longtime Web/database consultant and developer, now is a graduate student in the Learning Sciences program at Northwestern University. His Weblog is at altneuland.lerner.co.il, and you can reach him at reuven@lerner.co.il.
