I wrote a Ruby script that detects RSS feeds in HTML.

I was curious how browsers detect the RSS feeds they show in the address bar, so I looked into it.

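Apparently, if the HEAD of the HTML contains a link element of the form <link rel="alternate" type="application/rss+xml" title="..." href="...">, the browser picks it up as an RSS feed.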
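
To make the format concrete, here is a minimal sketch that reads such link elements out of a small, made-up page head. It uses REXML from the standard Ruby distribution rather than nokogiri so that it runs without extra gems, which means it only copes with well-formed markup. The sample HTML, its URLs, and the names SAMPLE_HEAD, FEED_TYPES and feed_links are all assumptions for illustration.

```ruby
require 'rexml/document'

# Hypothetical page head; the titles and URLs are made up for illustration.
SAMPLE_HEAD = <<~HTML
    <html>
      <head>
        <title>Example blog</title>
        <link rel="stylesheet" type="text/css" href="/style.css" />
        <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://example.com/index.xml" />
        <link rel="alternate" type="application/atom+xml" title="Atom" href="http://example.com/atom.xml" />
      </head>
      <body></body>
    </html>
HTML

# type values treated as feeds in this sketch
FEED_TYPES = ["application/rss+xml", "application/atom+xml"].freeze

# Collect every <link rel="alternate"> whose type is one of FEED_TYPES.
# REXML needs well-formed markup, so this is not a general HTML parser.
def feed_links(markup)
    doc = REXML::Document.new(markup)
    links = REXML::XPath.match(doc, "//link").select do |link|
        link.attributes["rel"] == "alternate" &&
            FEED_TYPES.include?(link.attributes["type"])
    end
    links.map do |link|
        { "title" => link.attributes["title"], "url" => link.attributes["href"] }
    end
end

feed_links(SAMPLE_HEAD).each do |feed|
    puts "#{feed['title']} : #{feed['url']}"
end
```

The stylesheet link is skipped because its rel is not "alternate", which is the same filtering idea the XPath expressions in the script rely on.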

It looked doable with nokogiri, so I implemented it in Ruby.

・RSSAutoDiscovery.rb

# encoding: utf-8
require 'nokogiri'

class RSSAutoDiscovery
    # XPath expressions that match feed autodiscovery links
    RSS_XPATH  = '//link[@rel="alternate"][@type="application/rss+xml"]'
    ATOM_XPATH = '//link[@rel="alternate"][@type="application/atom+xml"]'

    # Returns an array of {"title" => ..., "url" => ...} hashes, RSS feeds first
    def self.discover(html)
        # parse the HTML string into a document
        doc = Nokogiri::HTML(html)

        # discover rss and atom
        discover_feed(doc, RSS_XPATH) + discover_feed(doc, ATOM_XPATH)
    end

    def self.discover_feed(doc, feed_xpath)
        # build one hash per matching link element from its title and href
        doc.xpath(feed_xpath).map do |link|
            {"title" => link["title"], "url" => link["href"]}
        end
    end
    # a bare `private` does not affect methods defined with `def self.`
    private_class_method :discover_feed
end

・Sample.rb

# encoding: utf-8
require 'open-uri'
require_relative 'RSSAutoDiscovery'

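# Fetch the HTML
def getHTML(url)
    html = nil

    begin
        # Kernel#open no longer opens URLs on Ruby 3.0 and later; use URI.open
        html = URI.open(url).read
    rescue OpenURI::HTTPError => ex
        # tolerate 304 Not Modified; re-raise any other HTTP error
        if ex.io.status[0] == "304"
            warn ex.message
        else
            raise ex
        end
    end

    html
end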


# Main
urls = Array.new
urls << "http://d.hatena.ne.jp/yukihir0/"
urls << "http://www.lifehacker.jp/"

urls.each do |url|
    html = getHTML(url)

    if html.nil?
        puts "can't get html"
    else
        feeds = RSSAutoDiscovery.discover(html)

        puts "--- #{url} ---"
        feeds.each do |feed|
            puts "#{feed['title']} : #{feed['url']}"
        end
        # closing rule with the same width as the header line
        puts "-" * (url.length + 8)
        puts
    end

end

・Output

%ruby Sample.rb

--- http://d.hatena.ne.jp/yukihir0/ ---
RSS : http://d.hatena.ne.jp/yukihir0/rss
RSS 2.0 : http://d.hatena.ne.jp/yukihir0/rss2
---------------------------------------

--- http://www.lifehacker.jp/ ---
RSS 2.0 : http://www.lifehacker.jp/index.xml
Atom : http://www.lifehacker.jp/atom.xml
---------------------------------

I picked a few sites and tried the script on them. Some pages specified atom rather than rss in the type attribute, so I made it support both.

I don't know whether this is correct according to the spec.
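
One way to picture "supporting both" is a lookup table from the type attribute to a feed format. The following is only a sketch under my own assumptions: the names FEED_FORMATS and feed_format and the labels :rss and :atom are made up here, and trimming and lowercasing the value is a choice I made because media type names are case-insensitive.

```ruby
# Map from the link element's type attribute to a feed format label.
# The labels (:rss, :atom) are names chosen for this sketch.
FEED_FORMATS = {
    "application/rss+xml"  => :rss,
    "application/atom+xml" => :atom
}.freeze

# Returns :rss, :atom, or nil when the type is not a known feed type.
# Surrounding whitespace and letter case are ignored before the lookup.
def feed_format(type_attr)
    return nil if type_attr.nil?
    FEED_FORMATS[type_attr.strip.downcase]
end

["application/rss+xml", "Application/Atom+XML", "text/css", nil].each do |type|
    puts "#{type.inspect} => #{feed_format(type).inspect}"
end
```

With a table like this, adding another feed type would mean adding one entry rather than another XPath constant.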

nokogiri is so handy.