Ruby gets a stylish HTML scraper – scrAPI

By Peter Cooper / July 12, 2006

The indefatigable Assaf Arkin has done it again by developing a new Ruby HTML scraping toolkit, scrAPI. Peter Szinek recently wrote a popular article about scraping from Ruby using Manic Miner, RubyfulSoup, REXML, and WWW::Mechanize, but none of these are as immediately useful as scrAPI. So why?

scrAPI lets you scrape from HTML using CSS selectors. For example, here's Assaf's example, which defines two scrapers: one that extracts a single auction, and one that collects every auction on an eBay results page:

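# Scraper for a single auction row
ebay_auction = Scraper.define do
    # Link in the heading: its text is the description, its href the URL
    process "h3.ens>a", :description=>:text,
                        :url=>"@href"
    # Text of the price cell
    process "td.ebcPr>span", :price=>:text
    # src attribute of the thumbnail image
    process "div.ebPicture >a>img", :image=>"@src"

    # Fields returned for each auction
    result :description, :url, :price, :image
end

# Scraper for the whole results page
ebay = Scraper.define do
    # Collect every match into an array instead of keeping just one value
    array :auctions

    # Hand each matching table row to the auction scraper above
    process "table.ebItemlist tr.single",
            :auctions => ebay_auction

    result :auctions
end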

Now that the scrapers are defined, you can put them into action like so:

# html holds the source of an eBay search results page
auctions = ebay.scrape(html)

# No. of auctions found
puts auctions.size

# First auction:
auction = auctions[0]
puts auction.description
puts auction.url
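
To make the journey from HTML to result objects a little more concrete, here's a rough sketch that does not use scrAPI at all. It leans on REXML from Ruby's standard library, with hand-written XPath standing in for the CSS selectors above. The HTML fragment, the Auction struct, and the first_node and scrape_auctions helpers are all made up for illustration; the markup is merely shaped to fit the selectors and is not real eBay output.

```ruby
require "rexml/document"

# Hypothetical, well-formed fragment shaped to fit the selectors above.
# It is NOT real eBay markup; the items, prices and paths are made up.
PAGE = <<~HTML
  <table class="ebItemlist">
    <tr class="single">
      <td><div class="ebPicture"><a href="/item/1"><img src="/img/1.jpg"/></a></div></td>
      <td><h3 class="ens"><a href="/item/1">iPod nano 2GB</a></h3></td>
      <td class="ebcPr"><span>$99.00</span></td>
    </tr>
    <tr class="single">
      <td><div class="ebPicture"><a href="/item/2"><img src="/img/2.jpg"/></a></div></td>
      <td><h3 class="ens"><a href="/item/2">iPod nano 4GB</a></h3></td>
      <td class="ebcPr"><span>$149.00</span></td>
    </tr>
  </table>
HTML

# Plays the role of the object built by `result :description, :url, :price, :image`
Auction = Struct.new(:description, :url, :price, :image)

# Hypothetical helper: first node matching the XPath, or an error if none does
def first_node(context, xpath)
  REXML::XPath.first(context, xpath) || raise(ArgumentError, "no match for #{xpath}")
end

# Hypothetical helper: one Auction per matching row, like `array :auctions`
def scrape_auctions(html)
  doc  = REXML::Document.new(html)
  rows = REXML::XPath.match(doc, "//table[@class='ebItemlist']//tr[@class='single']")
  rows.map do |row|
    link  = first_node(row, ".//h3[@class='ens']/a")            # h3.ens>a
    price = first_node(row, ".//td[@class='ebcPr']/span")       # td.ebcPr>span
    image = first_node(row, ".//div[@class='ebPicture']/a/img") # div.ebPicture >a>img
    Auction.new(link.text, link.attributes["href"], price.text, image.attributes["src"])
  end
end

auctions = scrape_auctions(PAGE)
puts auctions.size            # 2
puts auctions[0].description  # iPod nano 2GB
puts auctions[0].url          # /item/1
```

Unlike this sketch, scrAPI takes the CSS selectors directly and is fed real-world HTML, while REXML only accepts well-formed XML, so treat this as a picture of the input and output shapes rather than a replacement.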

It's a simple example, but one with serious power. Go get scrAPI and play.

Comments

  1. Danno says:

    I'm not on the up and up with Page Scraping, how does this compare to _why's Hpricot?

  2. Peter Cooper says:

    Hpricot lets you pull certain elements from a page programmatically, whereas this kinda bundles that sort of functionality into a reusable pattern. So rather than 'get this, then get this', this is 'get each of these things and return them to me in a solid lump'.

  3. Michael @ SEOG says:

    That looks really interesting. Do you think you could post an example with the original HTML as well? So that we can see from original document, to scrAPI code, to the final output?

    It looks like it might be a much more elegant solution for those of us looking to build databases of information from other sites and need an easier way to do that.

    thanks!

  4. assaf says:

    Michael,

    The original HTML for this example is an eBay page with search results. For the demo I did, I just searched for "iPod nano", saved the page and ran this code on the saved page.
