scRUBYt – Hot, New Ruby Web-Scraping Toolkit Released

By Peter Cooper / February 6, 2007

For the past few months Peter Szinek has been giving me lots of tasty tidbits about his forthcoming scRUBYt Web-scraping toolkit, and now it has finally been released to the public! Peter describes scRUBYt as "WWW::Mechanize and Hpricot on Steroids" and this description is pretty bang on.

As well as providing a simple DSL for performing Web actions (clicking links, submitting forms, etc.), one of scRUBYt's most impressive features is that you can give it 'example' data, from which it will extrapolate a search pattern and then find any other similar data within the same page. This is demonstrated perfectly by Peter's basic example:

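ebay_data = Scrubyt::Extractor.define do
  # Navigate: open eBay, search for 'ipod', then follow the 'Apple iPod' link
  fetch 'http://www.ebay.com/'
  fill_textfield 'satitle', 'ipod'
  submit
  click_link 'Apple iPod'

  # One sample record; similar records on the page are found from it
  record do
    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
    price '$71.99'
  end

  # Follow the 'Next >' link, at most 5 times
  next_page 'Next >', :limit => 5
end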

This code goes to ebay.com, searches for "ipod", follows the "Apple iPod" link, and then extracts every record on the results page, using the single sample record only as a pattern to learn from. It then proceeds through up to 5 more pages of records, returning them all as an XML dataset.
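To make the "learn from one example" idea a bit more concrete, here is a toy sketch in plain Ruby. It is not scRUBYt's actual algorithm or API: the Node class and the each_with_path, learn_path and extract_similar helpers are all hypothetical names invented for this illustration, and the "page" is a tiny hand-built tree rather than real HTML. The sketch assumes the simplest possible rule: find the element holding the example text, remember its path of tag names, and return every element on the same path.

```ruby
# Toy illustration only. Node, each_with_path, learn_path and
# extract_similar are hypothetical names, not part of scRUBYt.
class Node
  attr_reader :tag, :text, :children

  def initialize(tag, text = nil, children = [])
    @tag = tag
    @text = text
    @children = children
  end
end

# Yield every node together with its path of tag names from the root.
def each_with_path(node, path = [], &block)
  here = path + [node.tag]
  block.call(node, here)
  node.children.each { |child| each_with_path(child, here, &block) }
end

# "Learn" a rule: the tag path of the first node whose text is the example.
def learn_path(root, example_text)
  each_with_path(root) { |node, path| return path if node.text == example_text }
  nil
end

# Apply the learned rule: collect the text of every node on the same path.
def extract_similar(root, example_text)
  pattern = learn_path(root, example_text)
  return [] if pattern.nil?

  matches = []
  each_with_path(root) do |node, path|
    matches << node.text if path == pattern && node.text
  end
  matches
end

# A hand-built stand-in for a results page (made-up data).
page = Node.new('html', nil, [
  Node.new('h1', 'Search results'),
  Node.new('ul', nil, [
    Node.new('li', nil, [Node.new('b', 'iPod mini'), Node.new('span', '$71.99')]),
    Node.new('li', nil, [Node.new('b', 'iPod nano'), Node.new('span', '$129.00')])
  ])
])

names  = extract_similar(page, 'iPod mini')
prices = extract_similar(page, '$71.99')
```

Because the learned rule here is a path through the page structure rather than the example text itself, it keeps matching when names or prices change and only breaks when the structure does. A real toolkit would need far more robust rules than a bare tag path.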

If this all floats your boat, there's a lot to explore. Start off with the official site and Peter's comprehensive announcement. Peter also has a lengthy tutorial available which makes good reading.

Comments

  1. Tom Sparplan says:

    This is rather impressive. The end result could just as well be done with Curl, but this way, it is a lot clearer to understand in the source. On the downside, this script will stop working when pricing changes or the item does not show on the first page any more.

  2. Peter Szinek says:

    Tom,

    This is simply not true :-)

    As Peter also pointed out, this is just a dummy example. The system learns how to extract similar examples, then the learned rules are extracted, and those are agnostic to the original example, so they will keep working until the page *structure* changes. At that point you must provide up-to-date examples to learn the new rules... I'm working on automating this, btw.
