Easy Web Spidering in Ruby with Anemone

By Ric Roberts / July 2, 2009

Anemone is a free, multi-threaded Ruby web spider framework from Chris Kite, which is useful for collecting information about websites. With Anemone you can write tasks that generate interesting statistics on a site just by giving it the URL.

Its only dependency is Nokogiri (an HTML and XML parser). Other than that, you just need to install the gem to get started with Anemone's simple syntax, which, among other things, lets you tell it which pages to process (based on regular expressions) and define callbacks.
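For example, a crawl can be told to skip links matching certain patterns and to fire a callback only on pages whose URLs match others, using Anemone's skip_links_like and on_pages_like methods. A minimal sketch (the URL patterns here are invented for illustration):

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Don't follow links matching these patterns (hypothetical examples)
  anemone.skip_links_like %r{/login}, %r{\.pdf$}

  # Run this callback only on pages whose URLs match the pattern
  anemone.on_pages_like(%r{/articles/}) do |page|
    puts "Matched: #{page.url}"
  end
end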

This example, taken from Anemone's homepage, prints out the URL of every page on a site:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Called once for every page fetched during the crawl
  anemone.on_every_page do |page|
    puts page.url
  end
end

The bin folder in the project contains some more in-depth examples, including tasks for counting the number of unique pages on a site, counting the pages at each depth of a site, and listing the URLs encountered. There's also a combined task which wraps up a few of these, intended to be run as a daily cron job.
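To give a flavour of how those tasks work: the after_crawl callback hands you a hash mapping every URL encountered to its page object, so counting unique pages is straightforward. A rough sketch, assuming that hash-like interface:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.after_crawl do |pages|
    # pages maps each crawled URL to its Page object
    puts "Unique pages: #{pages.size}"
    pages.each_key { |url| puts url }
  end
end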

You can install Anemone as a gem or get the source from GitHub, of course, and there's some fairly comprehensive RDoc documentation available in the source or online.
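Installing the gem is the usual one-liner:

gem install anemone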

Comments

  1. ben says:

    Cool, but has anyone made it work? From the given example, I just got the homepage, no crawl.

    Searching further, I got on to spidr, which seems to do the same job with much the same syntax; it just fails.

    So, for the time spent, disappointed…

  2. Ric says:

    Hi Ben. I gave some of the examples a try, and they worked for me.

  3. Soleone says:

    Just for your information: it seems there's also a dependency on facets.

  4. Soleone says:

    Hmm, I get only one link when trying the example like this:

    Anemone.crawl("http://www.rubyinside.com") { |a| a.on_every_page{|p| puts p.url} }

    => http://www.rubyinside.com

  5. Harry says:

    Soleone: try adding a slash after the URL, like: "http://www.rubyinside.com/"

  6. Ric Roberts says:

    Try that last example with a trailing slash on the URL. Not sure why, but this seems to make a difference. :)

  7. Soleone says:

    Okay, the new version (0.0.6) doesn't have the trailing slash problem anymore, nice!

  8. Carlos Valencia says:

    I like it. It is too simple to be true.

  9. Glenn Gillen says:

    Nice.

    If all you're looking for is to take a mirror of a site, you can simply do:

    wget -m http://www.rubyinside.com/

    If you just want to spider all your links to make sure nothing is broken:

    wget --spider http://www.rubyinside.com/

    But if you want to do anything more useful, this looks like a pretty simple approach. Will have to give it a look.
