Nokogiri: A Faster, Better HTML and XML Parser for Ruby (than Hpricot)
Yesterday, Aaron Patterson (@tenderlove) and Mike Dalessio released Nokogiri (GitHub repository), a new HTML and XML parser for Ruby. It "parses and searches XML/HTML faster than Hpricot" (Hpricot being the current de facto Ruby HTML parser) and boasts XPath support, CSS3 selector support (a big deal, because CSS3 selectors are mega powerful) and the ability to be used as a "drop in" replacement for Hpricot.
On an Hpricot vs Nokogiri benchmark, Nokogiri clocked in at 7 times faster at initially loading an XML document, 5 times faster at searching for content based on an XPath, and 1.62 times faster at searching for content via a CSS-based search. These are impressive results, since Hpricot was previously considered to be quite speedy itself. (Update - November 3, 2008: WHY FIGHTS BACK! HPRICOT IN PERFORMANCE BUSTING SHOCKER!!)
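If you want to produce "N times faster" figures of your own, something along these lines would do it. This is only a rough sketch using Ruby's standard Benchmark module, not the benchmark the Nokogiri developers ran; compare_speeds and the two stand-in workloads are hypothetical names invented here.

```ruby
require 'benchmark'

# Time each named callable and return how many times slower it is
# than the fastest candidate (the fastest one scores 1.0).
def compare_speeds(candidates, iterations: 100)
  timings = candidates.transform_values do |callable|
    Benchmark.realtime { iterations.times { callable.call } }
  end
  # Guard against a zero reading on very coarse clocks.
  fastest = [timings.values.min, Float::EPSILON].max
  timings.transform_values { |seconds| (seconds / fastest).round(2) }
end

# Stand-in workloads; swap in real parsing or searching calls to compare libraries.
ratios = compare_speeds(
  { quick: -> { 1 + 1 }, slow: -> { sleep 0.002 } },
  iterations: 5
)
ratios.each { |name, ratio| puts "#{name}: #{ratio}x" }
```

To compare parsers, each lambda would wrap one library's call on the same document, for example loading the file in one run and searching it in another.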
The code examples provided on the introduction post give you the basic idea, and the library can be installed using gem install nokogiri (though this didn't work for me on OS X - further instructions below).
Installing on OS X
Note! Developer Aaron Patterson responded to the issues below in an update to the library. Now doing a regular gem install of Nokogiri should work fine. The information below remains in place solely for historical / reference purposes.
When I ran sudo gem install nokogiri, I encountered multiple problems on OS X. Perhaps it'll work first time for you, but if not, here are some pointers. (Bear in mind, I run the default Ruby that comes with OS X - no special configurations. If you're running Ruby from DarwinPorts, etc, the following might not work at all.)
Trying to install the gem failed after "checking for racc... no". I assumed from the line that followed that it was trying to download and install racc itself, but it wasn't. You need to download and install racc yourself. The latest tarball for that is at http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz - download this, and open a Terminal. Continue along these lines:
tar xzvf racc-1.4.5-all.tar.gz
cd racc-1.4.5-all
sudo ruby setup.rb config
sudo ruby setup.rb setup
sudo ruby setup.rb install
Trying to install the gem at this point still won't work, as for some reason the racc executable has ended up in /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin rather than /usr/bin proper. My solution to that was to add that directory to my path in ~/.bash_profile - but you might prefer to symbolically link it. Your choice. If you have no ~/.bash_profile and you're following these instructions blindly, just create one containing:
PATH=$PATH:/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin
export PATH
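Before retrying the gem install, it can be worth checking that racc really is reachable from your path (in a new Terminal window, so the profile change is picked up). Here's a small sketch of that check in plain Ruby; find_executable_on_path is a hypothetical helper written for this post, not part of RubyGems or racc.

```ruby
# Return the full path of the first executable called `name` found in the
# PATH-style string `path`, or nil if there isn't one.
def find_executable_on_path(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

racc = find_executable_on_path('racc')
puts racc ? "racc found at #{racc}" : 'racc is not on PATH'
```

If it reports that racc is not on PATH, the directory added above either hasn't been picked up by your shell yet or isn't where racc was installed on your machine.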
Next, something called "frex" is also missing. This is more easily installed with gem:
sudo gem install aaronp-frex -s http://gems.github.com
Once this is done, then nokogiri should finally install with gem:
sudo gem install nokogiri
Then start irb and give require 'nokogiri' a try to make sure.
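If you'd rather run that check as a script than type it into irb, a sketch like this reports the outcome instead of blowing up with a stack trace. library_status is a hypothetical helper name made up for this example.

```ruby
# Try to load a library and describe what happened rather than raising.
def library_status(name)
  require name
  "#{name} loaded OK"
rescue LoadError => e
  "#{name} failed to load: #{e.message}"
end

puts library_status('nokogiri')
```

A "failed to load" message here usually means the gem install didn't complete, so it's worth scrolling back through the install output for the first error.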
Please leave any corrections, suggestions, or cries for help in the comments. Thanks!