Nokogiri: A Faster, Better HTML and XML Parser for Ruby (than Hpricot)

By Peter Cooper / October 31, 2008

Yesterday, Aaron Patterson (@tenderlove) and Mike Dalessio released Nokogiri (GitHub repository), a new HTML and XML parser for Ruby. It "parses and searches XML/HTML faster than Hpricot" (Hpricot being the current de facto Ruby HTML parser) and boasts XPath support, CSS3 selector support (a big deal, because CSS3 selectors are mega powerful) and the ability to be used as a "drop in" replacement for Hpricot.

On an Hpricot vs Nokogiri benchmark, Nokogiri clocked in at 7 times faster at initially loading an XML document, 5 times faster at searching for content based on an XPath, and 1.62 times faster at searching for content via a CSS-based search. These are impressive results, since Hpricot was previously considered to be quite speedy itself. (Update - November 3, 2008: WHY FIGHTS BACK! HPRICOT IN PERFORMANCE BUSTING SHOCKER!!)

The code examples in the introduction post give you the basic idea, and the library can be installed using gem install nokogiri (though this didn't work for me on OS X; further instructions below).

Installing on OS X

Note! Developer Aaron Patterson responded to the issues below in an update to the library. Now doing a regular gem install of Nokogiri should work fine. The information below remains in place solely for historical / reference purposes.

Upon trying sudo gem install nokogiri, I encountered multiple problems on OS X. Perhaps it'll work first time for you, but if not, here are some pointers. (Bear in mind, I run the default Ruby that comes with OS X - no special configurations. If you're running Ruby from DarwinPorts, etc, the following might not work at all.)

Trying to install the gem failed after "checking for racc... no". I assumed from the line that followed that it was trying to download and install racc itself, but it isn't. You need to download and install racc yourself. The latest tarball for that is at http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz - download this, and open a Terminal. Continue along these lines:

tar xzvf racc-1.4.5-all.tar.gz
cd racc-1.4.5-all
sudo ruby setup.rb config
sudo ruby setup.rb setup
sudo ruby setup.rb install

Trying to install the gem at this point still won't work, as for some reason the racc executable has ended up in /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin rather than /usr/bin proper. My solution to that was to add that directory to my path in ~/.bash_profile - but you might prefer to symbolically link it. Your choice. If you have no ~/.bash_profile and you're following these instructions blindly, just put this in ~/.bash_profile:

PATH=$PATH:/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin
export PATH
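
To confirm that the PATH change took effect (open a new Terminal window first so ~/.bash_profile is re-read), a small Ruby script along these lines can look for the racc executable. This is only a sketch: find_executable_on_path is a hypothetical helper written for this post, not part of racc or RubyGems.

```ruby
# Return the full path of the first executable called `name` found in the
# directories listed in `path`, or nil if there is none.
# (Hypothetical helper for illustration only.)
def find_executable_on_path(name, path = ENV['PATH'].to_s)
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

racc = find_executable_on_path('racc')
if racc
  puts "racc found at #{racc}"
else
  puts 'racc is not on the PATH yet'
end
```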

Next, something called "frex" is also missing. This is more easily installed with gem:

sudo gem install aaronp-frex -s http://gems.github.com

Once this is done, then nokogiri should finally install with gem:

sudo gem install nokogiri

Run up irb and give require 'nokogiri' a try to make sure.
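
If you would rather script the check than type it into irb, something like the following should do. It is a minimal sketch; loadable? is a hypothetical helper name, and the require 'rubygems' line is there because Ruby 1.8 does not load RubyGems automatically.

```ruby
require 'rubygems' # needed on Ruby 1.8 so that installed gems can be found

# Try to require a library and report whether it loaded.
# (Hypothetical helper for illustration only.)
def loadable?(name)
  require name
  true
rescue LoadError
  false
end

if loadable?('nokogiri')
  puts 'nokogiri loaded OK'
else
  puts 'nokogiri could not be loaded'
end
```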

Please leave any corrections, suggestions, or cries for help in the comments. Thanks!

Comments

  1. Caius Durling says:

    Indeed it won't install racc for you, the error message (rather cryptically) is telling you to install it manually.

    checking for racc... no
    need racc, get the tarball from http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz

  2. Peter Cooper says:

    Yeah, I saw that. I was assuming the following compilation error was it trying to install Racc. Usually if a gem has dependencies that fail, that's the end of it.. but in this case it carries on and tries to compile itself without all the dependencies being satisfied - which is rather odd.

  3. Yehuda Katz says:

    Kudos to AP and Mike for finally releasing this. Merb is going to be making use of it for speedier test helpers and more compliant CSS3.

    Between nokogiri and webrat, Merb tests in 1.0 are going to be a world better.

  4. Martijn says:

    Looks good, but Hpricot runs on JRuby!

  5. Aaron Patterson says:

    If the gem doesn't install with a vanilla OS X, please let me know. It is a bug, and I will fix it.

    I'm not down with making the installation so complex. Not to mention, I consider the code on github to be unstable.

  6. Peter Cooper says:

    It installs, once the dependencies are resolved.

    I believe my OS X and Ruby install to be reasonably vanilla. I do have a stackload of gems installed, but I'm running the regular OS X supplied Ruby and RubyGems otherwise. I'll give it a whirl on my newish MacBook Pro that I don't really use for Ruby dev..

  7. Peter Cooper says:

    On the MBP now - getting a different error on here.

    ..

    Building native extensions. This could take a while...
    ERROR: Error installing nokogiri:
    ERROR: Failed to build gem native extension.

    rake RUBYARCHDIR=/Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1/lib RUBYLIBDIR=/Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1/lib
    (in /Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1)
    rake aborted!
    undefined method `add_development_dependency' for #
    /Library/Ruby/Gems/1.8/gems/nokogiri-1.0.1/rakefile:19:in `new'

    ..

    I think this error is probably because the default version of RubyGems on OS X is still 1.0.1, whereas on my other machine I'm running 1.2.0.

    I've just run the update for RubyGems, and it's now at 1.3.1. gem install nokogiri now gives me the same error as it did on the other machine:

    ..

    checking for racc... no
    need racc, get the tarball from http://i.loveruby.net/archive/racc/racc-1.4.5-all.tar.gz
    *** extconf.rb failed ***

    ..

    So - yeah - it's just dependencies. Once Racc and Frex are installed, it should be fine.

    The only way it could be more seamless is if Racc was gemified and included as a gem dependency.. and if Frex was also a gem dependency, so that gem would install them both automatically.

  8. Aaron Patterson says:

    Thanks Peter. Actually, neither of those should be dependencies. They are build time dependencies and not runtime dependencies. I've found the problem and @jbarnette is fixing it.

  9. Aaron Patterson says:

    Okay. A new gem is pushed. Once the gem index refreshes, you should be able to install version 1.0.2 without any dependencies.

  10. Michael Risser says:

    An additional error happened when installing the gem with MacPorts Ruby 1.8.6:

    Building native extensions. This could take a while...
    ERROR: Error installing nokogiri:
    ERROR: Failed to build gem native extension.

    rake RUBYARCHDIR=/opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1/lib RUBYLIBDIR=/opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1/lib
    rake aborted!
    no such file to load -- hoe
    /opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1/rakefile:4
    (See full trace by running task with --trace)
    (in /opt/local/lib/ruby/gems/1.8/gems/nokogiri-1.0.1)

    This was easily fixed by doing:
    sudo gem install hoe

  11. Michael Risser says:

    Also for the Newbies (like me) out there, when you run up irb to test the install, before requiring nokogiri do 'require "rubygems"'. This one ALWAYS trips me up :-)

  12. Kevin Marsh says:

    FYI: Tried a simple gem install nokogiri at 4:50 EDT on All Hallows Eve, installed and runs flawlessly on my MacBook Pro running stock Ruby 1.8.

  13. Peter Cooper says:

    Awesome, Kevin. I've added a note to the post to indicate that my instructions are now obsolete.

  14. Pistos says:

    I should point out that I had to upgrade hoe to version 1.8.2 before the nokogiri gem installation proceeded.

  15. Dong Zhang says:

    I am on windows xp, got this error

    D:\Documents and Settings\dzhang2>gem install nokogiri
    Bulk updating Gem source index for: http://gems.rubyforge.org/
    Building native extensions. This could take a while...
    ERROR: Error installing nokogiri:
    ERROR: Failed to build gem native extension.

    rake RUBYARCHDIR=c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib RUBYLIBDIR=c:
    /ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib
    (in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2)
    rake aborted!
    couldn't find HOME environment -- expanding `~/.hoerc'
    c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/rakefile:20:in `new'
    (See full trace by running task with --trace)

    Gem files will remain installed in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2
    for inspection.
    Results logged to c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/gem_make.out

    *********************************************

    anybody got it installed on windows?

    thanks

  16. aphe says:

    What's the right way to use Nokogiri::XML with
    namespaces?

    Is it necessary to register namespaces as libxml-rb does?
    http://libxml.rubyforge.org/rdoc/classes/LibXML/XML/XPath.html

    bye

  17. Mike Dalessio says:

    Hi Aphe,

    For handling XML namespaces, Aaron and I tried to make it a little simpler than the libxml-style namespace registration.

    You should be able to make a query like:

    xml = Nokogiri::XML.parse(...)
    tires = xml.xpath('//bike:tire', {'bike' => 'http://schwinn.com/'})

    more generally, the xpath() method takes an optional second argument which is a hash of namespace-alias => URL.

    You can take a look at some of the test cases for more details. We're working on more complete documentation!

  18. Mike Dalessio says:

    @Dong,

    You should be able to avoid that (common) hoe error message by setting a phony HOME environment variable.

    Try running:

    set HOME=foo

    before installing!

  19. Dong Zhang says:

    Mike

    thanks for the tip. now, I am getting a different error

    D:\Documents and Settings\dzhang2>gem install nokogiri
    Bulk updating Gem source index for: http://gems.rubyforge.org/
    Building native extensions. This could take a while...
    ERROR: Error installing nokogiri:
    ERROR: Failed to build gem native extension.

    rake RUBYARCHDIR=c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib RUBYLIBDIR=c:
    /ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/lib
    (in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2)
    rake aborted!
    undefined method `add_development_dependency' for #
    c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/rakefile:20:in `new'
    (See full trace by running task with --trace)

    Gem files will remain installed in c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2
    for inspection.
    Results logged to c:/ruby/lib/ruby/gems/1.8/gems/nokogiri-1.0.2/gem_make.out

    thanks
    Dong

  20. Jamie McLaughlin says:

    Dong:

    It looks like your rubygems is not up to date. I had the same problem, so I downloaded the latest version of rubygems from rubyforge and installed it.

    You can try:

    gem update --system

    I have Ubuntu, so that doesn't work for me.

  21. SeanJA says:

    Oh dear, your info is out of date already... hpricot is now faster...

    http://hackety.org/2008/11/03/hpricotStrikesBack.html

  22. Dong Zhang says:

    Mike

    that is it! after updating my rubygems, the install went through.

    thanks
    Dong

  23. Dong Zhang says:

    sorry, previous message should be to Jamie.

    Jamie, appreciate your help.

  24. Lawrence says:

    Nokogiri certainly is not better at "it just installs!". On a clean ubuntu 8.10 install hpricot installs just fine, while nokogiri installs with loads of issues, see above, and mine is different:

    $ sudo gem install nokogiri
    Building native extensions. This could take a while...
    ERROR: Error installing nokogiri:
    ERROR: Failed to build gem native extension.

    rake RUBYARCHDIR=/usr/lib/ruby/gems/1.8/gems/nokogiri-1.0.3/lib RUBYLIBDIR=/usr/lib/ruby/gems/1.8/gems/nokogiri-1.0.3/lib
    (in /usr/lib/ruby/gems/1.8/gems/nokogiri-1.0.3)
    /usr/lib/ruby/gems/1.8/gems/rake-0.8.3/lib/rake/gempackagetask.rb:13:Warning: Gem::manage_gems is deprecated and will be removed on or after March 2009.
    checking for xmlParseDoc() in -lxml2... no
    checking for xsltParseStylesheetDoc() in -lxslt... no
    checking for libxml/xmlversion.h in /usr/include/libxml2,/usr/include/libxml2... no
    need libxml
    *** extconf.rb failed ***
    Could not create Makefile due to some reason, probably lack of
    necessary libraries and/or headers.

    Installed libxml and libxml2, still getting same error.

    Nokogiri: big fail !

  25. Si says:

    Plans for JRuby support? Looks like this will cause problems for Webrat, who just switched to Nokogiri.

  26. Jamie McLaughlin says:

    Lawrence - Try to install the libxml-dev and libxml2-dev packages. That way the header files are available.

    checking for libxml/xmlversion.h in /usr/include/libxml2,/usr/include/libxml2... no

    Seems to point to a missing header.
