December 17, 2006

Extending ActiveRecord
Post by Zach Dennis of Continuous Thinking

This day of the Ruby Advent Calendar is by Zach Dennis of Continuous Thinking.

ActiveRecord does alot for you, but it can do more, alot more.

ActiveRecord::Extensions extends ActiveRecord... for you. It gives you features that ActiveRecord itself doesn't give, like:

Using A Hash To Build Query's With Ruby's Builtin Types

class Post < ActiveRecord::Base ; end

Post.find( :all, :conditions=>{ 
  :title => "Title",                           # title='Title'
  :author_contains => "Zach",                  # author like '%Zach%'
  :author_starts_with => "Zach",               # author like 'Zach%'
  :author_ends_with => "Dennis",               # author like '%Zach'
  :published_at => (Date.now-30 .. Date.now),  # published_at BETWEEN xxx AND xxx
  :rating => [ 4, 5, 6 ],                      # rating IN ( 4, 5, 6 )
  :rating_not_in => [ 7, 8, 9 ]                # rating NOT IN( 4, 5, 6 )
  :rating_ne => 4,                             # rating != 4
  :rating_gt => 4,                             # rating > 4
  :rating_lt => 4,                             # rating < 4
  :content => /(a|b|c)/                        # REGEXP '(a|b|c)'
)

# These examples are given for the MySQL adapter. Other adapters are
# are supported, please refer to the ActiveRecord::Extensions 
# documentation for more information.
		

ActiveRecord::Extensions gives you support for Numbers, Strings, Arrays, Ranges and Regexps. It also gives you support for suffix-based query modifiers:

  • _lt for less than
  • _gt for greater than
  • _eq for equals (this is implied if not given)
  • _ne for not equals. This works for Numbers, Strings and Regexps
  • _not_in for NOT IN ( .. ) and NOT BETWEEN ... AND ....

Defining Your Own Custom Query Objects

class InsuranceClaim < ActiveRecord::Base ; end

class InsuranceClaimAgeAndTypeQuery
  def to_sql
     "age_in_days BETWEEN 1 AND 60 AND claim_type IN( 'typea', 'typeb' )"
  end
end

claims = InsuranceClaim.find( :all, InsuranceClaimAgeAndTypeQuery.new )

claims = InsuranceClaim.find( :all, :conditions=>{
  :claim_amount_gt => 30000,
  :age_and_type => InsuranceClaimAgeAndTypeQuery.new } 
)
		

ActiveRecord::Extensions gives you the ability to wrap custom SQL or complex SQL in a object that responds to the method 'to_sql'. Duck typing at its finest. We'll refer to any object that responds to 'to_sql' as the custom query object.

This gives you two forms of 'find' as shown above.

The first form gives you the ability to pass in an custom query object to find itself with no :conditions.

The second form gives you the ability to pass in the custom query object as part of the :conditions Hash. The beauty here is that the Hash key can be anything which makes the code more readable. You can make the key make perfect sense in the context of your application as the custom query object will be used to generate the conditions part of your query.

Importing Mass Amounts Of Data

ActiveRecord executes a single INSERT statement for every call to 'create' and for every call to 'save' on a new model object. When you have only a handful of records to create or save this is not a big deal, but what happens when you have hundreds, thousands or hundreds of thousands?

If you've ever tried to save lots of data using ActiveRecord you've probably noticed that it can take minutes, up to several hours to insert your data.

ActiveRecord::Extensions is solving this issue.

class Student < ActiveRecord::Base ; end

column_names = Student.columns.map{ |column| column.name }
value_sets = some_method_to_load_data_from_csv_file( 'students.csv' )
options = { :valudate => true }

Student.import( column_names, value_sets, options )
		

The import functionality is practically self-explanatory. It takes alot of data and imports it into your database. It's API splits the definition of column names and value sets on purpose.

The array of column names is used so you only have to define once where you want data to go.

The value sets is an array of arrays, further seen in the below example:

column_names = %W( name age )
value_sets = [ [ 'John', 19 ], [ 'Jane', 23 ], [ 'Doug', 21 ] ]
		

The options hash can be used to further optimize. In the first example the options was set to the Hash "{ :validate => true }". By default :validate is true. This tells ActiveRecord extensions to validate all of the data using the validations on your models before inserting the data. It even returns the failed data in an array of model objects!

You'd think this was slow because every value set is validated, but benchmarks have shown that you still get around a 10x speedup when using the 'import' method and enforcing validations. Even though you take a performance hit for performing validation, the performance benefit that you get with the 'import' method makes up for it.

Most of the time when you are mass loading data you aren't necessarily validating it because you're receiving a consistent data feed from a vendor or partner, or you just pulled it out of some other system, so if you set ":validation=>false" you'd get a faster optimization. And this is true. Benchmarks with MySQL's MyISAM and InnoDb table types without validations show up to a 48x performance increase.

To see the benchmarks for MySQL please check out http://www.continuousthinking.com/are/import-benchmarks-for-mysql

Database Adapter Support and Compatibility

Databases like MySQL allow you to use multiple value insert statements to improve INSERT statement efficiency. ActiveRecord::Extensions uses the import method to generate more efficient INSERT statements. ActiveRecord::Extensions even goes as far as asking MySQL what the maximum packet size allowed is. It will generate the most efficient and the least amount of INSERT statements for you.

PostgreSQL has multiple value insert statements as of 8.2 which was just released a week ago, it is coming soon to ActiveRecord::Extensions support.

All other database adapters are supported, because ActiveRecord::Extensions uses a compatible underlying implementation to generate single INSERT statements. If your database adapter isn't supported for more efficient INSERT statements yet you can still use the 'import' method. It will be equivalent (or slightly) faster then using ActiveRecord itself. This gives you the flexibility of upgrading ActiveRecord::Extensions later to receive the benefits of 'import' without having to code any changes.

ActiveRecord Design Considerations

ActiveRecord itself is a modular component. It was designed that way by its creator, DHH. Because of this you can use it in your Rails application or completely without Rails and in some data mining or system administration application or script.

While ActiveRecord maintains this modularity at a high level, it has some core features that would benefit from this modularity at a lower level, which it currently now lacks.

ActiveRecord does a great job of maintaining it's database agnostic implementation, but somethings like finder SQL generation is ingrained into ActiveRecord itself in a way that forces you to only have access to a subset of database features that you will benefit from.

Keep It Componentized, Make It Modular

ActiveRecord itself should and could be more modular. Finder SQL generation can benefit from delegating the SQL generation to other objects which can handle them for database specific implementations.

ActiveRecord::Extensions rewrites how ActiveRecord itself handles SQL generation, by using an extension registry. It implements all find conditions to SQL generation using extensions. This is also what gives ActiveRecord::Extensions the ability to support things like Arrays, Ranges and Regular Expressions.

A finder extension itself only needs to respond to the 'process' method. Here's an example of how Range support is added to ActiveRecord::Extensions:

class RangeExt
  NOT_IN_RGX = /(.+)_(ne|not|not_in)/

  def self.process( key, val, caller )
    if val.is_a?( Range )
      match_data = key.to_s.match( NOT_IN_RGX )
      key = match_data.captures[0] if match_data
      fieldname = caller.connection.quote_column_name( key )
      min = caller.connection.quote( val.first, caller.columns_hash[ key ] )
      max = caller.connection.quote( val.last, caller.columns_hash[ key ] )
      str = "#{caller.table_name}.#{fieldname} #{match_data ? 'NOT ' : '' } BETWEEN #{min} AND #{max}"
      return Result.new( str, nil )
    end
    nil      
  end

end

ActiveRecord::Extensions.register RangeExt, :adapters=>:all  
		

When ActiveRecord comes across a Range object as apart of the :conditions Hash it will look to the ActiveRecord::Extensions registry to find something to process it. The above RangeExt class will answer the request, and it will generate a Result which it passes back to ActiveRecord. And it's done. Your Range has been processed.

When you register an extension you also pass in what adapters it is supported for. It can be single symbol for the name of your adapter (ie: :mysql, :postgresql, etc. ), an array of symbols or the symbol :all to denote all adapters. This allows ActiveRecord to utilize different implementations for the same type of functionality since database vendors don't use the same dialect on everything. An example of this currently is Regular Expression support.

There are a few key points to why this is essential to ActiveRecord as a libray:

One. People don't have to muck around in ActiveRecord overriding methods, adding new methods or anything of the like to add custom query support.

Two. You don't have to understand ActiveRecord internals to extend ActiveRecord and get the functionality you need.

Three. Changes to ActiveRecord itself are more limited in scope. The risk of breaking other parts of ActiveRecord is greatly reduced. For adding finder :conditions support you don't even have to open up any part of active_record code itself, you just code to the extensions API, which consists of defining a method that takes three parameters, and then making a method call to register it.

Four. By keeping ActiveRecord components modular where it makes sense (like SQL generation) you allow for more developers to contribute more robust and feature-ful implementations for the many database adapters, because it's easier to do, and faster to write.

It'd be nice if all database vendors supported the same dialect of SQL, but they don't. You can use delegation to achieve the benefits of different underlying implementations in ActiveRecord while benefiting from a consistent interface to developers of the ActiveRecord library.

You can write a simple extension and then register it. If you wanted to handle any other built-in Ruby type, or if you wanted to handle your own custom class, it can be done easily.

If you think that ActiveRecord would benefit from this type of design and functionality you can:

  • Use ActiveRecord::Extensions
  • Tell others about ActiveRecord::Extensions
  • Join the ActiveRecord::Extensions team! There are several opportunities to help develop, write tests, write documentation and/or design a new web site)

ActiveRecord::Extensions

ActiveRecord::Extensions is a ruby library built to extend ActiveRecord, it can be installed as a ruby library or as a Rails plugin.

Tests

ActiveRecord::Extensions is developed using TDD. It has a whole suite of tests proving it works and acting as superb developer documenation. It includes rake tasks for each supported adapter:

# rake test:<adapter_name> ../path/to/activerecord

rake test:mysql 
		

Compatibility with ActiveRecord

ActiveRecord::Extensions maintains full compatibility with the ActiveRecord test suite. It is so important to maintain compatibility with ActiveRecord that there is a way to test it.

# rake test:activerecord:<adapter_name> ../path/to/activerecord

rake test:activerecord:mysql ../rails_trunk/activerecord
		

API, Design and Adapter Support

ActiveRecord::Extensions has gone through several stages of design and prototyping up to it's 0.0.6 release. Since it maintains full compatibility with ActiveRecord test suites and because of it's modular design all finder and import functionality is compatible with any adapter.

Whether a special feature or optimization is available for your adapter you'll want to check to make sure the adapter is officially supported. For example, MySQL and PostgreSQL have support for regular expression searching. If you use Oracle you won't get this feature but you will get support for Arrays and Ranges.

It's design and API is now quite stable at it's 0.1.0 release. It maintains a low version number because only MySQL and PostgreSQL adapters are supported. As the time of this writing a Firebird adapter is being developed. As more adapters are supported it will increase it's way towards 1.0 status.

ActiveRecord::Extensions is developed using TDD and exploratory SPIKES. The best documentation for it is it's clean and easy to read unit tests.