Data wrangling and Ruby metaprogramming

Posted on May 29, 2011

I needed to combine customer data from 2 separate sources (a homebrew warehouse and CRM) and output a report. Given I’d be dealing with similar objects with overlapping attributes, it would be really easy to make a gigantic mess with duplicated code everywhere. Instead, we’ll see how Ruby’s metaprogramming capabilities come to the rescue to help us write DRY code.

The setting

As mentioned in the introduction, we have customer information in a data warehouse (that we access through Ruby’s ActiveRecord ORM) and in SugarCRM (accessed via the SugarCRM Ruby gem). To make things slightly confusing, SugarCRM stores customer information in its Contact module.

The customer class in the data warehouse contains sparse information: first and last names, a customer key (used to identify the customer within the ERP system), and relationships to the customer’s various transactions (in this case, purchases and repairs of physical goods).

The CRM record for a customer will contain contact information (address, email, etc.), information on whether or not the contact information is valid (e.g., the address is actually deliverable), and contact preferences (such as "do not mail"). In case of conflict between the data warehouse and CRM records (which we match via the "customer key"), the CRM record’s information is to be kept, as it is considered to be the most up to date.
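The "CRM wins" rule can be sketched with plain hashes. This is only an illustration: the attribute names and values below are made up, and the real records are ActiveRecord and SugarCRM objects rather than hashes.

```ruby
# Made-up sample data; the real records are not plain hashes.
warehouse = { key: "C100", first_name: "Jon", last_name: "Smith" }
crm       = { first_name: "John", last_name: nil, email: "john@example.com" }

# Hash#merge keeps the argument's value when a key exists on both sides,
# so the CRM data takes precedence. Hash#compact drops nil values first,
# so an empty CRM field does not erase a value known to the warehouse
# (that last part is an assumption about the desired behavior).
merged = warehouse.merge(crm.compact)

puts merged.inspect
```

Whether an empty CRM field should override the warehouse value is a judgment call. Dropping the `compact` call gives the stricter "CRM always wins" behavior.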

Just to keep you on your toes, customers are sometimes entered multiple times within the ERP (due to spelling errors, etc.) and are therefore present multiple times in the data warehouse. These duplicates are consolidated in the CRM through a "duplicate keys" field: a single CRM record holds the consolidated information for every customer listed in its "customer key" and "duplicate keys" fields. In other words, when searching for a given customer in CRM, both of these fields must be considered.

The goal

We’d like to have an Excel report showing a list of customers who haven’t purchased anything in the last 2 years, and provide all relevant information regarding them: contact info, last purchase, etc.

The code

We’re going to use a class to contain all the information we’re interested in for each customer. That way, we can create a new instance, enrich it with information (coming from the data warehouse and CRM), and then dump it into the Excel sheet.

(You can get the full code here.)

Attributes, send, and respond_to?

The first thing we do is define the attributes we’re interested in:

    ATTRIBUTES = [:abbreviated_name, :key, :first_name, :last_name, ...]

Using an array allows us to control their order (which is useful to display them in a certain order in the report). It also enables us to do several convenient things:

  • iterate on the array to add the column headers in the Excel worksheet (lines 83-87):

        def add_headers(ws)
          ATTRIBUTES.each_with_index do |a, i|
            ws.Cells(1, i + 1).Value = a.to_s
          end
        end
    
  • use a splat to define attribute accessors on all our attributes (line 13):

        attr_accessor *ATTRIBUTES
    
  • enrich our object with only the attributes we’re interested in (line 24):

        (ATTRIBUTES - [:abbreviated_name, :key]).each do |a|
          send("#{a}=", attributes[a.to_s])
        end
    

As you can see on line 24 (and in several other places in the code), Object#send is quite handy: by using dynamic dispatch, we don’t need to repeat the same boilerplate code for each attribute we want to assign.
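To see the pieces working together outside of the full script, here is a minimal, self-contained sketch. The class name `ReportRow` and the method name `enrich_from` are made up for this example, and `attributes` is assumed to be a hash with string keys (which is what ActiveRecord’s `attributes` method returns).

```ruby
class ReportRow
  # Shortened attribute list, for illustration only.
  ATTRIBUTES = [:abbreviated_name, :key, :first_name, :last_name]

  # The splat turns the array into individual arguments.
  attr_accessor(*ATTRIBUTES)

  # `attributes` is assumed to be a Hash with string keys.
  def enrich_from(attributes)
    (ATTRIBUTES - [:abbreviated_name, :key]).each do |a|
      # Builds "first_name=", "last_name=", ... and calls that setter.
      send("#{a}=", attributes[a.to_s])
    end
    self
  end
end

row = ReportRow.new.enrich_from(
  "first_name" => "Ada",
  "last_name"  => "Lovelace",
  "key"        => "skipped on purpose"
)

puts row.first_name
puts row.key.inspect
```

Adding a new attribute to the report then only means adding a symbol to `ATTRIBUTES`: the accessor and the assignment come for free.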

On line 24, we know that the object has a setter for each attribute, because we’re passing in a list of attributes we’re interested in (and have defined getters and setters via the attr_accessor class method on line 13). But what if we have no idea about the attributes that are coming in?

Ruby provides another method to help out, and we use it on line 18: Object#respond_to? will tell us if the object has such a method. In other words, we invoke the setter method only if it exists. Another place we leverage Object#respond_to? is on line 17, where we determine if the argument given is a list of attributes, or an object.
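Both uses of Object#respond_to? can be sketched as follows. This is a guess at the general shape rather than a copy of the actual code: `GuardedCustomer` and `FakeRecord` are hypothetical names, and the "object" case is assumed to expose an `attributes` method like an ActiveRecord model does.

```ruby
# Hypothetical stand-in for an ActiveRecord-like object.
FakeRecord = Struct.new(:attributes)

class GuardedCustomer
  attr_accessor :first_name, :last_name

  # `source` is either a Hash of attributes, or an object that
  # responds to `attributes` and returns such a Hash.
  def initialize(source = {})
    attrs = source.respond_to?(:attributes) ? source.attributes : source
    attrs.each do |name, value|
      setter = "#{name}="
      # Only invoke the setter if it exists; unknown attributes are skipped.
      send(setter, value) if respond_to?(setter)
    end
  end
end

from_hash   = GuardedCustomer.new("first_name" => "Ada", "shoe_size" => 38)
from_object = GuardedCustomer.new(FakeRecord.new({ "last_name" => "Hopper" }))

puts from_hash.first_name
puts from_object.last_name
```

Without the `respond_to?` guard, the unknown `shoe_size` attribute would raise a NoMethodError.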

As you can see on line 30, by carefully naming the object’s attributes, we can once again use dynamic dispatch to enrich our object, and reduce the copy/paste/tweak coding that you might have to deal with in less dynamic languages. On that same line, you can also see that we call the send method on an object different from self to retrieve the information we want, then pass the result as an argument to the send call on self, which sets the value.
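A small sketch of that "send on one object, send the result to another" pattern, under made-up names: the `Purchase` struct, the `last_purchase_` prefix, and the class `CustomerSummary` are all assumptions chosen for this example, not names taken from the actual code.

```ruby
# Hypothetical source object.
Purchase = Struct.new(:date, :amount)

class CustomerSummary
  attr_accessor :last_purchase_date, :last_purchase_amount

  def enrich_with_last_purchase(purchase)
    [:date, :amount].each do |field|
      # Inner send reads the value from the purchase,
      # outer send writes it to the matching setter on self.
      send("last_purchase_#{field}=", purchase.send(field))
    end
    self
  end
end

summary = CustomerSummary.new.enrich_with_last_purchase(
  Purchase.new("2009-03-14", 120)
)

puts summary.last_purchase_date
puts summary.last_purchase_amount
```

The trick only works because the names line up: `date` on one side, `last_purchase_date` on the other. If you’d rather not bypass method visibility, `public_send` is a drop-in alternative that refuses to call private methods.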

Reopening classes

Ruby allows you to reopen classes, which is quite useful. You can see an example starting on line 54 where we reopen an external library to add some convenient methods. I’ll say this again, because it bears repeating: code that was written and packaged by someone else, that I’ve simply required (on line 3), can be reopened to add/change methods. Awesome!
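As a minimal illustration (not the code from line 54), here is the standard library’s Date class being reopened to add a helper. The method name `older_than_years?` is invented for this example.

```ruby
require "date"

# Date is defined by the standard library; we reopen it and add a method.
class Date
  # True if this date is more than `years` years before `today`.
  def older_than_years?(years, today = Date.today)
    self < today.prev_year(years)
  end
end

reference = Date.new(2011, 5, 29)

puts Date.new(2008, 5, 1).older_than_years?(2, reference)
puts Date.new(2010, 1, 1).older_than_years?(2, reference)
```

Keep in mind that a reopened class is changed for every piece of code running in the same process, so it pays to pick method names that are unlikely to clash.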

Another case where classes have been reopened (but that isn’t readily apparent) is on line 111. As explained above, multiple customer records can be consolidated into one single CRM record. Naturally, since it’s specific to this particular implementation, functionality to search across the various "customer key" fields isn’t provided by the SugarCRM gem. Instead, I’ve expanded the gem’s functionality by reopening the Contact class and adding the helper method (as explained in the post on advanced gem use).
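The idea can be sketched without the gem. Everything below is a stand-in: `FakeCRM::Contact` is NOT the SugarCRM gem’s API, `find_by_any_key` is an invented name, and the "duplicate keys" field is assumed to hold a comma-separated string.

```ruby
# --- Pretend this part ships in a gem (hypothetical stand-in) ---
module FakeCRM
  class Contact
    attr_reader :customer_key, :duplicate_keys

    def initialize(customer_key, duplicate_keys = nil)
      @customer_key   = customer_key
      @duplicate_keys = duplicate_keys
    end

    # In-memory "database" for the example.
    def self.all
      @all ||= []
    end
  end
end

# --- Our own code: reopen the class and add the helper ---
module FakeCRM
  class Contact
    # Every ERP key this record answers to.
    # Assumes duplicate keys are stored as "C205, C317".
    def all_keys
      ([customer_key] + duplicate_keys.to_s.split(","))
        .compact.map(&:strip).reject(&:empty?)
    end

    # Searches both the "customer key" and "duplicate keys" fields.
    def self.find_by_any_key(key)
      all.find { |contact| contact.all_keys.include?(key) }
    end
  end
end

FakeCRM::Contact.all << FakeCRM::Contact.new("C100", "C205, C317")
FakeCRM::Contact.all << FakeCRM::Contact.new("C400")

puts FakeCRM::Contact.find_by_any_key("C317").customer_key
```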

Where to learn more

If you want to have your mind blown by the possibilities of Ruby metaprogramming and level up your programming wizardry, a good place to start is this book: Metaprogramming Ruby 2: Program Like the Ruby Pros.

Alternatively, if you prefer to study code, there are plenty of good examples on GitHub: ActiveRecord, ActiveSupport, etc. Another example would be the SugarCRM gem: modules, their attributes, and dynamic methods (such as SugarCRM::Account.find_by_name) are determined/defined dynamically through metaprogramming (and based closely on the ActiveRecord implementation).
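To give a taste of how dynamic finders of this kind can work, here is a toy version built on `method_missing`. It is one common technique, not a description of how the SugarCRM gem or ActiveRecord actually implement theirs. `TinyModel` and its records are made up.

```ruby
class TinyModel
  # Made-up in-memory records.
  RECORDS = [
    { name: "Acme",   city: "Austin" },
    { name: "Globex", city: "Boston" }
  ]

  # Called by Ruby when no method with the given name exists.
  def self.method_missing(name, *args, &block)
    if (match = name.to_s.match(/\Afind_by_(\w+)\z/))
      field = match[1].to_sym
      RECORDS.find { |record| record[field] == args.first }
    else
      super # keep the normal NoMethodError for everything else
    end
  end

  # Keeps respond_to? in sync with method_missing.
  def self.respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("find_by_") || super
  end
end

puts TinyModel.find_by_name("Globex")[:city]
puts TinyModel.respond_to?(:find_by_city)
```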

