I’ve been working on integrating Rails and datatables, following a range of very good tutorials on the web, the best of which I think is the railscast on the topic.
An area of discussion has been whether to go with server-side or client-side processing. In a separate post I’ll talk about that, but for now we’re going client-side. That means we need Rails configured to push a reasonable number of rows (say 1,000-10,000) down to the browser reasonably quickly. This post discusses some work I’ve done to achieve that: with a sample table plus four joins to subsidiary tables, I’ve moved from 14s to 1.4s, which I think is a worthwhile improvement.
Let’s start off with some caveats, as you always should in any performance tuning discussion. The results here are probably not correct in absolute terms – a fully configured production environment would likely be faster in some ways and slower in others. Key areas of difference are likely to be:
- My server has abysmal IO performance, and a relatively slow processor that is then further slowed by being virtualised
- My database is very small, so it’s likely the entire query is being served from cache. I could work out how to clear the cache between runs, but haven’t
- My database is very small so it’s likely that even when we go to disk it takes only a few physical reads to get the data, whereas (given my database architecture at the moment) in real life the data would be sprinkled all over the place (which is also a pointer to a future performance tuning thing I probably need to do, another post perhaps)
- The development configuration of Rails writes a lot of logging and reloads application classes on every request. The production configuration should be significantly faster
Bottom line here, I think the relativity is correct, the absolute results are probably not. But 10x faster is likely to stay around 10x faster even with those other issues.
Let’s next talk about my data model and query. I created a set of four tables: a transactional data table with associations to reference data – create-user and update-user associations, plus associations to two reference data tables. I’m assuming we’re a bookstore selling books, since the Pragmatic Programmer series uses a bookstore as its example:
- Books – a table that holds a list of books
- Users – the books table has links to create_user and update_user; each book has one create user and one update user
- Book_types – each book has a type (for argument’s sake, fiction or non-fiction)
- Book_weights – each book has a weight category (for argument’s sake, heavy, medium and light, and maybe we’re using this for shipping – don’t ask why we wouldn’t store the weight itself, this is an example)
We create these tables, including some data in the reference data tables. We then load 1,000 rows into the books table and run the standard Rails index page, with the exception that we’ve configured each row to show the linked reference data’s name.
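For reference, the associations look something like this – a sketch only, assuming conventional foreign-key columns (book_type_id, book_weight_id, create_user_id and update_user_id on books):

```ruby
# Sketch of the models – assumes conventional Rails column naming.
class User < ActiveRecord::Base; end
class BookType < ActiveRecord::Base; end
class BookWeight < ActiveRecord::Base; end

class Book < ActiveRecord::Base
  belongs_to :book_type
  belongs_to :book_weight
  # The user associations need explicit class_name/foreign_key because
  # the association names don't match the User class name.
  belongs_to :create_user, class_name: 'User', foreign_key: 'create_user_id'
  belongs_to :update_user, class_name: 'User', foreign_key: 'update_user_id'
end
```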
So, for example, in the controller we’re getting all the books:
@books = Book.all
And in the index view we’re getting the decode of the book type:
<td><%= book.book_type.name %></td>
What we see in the development logs is that Rails goes to the database first for all the books – we see:
select * from books;
We then see it go back to the database 1,000 times for each associated table – basically once each time it renders data from a subtable:
select * from book_types where book_types.id = 1;
This is expected behaviour when using lazy loading. It is brilliant if you only rarely get those sub-entities, but the performance isn’t flash if you get all of them. What we really want to do is to get all the book_type data up-front, so we want a query that pushes all the work onto the database:
select * from books, book_types where books.book_type_id = book_types.id
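The N+1 pattern can be illustrated with a small pure-Ruby sketch – no Rails involved, and FakeDb is an invented stand-in that just counts the queries issued:

```ruby
# A pure-Ruby illustration of why lazy loading issues N+1 queries:
# one query for the parent list, then one more per row for the child.
class FakeDb
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  # Pretend to run some SQL; we only care about counting the calls.
  def run(_sql)
    @query_count += 1
  end
end

db = FakeDb.new

db.run('select * from books')  # 1 query for all the books
book_count = 1000
book_count.times do |i|
  # 1 query per book as each row's book_type is rendered
  db.run("select * from book_types where book_types.id = #{i}")
end

puts db.query_count  # 1001 queries for 1000 books – the "N+1" problem
```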
So we did some web research and found some very good Rails documentation. This problem is known as the N+1 query problem.
The first step is to try eager loading. Eager loading is a mechanism designed for exactly this situation – it tells Rails that you want specific subtables loaded at initial query time, because you expect to use the data. The syntax in the controller is a little different:
@books = Book.all.includes(:book_type, :book_weight, :create_user, :update_user)
This time, when we review the logs we see some improvement. The SQL executed now looks like:
select * from books;
select * from book_types where id in (1, 2);
select * from book_weights where id in (2, 3);
select * from users where id in (1, 2, 3, 4);
My interpretation is that Rails has fetched all the books, looked through the list to see which book_types are used, gone back to the database to get those two book types, then gone through the @books list and attached them – and so on for the other tables.
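That batching behaviour can be sketched in plain Ruby – this is my reading of the logs, not ActiveRecord’s actual implementation:

```ruby
# Simulated books rows as plain hashes – illustrative data only.
books = [
  { id: 1, book_type_id: 1 },
  { id: 2, book_type_id: 2 },
  { id: 3, book_type_id: 1 },
]

# Collect the distinct foreign keys, then fetch them all in one
# IN query instead of one query per row.
type_ids = books.map { |b| b[:book_type_id] }.uniq
sql = "select * from book_types where id in (#{type_ids.join(', ')})"
puts sql  # one query covers every book_type the page will render
```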
The database time taken in this process is way down – the queries are very fast. But there seems to be a lot of compute time disappearing, as the page still takes 14s to render. Perhaps there’s a way to profile it and work out where the time goes, but my conclusion is that I probably don’t want to do it this way.
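If you did want to chase that time, one simple option (my suggestion – I haven’t actually profiled it this way) is Ruby’s built-in Benchmark module wrapped around the suspect code:

```ruby
require 'benchmark'

# Time an arbitrary block. In the real app this would wrap the query
# or the render step; here a dummy loop stands in for the work.
elapsed = Benchmark.realtime do
  100_000.times { |i| i * i }
end

puts format('%.4fs', elapsed)
```

Comparing the elapsed time of the query against the whole request would show whether the time is in the database or in Rails itself.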
The next possibility, again from the same Rails documentation, is to use joins. As documented, joins will do exactly what I want – retrieve everything in one query. So we change our syntax as follows:
@books = Book.all.joins(:book_type, :book_weight, :create_user, :update_user)
The problem is that a joins clause built from associations like this only generates inner joins. If any of your columns are optional this doesn’t work, because rows with null foreign keys are dropped. In my case the test data had update_user null on every row – so I ended up with an empty result set!
So we use the slightly different syntax for outer joins. We also note that associations that refer to the same table (we refer to users twice) need a little special treatment – aliasing the table so as not to confuse the SQL:
@books = Book.all.joins(:book_type, 'LEFT OUTER JOIN book_weights on books.book_weight_id = book_weights.id', 'LEFT OUTER JOIN users as create_user on books.create_user_id = create_user.id', 'LEFT OUTER JOIN users as update_user on books.update_user_id = update_user.id')
This now gives us the query we want – Rails issues the initial query as:
select * from books INNER JOIN book_types on books.book_type_id = book_types.id LEFT OUTER JOIN book_weights on books.book_weight_id = book_weights.id LEFT OUTER JOIN users as create_user on books.create_user_id = create_user.id LEFT OUTER JOIN users as update_user on books.update_user_id = update_user.id;
And this query runs faster – a few hundred milliseconds. The problem is that Rails still goes off and gets the sub-tables separately – so we’re still running another 4,000 queries. Which isn’t really what we had in mind.
The next step is to read the Rails documentation some more, and to decide that we don’t actually need full entities for those subtables – we’re only after the decode column. If we alias that column into our model, then we can get at it. So we make two changes. Firstly, in the controller, we extend the query to:
@books = Book.all.joins(:book_type, 'LEFT OUTER JOIN book_weights on books.book_weight_id = book_weights.id', 'LEFT OUTER JOIN users as create_user on books.create_user_id = create_user.id', 'LEFT OUTER JOIN users as update_user on books.update_user_id = update_user.id').select('books.*, book_types.name as book_type_name, book_weights.name as book_weight_name, create_user.name as create_user_name, update_user.name as update_user_name')
Then, in index.html.erb, we change our variables to use the new aliased column rather than referencing the whole associated object (note the subtle change of ‘.’ to ‘_’):
<td><%= book.book_type_name %></td>
The upshot of all that: the extra queries are gone, and the whole thing returns in about 1.4s, which is pretty good.
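As a final tidy-up, that long join/select chain can be moved out of the controller into a model scope (with_reference_names is my hypothetical name, not anything from Rails):

```ruby
class Book < ActiveRecord::Base
  # Hypothetical scope wrapping the joins + select from the controller,
  # so the query lives in one place and the controller stays short.
  scope :with_reference_names, -> {
    joins(:book_type)
      .joins('LEFT OUTER JOIN book_weights on books.book_weight_id = book_weights.id')
      .joins('LEFT OUTER JOIN users as create_user on books.create_user_id = create_user.id')
      .joins('LEFT OUTER JOIN users as update_user on books.update_user_id = update_user.id')
      .select('books.*, book_types.name as book_type_name, ' \
              'book_weights.name as book_weight_name, ' \
              'create_user.name as create_user_name, ' \
              'update_user.name as update_user_name')
  }
end
```

The controller then reduces to `@books = Book.with_reference_names`.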
Later posts will discuss how you get this into a datatable.