postgresql has grown into the most beloved database in the rails community. and as more and more rails developers start using it, the same common mistakes get made more and more often.
today i want to give some general advice to follow when using postgresql in a rails app.
why should you read this?
your db has grown from a handful of entries to tables holding 10M+ rows?
somehow it’s just not as fast as it used to be?
some queries run for seconds without any apparent reason?
you never bother to get your head around all this database stuff?
well…it’s about time. let’s jump right in:
1. use indices
why?
if you are not using indices you are doing it wrong…very wrong. if your app still runs fine without indices, it’s only because no one is really using it yet, not even you.
what does it do?
an index on a column tells the database that you want to be able to select rows by this attribute in a quick way. the index is kept in a special sorted structure (a b-tree by default), and its frequently used parts usually live in RAM, so lookups are blazing fast.
how do i use it?
indices are simply created inside your migrations.
add_index(:table_name, :column_name, :unique => false)
if it’s not a unique value like a 1:n foreign key
or
add_index(:table_name, :column_name, :unique => true)
if it’s something unique like a token
when do i use it?
as a rule of thumb you create an index over any attribute you want to be able to search for!
also don’t forget to index all the foreign keys; if it ends with ‘_id’ you should index it.
and if you want to order by an integer/string/boolean/datetime column…you know…index.
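put together, these rules of thumb could look like the following sketch of a migration. the table and column names here are just illustrative examples, not from the article’s app:

```ruby
class AddIndicesToPictures < ActiveRecord::Migration
  def self.up
    add_index(:pictures, :account_id, :unique => false)  # foreign key ending in '_id'
    add_index(:pictures, :created_at, :unique => false)  # column used for ordering
    add_index(:pictures, :token, :unique => true)        # unique lookup value
  end

  def self.down
    remove_index(:pictures, :account_id)
    remove_index(:pictures, :created_at)
    remove_index(:pictures, :token)
  end
end
```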
how much faster is it?
as an example i will use a table with circa 3M entries on a very fast database server (like really fast).
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on pictures  (cost=0.00..384380.10 rows=11 width=268) (actual time=4.408..1126.319 rows=4 loops=1)
   Filter: (upload_file_size = 5536)
 Total runtime: 1126.339 ms
(3 rows)
1,126 ms for a single simple select. let’s try a comparable lookup, this time on an indexed column:
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_pictures_on_views on pictures  (cost=0.00..1400.81 rows=326 width=268) (actual time=0.011..0.012 rows=1 loops=1)
   Index Cond: (views = 5536)
 Total runtime: 0.029 ms
(3 rows)
0.029 ms … that’s almost 40,000 times faster…yeah you should probably start using indices now.
how i get these pretty outputs and what they mean will be revealed next.
2. use ‘explain’
why?
understanding what your database does when you ask it something is key. if you know why something takes forever, you can change it. so instead of just guessing what’s wrong, you’ll know.
what does it do?
the explain command of postgresql provides a way to look deep inside the workings of the database server. it gives you all the information you need to understand how a query gets executed.
it displays which indexes and operations were used and how long they took. just scroll up and you can find two examples.
should you be afraid using it?
definitely not! ‘explain’ is your friend, even if you have to use that psql console you usually try to avoid.
how do i use it?
a full foray into everything there is to know about ‘explain’ would be a bit too much here, but the postgresql documentation covers it in depth.
so what do you need? first you have to open a psql console to a database (one similar to, or your actual production server) using the psql command. if you get something about denied access, try psql -h hostname. still no access? try using the username and the password from your config/database.yml.

now that we have the console open, we need a sql query we want to take a closer look at. you can probably find one in your log/development.log.
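for example, a typical connection attempt could look like this. the host, user, and database name below are placeholders; take the real values from your config/database.yml:

```shell
psql my_app_production                         # local socket, current system user
psql -h localhost my_app_production            # over tcp if the socket connection is refused
psql -h localhost -U my_app my_app_production  # with the user from database.yml
```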
ok, we have a query and a console, what now? there are two ways to use ‘explain’: with or without ‘analyze’. with ‘analyze’ the query is actually executed and the real time is measured; without it you just get the planner’s educated guess. of course we really want to know, so:
explain analyze YOUR QUERY GOES HERE;
so for example:
explain analyze select * from pictures where account_id = 24;
if your copy ‘n paste skills didn’t fail you, you’ll get something like this:
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_pictures_on_account_id on pictures  (cost=0.00..242.44 rows=56 width=268) (actual time=0.021..0.061 rows=40 loops=1)
   Index Cond: (account_id = 24)
 Total runtime: 0.078 ms
(3 rows)
so what does this tell us? in this example you can see that an index on ‘account_id’ was used to scan the matching rows of the table, and that this was pretty quick.
let’s see something more complex:
explain analyze SELECT * FROM "pictures" WHERE ("pictures"."deleted" = 'f') AND ("pictures".account_id = 22);

                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_pictures_on_account_id on pictures  (cost=0.00..242.44 rows=41 width=268) (actual time=0.032..0.120 rows=16 loops=1)
   Index Cond: (account_id = 22)
   Filter: (NOT deleted)
 Total runtime: 0.146 ms
(4 rows)
as you can see, first the index on account_id was used, and then a simple filter picked out the rows that are not deleted.
of course we can only begin to scratch the surface here, but i will give you one more example.
let’s look at the select without index again:
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on pictures  (cost=0.00..384380.10 rows=11 width=268) (actual time=4.408..1126.319 rows=4 loops=1)
   Filter: (upload_file_size = 5536)
 Total runtime: 1126.339 ms
(3 rows)
by now you should be able to understand what’s happening here: a sequential scan over all rows is executed, looking for a match on our attribute. this, as you can imagine (and see), is extremely slow.
so whenever you see something like this going on in your query you have found a problem.
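when you do spot a seq scan like that, the fix is usually just the missing index: add it with add_index in a migration, or, for a quick experiment, directly in psql. the index name below is simply what rails would generate for this column:

```sql
CREATE INDEX index_pictures_on_upload_file_size ON pictures (upload_file_size);
```

re-run the explain analyze afterwards and the seq scan should turn into an index scan.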
3. avoid NULLS LAST when sorting
what am i talking about?
you know the problem: at some point in your project’s life you add a column to a model you later want to sort by. and even though you added a default value for it, all the preexisting entries have NULL as the value of this column.
so you go ahead and write something like:
@pictures = Picture.all(:limit => params[:limit], :offset => params[:offset], :order => "score DESC NULLS LAST")
in order not to have all those stupid NULLS on top of your results.
if you recognize this, you’re doing it wrong!
why is it wrong?
short answer: it’s slow as hell.
long answer:
explain analyze select * from pictures order by views asc NULLS LAST limit 50 offset 100;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=137.44..206.15 rows=50 width=646) (actual time=3.084..3.183 rows=50 loops=1)
   ->  Index Scan using index_pictures_on_views on pictures  (cost=0.00..680358.59 rows=495040 width=646) (actual time=1.692..3.170 rows=150 loops=1)
 Total runtime: 3.538 ms
(3 rows)
vs.
explain analyze select * from pictures order by views asc limit 50 offset 100;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=137.44..206.15 rows=50 width=646) (actual time=0.048..0.068 rows=50 loops=1)
   ->  Index Scan using index_pictures_on_views on pictures  (cost=0.00..680358.59 rows=495040 width=646) (actual time=0.012..0.057 rows=150 loops=1)
 Total runtime: 0.093 ms
(3 rows)
yes…that’s almost 40 times faster…and the more NULLs there are, the bigger this difference gets.
how to avoid this?
first of all you should always define a default value for all the attributes you want to be able to sort by.
class CreatePictures < ActiveRecord::Migration
  def self.up
    create_table :pictures do |t|
      t.integer :account_id
      t.integer :views, :default => 0
      t.integer :score, :default => 0

      t.timestamps
    end
  end

  def self.down
    drop_table :pictures
  end
end
but what if you add the attribute later? then you should update all the existing rows containing NULL to the default value.
so you could just:
class AddLikesToPicture < ActiveRecord::Migration
  def self.up
    add_column :pictures, :likes, :integer, :default => 0
    add_index(:pictures, :likes, :unique => false)
    Picture.update_all('likes = 0') # warning! this will lock your table!
  end

  def self.down
    remove_column :pictures, :likes
  end
end
but…there is a rather big caveat here. if your table is big (1M+ entries) this update will lock up your table and block any access, leaving your app stuck until the migration is finished.
the problem is that migrations run inside a transaction, which is usually a good idea, but in this case we need to work around this limitation.
depending on how big your table is, you either need to define a sql function and execute it for chunks of the entries, or create a rake task doing the same thing in ruby.
since the first way is a bit out of scope, i’ll show the second one.
this should work for tables up to about 10M entries.
lib/tasks/set_default_values_to_model.rake
task "set_default_values_to_model" do
  require "./config/environment"

  # find_in_batches yields arrays of records (find_each would yield one record
  # at a time), so each update_all statement only touches one batch of ids
  Picture.find_in_batches(:select => 'id', :batch_size => 10000) do |pics|
    Picture.update_all('likes = 0', ['id IN (?)', pics.map(&:id)])
  end
end
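to make the batching idea itself concrete without rails, here is a minimal ruby sketch of what batched iteration effectively does with the ids. the numbers are purely illustrative:

```ruby
# a minimal, rails-free sketch of the batching idea:
# instead of one giant update, the ids are walked in fixed-size slices,
# so each statement only touches (and locks) a small part of the table.
def in_batches(ids, batch_size)
  ids.each_slice(batch_size).to_a
end

batches = in_batches((1..25_000).to_a, 10_000)
# 25,000 ids split into slices of 10,000 give three batches (10,000 + 10,000 + 5,000)
```

the rake task would then issue one update statement per batch instead of one for the whole table.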
so now you know how to avoid NULL when sorting ;)
4. avoid sorting by more than one criterion (or use a special index)
why shouldn’t i do it?
as you may have guessed already…it’s slow.
let me show you:
explain analyze select * from pictures order by views asc limit 10;

                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..13.74 rows=10 width=646) (actual time=0.063..0.071 rows=10 loops=1)
   ->  Index Scan using index_pictures_on_views on pictures  (cost=0.00..680358.59 rows=495040 width=646) (actual time=0.062..0.070 rows=10 loops=1)
 Total runtime: 0.108 ms
(3 rows)
vs.
explain analyze select * from pictures order by views asc, score asc limit 10;

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36710.04..36710.06 rows=10 width=646) (actual time=919.725..919.728 rows=10 loops=1)
   ->  Sort  (cost=36710.04..37947.64 rows=495040 width=646) (actual time=919.724..919.727 rows=10 loops=1)
         Sort Key: views, score
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Seq Scan on pictures  (cost=0.00..26012.40 rows=495040 width=646) (actual time=505.783..830.645 rows=495040 loops=1)
 Total runtime: 921.776 ms
(6 rows)
yes…that much slower…really.
is there another way?
yes, as frank pointed out in the comments, our old friend index will help here too.
so we define an index across the two fields in a migration.
add_index(:pictures, [:views, :score], :unique => false)
and see how much faster this is:
explain analyze select * from pictures order by views asc, score asc limit 10;

                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..11.75 rows=10 width=646) (actual time=0.020..0.028 rows=10 loops=1)
   ->  Index Scan using index_pictures_on_views_and_score on pictures  (cost=0.00..581644.49 rows=495040 width=646) (actual time=0.019..0.026 rows=10 loops=1)
 Total runtime: 0.049 ms
(3 rows)
as you can see, if you want to sort by more than one criterion, use an index across all the columns in question.
thanks frank for the good idea.
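one caveat worth knowing: the order of the columns in a composite index matters. an index on (views, score) can serve a sort on views first, score second, but not the reverse:

```sql
-- matches the index column order, so the index can provide the sort:
SELECT * FROM pictures ORDER BY views ASC, score ASC LIMIT 10;

-- reversed order cannot use index_pictures_on_views_and_score for sorting:
SELECT * FROM pictures ORDER BY score ASC, views ASC LIMIT 10;
```

so create the composite index with the columns in the same order you sort by.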
summary
i hope you learned something new and that i got you a little more in touch with your database server.
of course there is much more to learn if you want maximum performance from your postgresql, but that is for another blog post.
so far,
have fun.