Saturday, December 30, 2006

Getting random rows from a database table

Selecting random rows from a table in your database is generally useful for two things: grabbing one or more rows to display or use somehow, and selecting a random subset of your rows for some sort of statistical analysis. While the standard way of using ORDER BY RANDOM() is occasionally useful, it is very slow, non-repeatable, and does not scale well. I'll demonstrate some better methods to get random rows.

For this article, I'll be using a table named mydata which contains ten million rows of data, and a primary key named id, which is of type bigint.

First, it's important to distinguish between random and unordered. If you simply pull rows from your table without an ORDER BY clause, they may appear random, but they are not: they are simply in an undefined order. In PostgreSQL, they will be roughly in an order related to the last time they were updated or inserted. However, by "random" we really mean that any row in the table has as much chance of appearing as any other row within our SELECT statement. We'll need some way to accomplish this in SQL.
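
As a quick illustration, this query pulls three rows with no ORDER BY clause:

SELECT id FROM mydata LIMIT 3;

The rows that come back depend on physical storage order, so repeated runs will often return the same "random-looking" rows, and recently inserted or updated rows tend to cluster together. That is not randomness.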


ORDER BY RANDOM()

How do we get something "random" into our query? Every modern computer has some way of generating a random number, and PostgreSQL has a way as well: the built-in RANDOM() function, which generates a double precision number from 0.0 to 1.0:

SELECT RANDOM() FROM generate_series(1,5);

Running the above yields something like this:

random
-------------------
0.739388620946556
0.706028273329139
0.150622034911066
0.396196706686169
0.412912936415523

(Note the use of the nifty new generate_series() function to repeat a SQL command a certain number of times).

PostgreSQL also allows you to use RANDOM() in the ORDER BY clause, which is one way to get a random row from the database. Let's pull out three random values from our test table:

SELECT id FROM mydata ORDER BY RANDOM() LIMIT 3;

id
---------
3635389
9417084
8385175

This appears to work just fine, but it has a major drawback: it does not scale, and gets extremely slow as the table size increases. This is a consequence of how ORDER BY RANDOM() works: it basically assigns a random number to every row in the table, orders the entire table by those random numbers, and then returns the rows you want. For small tables this is not much of a problem, but it is a terrible solution as tables grow in size. Here's a breakdown of speeds on my system for grabbing a single row with the query SELECT id FROM mydata ORDER BY RANDOM() LIMIT 1:

Number of rows                   Time to run
-------------------------------  ----------------
One thousand (1,000)             3 milliseconds
Ten thousand (10,000)            40 milliseconds
One hundred thousand (100,000)   Half a second
One million (1,000,000)          7 seconds
Ten million (10,000,000)         149 seconds
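
Those timings make sense once you look at the query plan. Stripped of the cost numbers (which will vary from system to system), the plan shape looks something like this:

EXPLAIN SELECT id FROM mydata ORDER BY RANDOM() LIMIT 1;

QUERY PLAN
----------------------------------------
 Limit
   ->  Sort
         Sort Key: (random())
         ->  Seq Scan on mydata

The Seq Scan reads every row, and the Sort must order all ten million random values before the Limit can hand back even one row.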

Fortunately, there are much better ways to obtain random rows. There are two basic approaches to take - we can pick randomly from a range of values, or we can store a random number inside the table itself.


Range of Values

Let's keep using our mydata table, which has a primary key of id. If we know enough information about a column in the database, we can use that to get random rows by picking random values of that column. In this example, all we need to know is the minimum and maximum value of the id column, and we can have an external program generate a random number between the minimum and the maximum and put it into a query:

SELECT * FROM mydata WHERE id = '4012341';
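
Finding the bounds to feed the external random number generator is a cheap query, since MIN and MAX on an indexed column can use the index:

SELECT MIN(id), MAX(id) FROM mydata;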

We can also have the database help us choose the number, if we know the maximum id is 10 million:

SELECT * FROM mydata
WHERE id =
(SELECT (RANDOM() * 10000000)::int OFFSET 0)
LIMIT 1;

Both queries run in under a second, as the primary key column id is indexed. (The use of OFFSET 0 is needed in the second query to force the planner to evaluate RANDOM() only once).
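
To see why some fencing is needed at all, consider the variant where you skip the subselect entirely and put RANDOM() straight into the WHERE clause:

SELECT * FROM mydata WHERE id = (RANDOM() * 10000000)::int;

Because RANDOM() is volatile, it is re-evaluated for every row scanned, so each id is compared against a different random number, and the query can return zero, one, or several rows.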

There are a few problems with this approach, however. One obvious problem is that the query above may fail if there are any "holes" in the range of numbers from min to max. Storing information about where the holes are is probably impractical, but we can get around them by finding the value that is closest to the random number we picked, like this:

SELECT * FROM mydata
WHERE id >= '4012341'
LIMIT 1;

While that query addresses the problem of holes, it has two additional problems: it does not guarantee that the same row is returned each time, and it sometimes runs very, very slowly. Running EXPLAIN on the query shows us the reason for the speed difference:

QUERY PLAN
----------------------------------------
 Limit  (cost=0.00..0.02 rows=1)
   ->  Seq Scan on mydata
       (cost=0.00..223040.00 rows=9996117)
         Filter: (id >= 4012341::bigint)

The index is not being used. As a good rule of thumb, never use a LIMIT without an ORDER BY clause. Let's add one in, which will solve both of our problems. The index will be used, and the results will be predictable:

SELECT * FROM mydata
WHERE id >= '4012341'
ORDER BY id LIMIT 1;
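
With the ORDER BY in place, EXPLAIN confirms the index is now used (assuming the primary key index has its default name, mydata_pkey; costs omitted):

EXPLAIN SELECT * FROM mydata
WHERE id >= '4012341'
ORDER BY id LIMIT 1;

QUERY PLAN
----------------------------------------
 Limit
   ->  Index Scan using mydata_pkey on mydata
         Index Cond: (id >= 4012341::bigint)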

This strategy of using ">= (value) ORDER BY (column) LIMIT 1" is one we will use a lot from this point forward.

Another problem is that we are not guaranteed to get the number of rows that we want. For example:

SELECT * FROM mydata
WHERE id >= '9999999'
ORDER BY id LIMIT 100;

This will only return 2 rows since our sample data has a maximum id of ten million. There are two ways around this problem: you can re-run the query with a new random number until you get the number of random rows you need, or you can adjust your random number (or your table) to make sure that you always have at least that many. For example, if your data has 100 rows and you want to pull 10 of them at random, then make sure you never ask for an id of more than 90. Alternatively, you could "pad" your table with 10 extra rows, and then safely use the numbers 1-100.
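
Here is a sketch of that adjustment, assuming ids run from 1 to ten million with no large holes near the top: cap the random starting point so that at least 100 rows always lie at or above it.

SELECT * FROM mydata
WHERE id >=
(SELECT (RANDOM() * (10000000 - 100))::int + 1 OFFSET 0)
ORDER BY id LIMIT 100;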

There is one final problem: picking our own random values from a range will not produce truly random rows unless the data is perfectly uniformly distributed. Consider a table with two rows, with id values of 1 and 10. Our strategy above would cause the 10 to appear far more often than the 1, which is not the randomness we are looking for. In addition to holes, if the values are not unique, the distribution may not be uniform, and we once again lack true randomness. We need a way to combine the true randomness of ORDER BY RANDOM() with the speed of a Range of Values.
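
You can see the skew with a tiny throwaway table (skewdemo is just an illustrative name):

CREATE TABLE skewdemo (id int PRIMARY KEY);
INSERT INTO skewdemo VALUES (1), (10);

SELECT * FROM skewdemo
WHERE id >= (SELECT floor(RANDOM() * 10)::int + 1 OFFSET 0)
ORDER BY id LIMIT 1;

The subselect picks a whole number from 1 to 10 with equal probability, but the row with id 1 is only returned when that number is exactly 1; the row with id 10 wins the other nine times out of ten.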


Random Column

The final and best solution is to create a new column in your table that stores random values. The table can then be sorted by this column to get back random rows in a fast, repeatable, and truly random way. What we are basically doing is emulating the effect of ORDER BY RANDOM(), which as you recall creates a random value for each row in the database. Let's apply it to our test table.

First, we create a new column to hold the random values. Since RANDOM() returns the type "double precision", we create a new column of that type. We'll name it myrand:

ALTER TABLE mydata ADD myrand DOUBLE PRECISION;

Now we can populate that column with a random number from 0.0 to 1.0:

BEGIN;
UPDATE mydata SET myrand = RANDOM();
COMMIT;

This does take a non-trivial amount of time to run (372 seconds to populate all ten million rows), but it is a one-time cost. Since we'll be hitting this column to generate our random rows, we should put an index on it as well. But before we do that, we also have to ensure that our results are reproducible. In other words, the same query should return the exact same rows. Something like this is not guaranteed to return the same 10 rows each time it is run:

SELECT * FROM mydata
ORDER BY myrand LIMIT 10;

Why? Because there is no unique constraint on the myrand column, and it is possible (especially with our 10 million row example table) that two rows contain the same myrand value. As another rule of thumb, always make sure your ORDER BY clause specifies a unique set of rows. Our primary key, "id", is unique, so it makes a good tiebreaker for when two myrand values happen to be the same. Our new query becomes:

SELECT * FROM mydata
ORDER BY myrand, id LIMIT 10;

Now we can create the index, on both of the columns in that ORDER BY. For good measure, we'll analyze the table as well:

CREATE INDEX myrand_randomhelp ON mydata(myrand,id);

ANALYZE VERBOSE mydata;

Before the index was in place, the query to grab a random row took over 180 seconds. Now that it is in place, the query runs in less than 1 second (126 milliseconds).
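
A quick EXPLAIN shows why: the full-table sort is replaced by a walk of the new index (costs omitted):

EXPLAIN SELECT * FROM mydata
ORDER BY myrand, id LIMIT 10;

QUERY PLAN
----------------------------------------
 Limit
   ->  Index Scan using myrand_randomhelp on mydata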

So that's our basic "Random Column" strategy: assign each row a random number, make sure it is linked to another unique column, and make an index across both of them. This allows us to get fast, repeatable, and truly random rows. You can also ensure that new rows get a new random value automatically added to them by doing this:

ALTER TABLE mydata
ALTER myrand SET DEFAULT RANDOM(),
ALTER myrand SET NOT NULL;
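
Now a plain insert picks up a fresh random value automatically. As a quick check (assuming, for illustration, that id is the only other required column in the table):

INSERT INTO mydata (id) VALUES (10000001);

SELECT myrand FROM mydata WHERE id = 10000001;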

If you don't care about repeatability, and simply want to grab a random row, you can do this:

SELECT * FROM mydata
WHERE myrand >= (SELECT RANDOM() OFFSET 0)
ORDER BY myrand ASC LIMIT 1;

The ORDER BY clause is needed to ensure that our index is used. Note that Postgres has no problem using the index we created earlier on both columns, because myrand is the first column in that index. The above query is basically what Wikipedia uses when you click on the "Random Page" link.
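
One caveat about this shortcut: if the RANDOM() in the subselect lands above the largest myrand value in the table, the query returns no rows at all. With ten million rows that is vanishingly rare, but on small tables a sensible fallback sketch is to simply wrap around to the smallest value:

SELECT * FROM mydata
WHERE myrand >= (SELECT RANDOM() OFFSET 0)
ORDER BY myrand ASC LIMIT 1;

-- if the above returned no rows, wrap around to the start:
SELECT * FROM mydata ORDER BY myrand ASC LIMIT 1;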

Another advantage of using a Random Column is that not only are the results reproducible, they are also resettable. Let's say that you are using this method to pull 100 random rows at a time out of a table with 1000 rows for statistical analysis. You also want to make sure that you never use the same row more than once, so you use an OFFSET:

SELECT * FROM mydata ORDER BY myrand, id
LIMIT 100 OFFSET 0;
SELECT * FROM mydata ORDER BY myrand, id
LIMIT 100 OFFSET 100;
SELECT * FROM mydata ORDER BY myrand, id
LIMIT 100 OFFSET 200;
etc...

(Note: OFFSET starts at 0, so the first query begins at the very first row, and each query picks up exactly where the previous one left off.) At some point, you may want to run more tests, but you don't want the same grouping as before. In other words, you want to reshuffle the deck of cards. Simple enough: just assign new values to the myrand column:

UPDATE mydata SET myrand = RANDOM();

The only drawback to the whole Random Column strategy is the time and effort it takes to set up, and the additional disk space needed for the extra column. Because of the extra column and its default, INSERTs and UPDATEs may run slightly slower.

Here's a summary of the three strategies to grab some random rows from a table:

ORDER BY RANDOM()
  Pros:
  • Completely random
  • No table changes needed
  • Easy to append to existing queries
  Cons:
  • Very slow: does not scale
  • Non-repeatable result sets
  When to use: Quick ad-hoc queries and very small tables that will not grow large

Range of Values
  Pros:
  • Very fast
  • Uses existing columns
  Cons:
  • Not truly random
  • Must track minimum and maximum
  • Hard to get desired number of rows
  • Inserts can affect results
  When to use: When data is very well-defined and stable (even then, be cautious)

Random Column
  Pros:
  • Very fast
  • Truly random
  • Reproducible
  • Automatically maintained
  • Resettable
  • Column data type matches random() for easy use
  Cons:
  • Takes up disk space
  • Initial setup cost
  • Inserts may be slowed by the extra column and the default value
  When to use: Whenever possible