Tuesday, January 30, 2007

Application Design for PostgreSQL Performance

Query Writing Rules

For all database management systems (DBMSes), "round-trip" time is significant. This is the amount of time it takes a query to get through the language parser, the driver, across the network interface, the database parser, the planner, and the executor, and then for the results to travel back across the network interface, through the driver's data handler, and into the client application. DBMSes vary in the amount of time and CPU they take to process this cycle, and for a variety of reasons PostgreSQL is at the high end of time and system resources per round-trip.

Further, PostgreSQL has significant per-transaction overhead, including log output and the visibility rules which must be set for each transaction. While you may think that you are not using transactions for a singleton read-only SELECT statement, in fact every single statement in PostgreSQL runs in a transaction. In the absence of an explicit transaction, each statement is wrapped in its own implicit transaction.
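For example, these two forms do the same transactional work; the accounts table here is just a stand-in for your own schema:

    -- implicit transaction: PostgreSQL wraps the statement for you
    SELECT balance FROM accounts WHERE id = 1;

    -- explicit transaction: the same statement with the wrapping spelled out
    BEGIN;
    SELECT balance FROM accounts WHERE id = 1;
    COMMIT;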

Offsetting this, PostgreSQL is a close second to Oracle in processing large, complex queries, and it handles complex multi-statement transactions with overlapping concurrency conflicts with ease. We also support cursors, both scrollable and non-scrollable.

Tip 1: Never use many small SELECTs when one big query could do the job.

It's common in MySQL applications to handle joins in the application code; that is, by querying the ID from the parent record and then looping through the child records with that ID manually. This can result in running hundreds or thousands of queries per user interface screen. Each of these queries carries 2-6 milliseconds of round-trip time, which doesn't seem significant until you multiply it by 1000 queries, at which point you're losing 2 to 6 seconds to round trips alone. Comparatively, retrieving all of those records in a single query takes only a few hundred milliseconds, a time savings of 80% or more.
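As a sketch of the difference (the orders and order_items tables here are hypothetical), the loop-in-middleware approach looks like this:

    -- one query for the parent, then one query per child row
    SELECT id FROM orders WHERE customer_id = 42;
    SELECT * FROM order_items WHERE order_id = 1001;
    SELECT * FROM order_items WHERE order_id = 1002;
    -- and so on, once for every order returned

while a single JOIN does the whole job in one round trip:

    SELECT o.id, i.*
      FROM orders o
      JOIN order_items i ON i.order_id = o.id
     WHERE o.customer_id = 42;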

Tip 2: Group many small UPDATEs, INSERTs or DELETEs into large statements, or failing that, into large transactions.

First, the lack of subselects in early versions of MySQL has led application developers to design their data modification (DML) statements in much the same way as joins-in-middleware. This is just as bad an approach for PostgreSQL. Instead, you want to take advantage of subselects and joins in your UPDATE, INSERT and DELETE statements to modify batches of rows in a single statement. This reduces round-trip time and transaction overhead.
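For instance, rather than looping over customer IDs in the application, a single UPDATE with a subselect (again, hypothetical tables) modifies the whole batch at once:

    -- deactivate every account belonging to a lapsed customer,
    -- in one statement and one round trip
    UPDATE accounts
       SET status = 'inactive'
     WHERE customer_id IN
           (SELECT id FROM customers
             WHERE last_login < '2006-01-01');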

In some cases, however, no single query can write all the rows you need, and you have to run a series of statements. In that case, make sure to wrap your series of DML statements in an explicit transaction (e.g. BEGIN; UPDATE; UPDATE; UPDATE; COMMIT;). This reduces transaction overhead and can cut execution time by as much as 50%.
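A minimal sketch, again with made-up tables:

    -- three statements, but only one transaction's worth of overhead;
    -- as a bonus, the transfer is atomic
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    INSERT INTO transfers (from_id, to_id, amount) VALUES (1, 2, 100);
    COMMIT;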

Tip 3: Consider bulk loading instead of serial INSERTs

PostgreSQL provides a bulk loading mechanism called COPY, which takes tab-delimited or CSV input from a file or pipe. Where COPY can be used instead of hundreds or thousands of INSERTs, it can cut execution time by up to 75%.
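For example, assuming a CSV file whose columns match a hypothetical order_items table:

    -- server-side load; the file must be readable by the server,
    -- and reading from a server file requires superuser rights
    COPY order_items FROM '/tmp/order_items.csv' WITH CSV;

    -- or, from psql, a client-side load over the connection
    \copy order_items FROM 'order_items.csv' WITH CSV

Since server-side COPY reads the file as the database server process, psql's \copy is usually the more practical form when the data lives on the client machine.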
