pyLabs!: September 2006

Friday, September 29, 2006

How to disable system beep in windows

Wednesday, September 27, 2006

Changing your default shell

One of the neat things about *nix is the ability to work with many different shells. As with anything else in the *nix world there is bound to be heated debates over which shell is better. Whether you want to use good ol' sh or (my personal favorite) bash you can change your default shell using the commands below.

Do a cat on /etc/shells. The output will show you the location of the binaries for all the shells installed on your system. IT IS EXTREMELY IMPORTANT THAT YOU KNOW EXACTLY WHERE THE SHELL YOU WANT TO USE IS!:
cat /etc/shells
Now use the chsh command to change your default shell. On Free and OpenBSD the command to change your default shell to bash is (this may differ if your on another *nix variant so read your man page for chsh):
chsh -s /path/to/bash
If you don't know what variant you are using do this to find out:
uname -a

Monday, September 25, 2006

The traffic within Amazon EC2 and S3 is free

The traffic within Amazon EC2 and S3 is free, so you can have setups as funky as you wish. Remeber what it takes to build Flickr or Livejournal datacenter? Now you can do similar setups from home (unbelievable) and just let Amazon take care of the networking and hardware. This is so much more ‘WebOS’ than Google’s walled garden.

I applaud Amazon.

Please note that EC2 is not limited to web hosting applications, far from it. It makes even more sense to use it for virtual render farms, to run simulations and other tasks that require a lot of computing power but are usually executed only once in a while. So if you need for ex. 100 instance-hours to complete the computation, you can make your own cluster of 20 machines of similar power (will cost about $10000 for hardware alone) and complete the task in 5 hours, you can use EC2 to create this virtual cluster, compute and then shut it down when done and pay much less — $10 per job. Or you could use EC2 to create a 200 machine virtual cluster, complete the job in half an hour and pay the same $10 for it. Think about that.

Sunday, September 24, 2006

Python quick reference

http://www.geocities.com/labspython/pqrc-2.4-a4-latest.pdf

Saturday, September 23, 2006

Python autocomplete and history

Title: Completer with history viewer support and more features
Category:

Description:

This module let "tab" key can indent and completing valid python identifiers, keywords, and filenames.
This module support history view also.

Note1: original python rlcompleter module only avail in unix-like environment
Note2: if you seek a more simple completer you could try Jian Ding Chen's recipe at: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/496812

Source: Text Source

# History.py
#
# Store the file "History.py"
# in site-packages directory like "/usr/lib/python2.4/site-packages,"
# or a directory pointed by PYTHONPATH environment varable,
# or your home directory.
#
# Insert the line, "import History" to "~/.pystartup" file,
# and set an environment variable to point to it:
# "export PYTHONSTARTUP=${HOME}/.pystartup" in bash.
#
# References:
#     Guido van Rossum. Python Tutorial. Python Sfotware Foundation, 2005. 86
#     Jian Ding Chen. Indentable rlcompleter. Python Cookbook Recipe 496812
#     Guido van Rossum. rlcompleter.py. Python Sfotware Foundation, 2005
#
# 2006.06.29 Sunjoong LEE 
#
__author__ = 'Sunjoong LEE '
__date__ = '2006-06-29'
__version__ = '1.0'


from atexit import register
from itertools import count as icount, ifilter, imap
from os import listdir, remove
from os.path import exists, expanduser, split as psplit
from readline import clear_history, get_current_history_length, \
    get_completer_delims, get_history_item, get_line_buffer, \
    insert_text, parse_and_bind, read_history_file, redisplay, \
    set_completer, set_completer_delims, set_history_length, \
    set_pre_input_hook, write_history_file
from pwd import getpwall
from rlcompleter import Completer
from tempfile import mktemp

import __main__


historyPath = expanduser('~/.pyhistory')
HISTORY_LENGTH = 100




class History:
   def __init__(self):
       self.recall()
       set_history_length(HISTORY_LENGTH)

       parse_and_bind('tab: complete')
       # delims = get_completer_delims()
       delims = ' \t\n`!@#$%^&*()-=+[{]}\\|;:,<>?'
       set_completer_delims(delims)
       set_completer(irlcompleter().complete)


   def __repr__(self):
       """print out current history information"""
       command = get_history_item(get_current_history_length())
       if command == 'history':
           length = get_current_history_length()
           if length > 1:
               return reduce(lambda x, y: '%s\n%s' % (x, y),
                             imap(get_history_item, xrange(1, length)))
           else:
               return ''
       else:
           return '<%s instance>' % __name__


   def __call__(self):
       """print out current history information with line number"""
       length = get_current_history_length()
       if length > 1:
           kount = icount(1).next
           for command in imap(get_history_item, xrange(1, length)):
               print '%s\t%s' % (kount(), command)


   def save(self, filename, pos = None, end = None):
       """write history number from pos to end into filename file"""
       length = get_current_history_length()
       if length > 1:
           if not pos:
               pos = 1
           elif pos >= length - 1:
               pos = length - 1
           elif pos < 1:
               pos = length + pos - 1
           if not end:
               end = length
           elif end >= length:
               end = length
           if end < 0:
               end = length + end
           else:
               end = end + 1

           fp = open(filename, 'w')
           write = fp.write
           if pos < end:
               map(lambda x: write('%s\n' %  x),
                   imap(get_history_item, xrange(pos, end)))
           else:
               write('%s\n' % get_history_item(pos))
           fp.close()


   def clear(self):
       """save the current history and clear it"""
       write_history_file(historyPath)
       clear_history()


   def recall(self, historyPath = historyPath):
       """clear the current history and recall it from saved"""
       clear_history()
       if exists(historyPath):
           read_history_file(historyPath)


   def execute(self, pos, end = None):
       """execute history number from pos to end"""
       length = get_current_history_length()
       if length > 1:
           if pos >= length - 1:
               pos = length - 1
           elif pos < 1:
               pos = length + pos - 1
           if not end:
               end = pos + 1
           elif end >= length:
               end = length
           if end < 0:
               end = length + end
           else:
               end = end + 1

           to_execute = map(get_history_item, xrange(pos, end))

           filename = mktemp()
           fp = open(filename, 'w')
           write = fp.write
           map(lambda x: write('%s\n' % x), to_execute.__iter__())
           fp.close()

           try:
               execfile(filename, __main__.__dict__)
               read_history_file(filename)
               remove(filename)
           except:
               remove(filename)




class irlcompleter(Completer):
   def complete(self, text, state):
       if text == '':
           # you could replace '    ' to \t if you indent via tab
           return ['    ', None][state]
       elif text.count("'") == 1:
           if not state:
               self.file_matches(text, "'")
           try:
               return self.matches[state]
           except IndexError:
               return None
       elif text.count('"') == 1:
           if not state:
               self.file_matches(text, '"')
           try:
               return self.matches[state]
           except IndexError:
               return None
       else:
           return Completer.complete(self, text, state)


   def file_matches(self, text, mark):
       if '~' in text:
           if '/' in text:
               text = '%s%s%s' % (mark, expanduser(
                   text[text.find('~'):text.find('/')]),
                                  text[text.find('/'):])
           else:
               self.user_matches(text, mark)
               return

       text1 = text[1:]
       delim = '/'

       if not text1:
           directory = ''
       elif text1 == '.':
           directory = '.'
       elif text1 == '..':
           directory = '..'
       elif text1 == '/':
           directory = '/'
           delim = ''
       elif text1[-1] == '/':
           directory = text1[:-1]
           delim = text1[len(directory):]
       else:
           directory, partial = psplit(text1)
           delim = text1[len(directory):][:-len(partial)]

       if directory:
           listing = map(lambda x: '%s%s%s%s' % (mark, directory, delim, x),
                         listdir(directory).__iter__())
       else:
           listing = map(lambda x: '%s%s' % (mark, x),
                         listdir('.').__iter__())

       n = len(text)
       self.matches = filter(lambda x: x[:n] == text, listing.__iter__())


   def user_matches(self, text, mark):
       n = len(text)
       self.matches = filter(lambda x: x[:n] == text,
                             imap(lambda x: '%s~%s' % (mark, x[0]),
                                  getpwall().__iter__()))




def save_history(historyPath = historyPath):
   from readline import write_history_file
   write_history_file(historyPath)
register(save_history)


def hook():
   from readline import set_pre_input_hook
   import __main__
   set_pre_input_hook()
   delattr(__main__, 'History')
   delattr(__main__, '__file__')
set_pre_input_hook(hook)
setattr(__main__.__builtins__, 'history', History())

Discussion:

To show an example, I had input as below;
i[tab][tab]m[tab] pwd[return]
pri[tab] pwd.[tab]__do[tab][return]
fp = op[tab]('/[tab][tab]e[tab]/pas[tab]', 'r')[return]
hi[tab][return]

>>> i
i if in int is issubclass
id import input intern isinstance iter

>>> import pwd
>>> print pwd.
pwd.__class__ pwd.__init__ pwd.__str__
pwd.__delattr__ pwd.__name__ pwd.getpwall
pwd.__dict__ pwd.__new__ pwd.getpwnam
pwd.__doc__ pwd.__reduce__ pwd.getpwuid
pwd.__file__ pwd.__reduce_ex__ pwd.struct_passwd
pwd.__getattribute__ pwd.__repr__ pwd.struct_pwent
pwd.__hash__ pwd.__setattr__

>>> fp = open('/
'/bin '/home '/mnt '/sbin
'/boot '/lib '/opt '/tmp
'/dev '/lost+found '/proc '/usr
'/etc '/media '/root '/var

>>> history
import pwd
print pwd.__doc__
fp = open('/etc/passwd', 'r')

ipython bitbucket 7zip and Tabs for Amazon S3

Download ipython
or run ipython
for windows
http://ipython.scipy.org/dist/ipython-0.7.2.win32.exe

Download Bitbucket
http://garnaat.org/bitbucket/BitBucket-0.3e.tar.gz

download 7zip and unzip it
use it to unzip bitbucket
and

right click on mycomputer and goto advanced tab, add c:\python\ or whichever dir python is python24 or python25 to path

and then open a new cmd window goto where bitbucket is unzipped

run python setup.py install

now you ve installed bitbucket, same method of installation for most of python packages

now open ipython

run
import bitbucket

bitbucket.c
press tab

it autocompletes
type it out as
a= bitbucket.connect('Access', 'Secret')

a is a bitbucket object to interact with S3
try out a. press tab

Friday, September 22, 2006

Faster yum?

it’s a ~10 times faster yum metadata parser that is reported to also use a lot less memory. This might make fedora suddenly useable on a whole bunch of my machines. I look forward to trying it out.

Should be really easy:
python setup.py build
sudo python setup.py install --prefix=/usr

(Assuming you python prefix is /usr).This implementation should be ~10 times faster, parsing the
primary.xml file under 1 second usually, filelists.xml under 3 seconds and
other.xml under 4 seconds. It uses a lot less memory as well, some testings
I have done show it uses ~4mb instead of 40mb standard YUM uses.

Wednesday, September 20, 2006

json ajax

http://theidioteque.net/blog/2006/05/18/ajax-harmony/
http://www.dynamicajax.com/fr/JSON_AJAX_Web_Chat-271_290_324.html

Tuesday, September 19, 2006

JSON

AJAX without XML Compares using XML, JavaScript Objects, and JSON

Speeding Up AJAX with JSON Demos how easier it is to reference JSON rather than XML.
JSON AJAX Chat Tutorial Simple tutorial that shows how to create a dynamic web chat
JSON/AJAX/PHP Simple tutorial and code libraries

Monday, September 18, 2006

Batara Kesuma Mysql presentation

For the technically minded readers out there who want to get a look into the technical issues behind running a wildly popular SNS, here’s a link to a presentation given by the CTO at the MySQL Users Conference this year. I believe Batara Kesuma gave this presentation in Japan as well, as the content of the presentation is familiar and was covered by some of the local IT press. (I doubt they attended the conference in Santa Clara)

Tuning PostgreSQL for performance

July 3, 2003 Copyright 2003 Shridhar Daithankar and Josh Berkus.
Authorized for re-distribution only under the PostgreSQL license (see www.postgresql.org/license).

1 Introduction
2 Some basic parameters
   2.1 Shared buffers
   2.2 Sort memory
   2.3 Effective Cache Size
   2.4 Fsync and the WAL files
3 Some less known parameters
   3.1 random_ page_cost
   3.2 Vacuum_ mem
   3.3 max_fsm_pages
   3.4 max  fsm_ relations
   3.5 wal_buffers
4 Other tips
   4.1 Check your file system
   4.2 Try the Auto Vacuum daemon
   4.3 Try FreeBSD
5 The CONF Setting Guide

1 Introduction

This is a quick start guide for tuning PostgreSQL's settings for performance. This assumes minimal familiarity with PostgreSQL administration. In particular, one should know,

How to start and stop the postmaster service
How to tune OS parameters
How to test the changes

It also assumes that you have gone through the PostgreSQL administration manual before starting, and to have set up your PostgreSQL server with at least the default configuration.

There are two important things for any performance optimization:

Decide what level of performance you want
If you don't know your expected level of performance, you will end up chasing a carrot always couple of meters ahead of you. The performance tuning measures give diminishing returns after a certain threshold. If you don't set this threshold beforehand, you will end up spending lot of time for minuscule gains.
Know your load
This document focuses entirely tuning postgresql.conf best for your existing setup. This is not the end of performance tuning. After using this document to extract the maximum reasonable performance from your hardware, you should start optimizing your application for efficient data access, which is beyond the scope of this article.

Please also note that the tuning advices described here are hints. You should not implement them all blindly. Tune one parameter at a time and test its impact and decide whether or not you need more tuning. Testing and benchmarking is an integral part of database tuning.

Tuning the software settings explored in this article is only about one-third of database performance tuning, but it's a good start since you can experiment with some basic setting changes in an afternoon, whereas some other aspects of tuning can be very time-consuming. The other two-thirds of database application tuning are:

Hardware Selection and Setup
Databases are very bound to your system's I/O (disk) access and memory usage. As such, selection and configuration of disks, RAID arrays, RAM, operating system, and competition for these resources will have a profound effect on how fast your database is. We hope to have a later article covering this topic.
Efficient Application Design
Your application also needs to be designed to access data efficiently, though careful query writing, planned and tested indexing, good connection management, and avoiding performance pitfalls particular to your version of PostgreSQL. Expect another guide someday helping with this, but really it takes several large books and years of experience to get it right ... or just a lot of time on the mailing lists.

2 Some basic parameters

3 Some less known parameters

3.1 random_page_cost

This parameter sets the cost to fetch a random tuple from the database, which influences the planner's choice of index vs. table scan. This is set to a high value as the default default based on the expectation of slow disk access. If you have reasonably fast disks like SCSI or RAID, you can lower the cost to 2. You need to experiment to find out what works best for your setup by running a variety of queries and comparing execution times.

3.2 Vacuum_mem

This parameter sets the memory allocated to Vacuum. Normally, vacuum is a disk intensive process, but raising this parameter will speed it up by allowing PostgreSQL to copy larger blocks into memory. Just don't set it so high it takes significant memory away from normal database operation. Things between 16-32MB should be good enough for most setups.

3.3 max_fsm_pages

PostgreSQL records free space in each of its data pages. This information is useful for vacuum to find out how many and which pages to look for when it frees up the space.

If you have a database that does lots of updates and deletes, that is going to generate dead tuples, due to PostgreSQL's MVCC system. The space occupied by dead tuples can be freed with vacuum, unless there is more wasted space than is covered by the Free Space Map, in which case the much less convenient "vacuum full" is required. By expanding the FSM to cover all of those dead tuples, you might never again need to run vacuum full except on holidays.

The best way to set max _ fsm _ pages is interactive; First, figure out the vacuum (regular) frequency of your database based on write activity; next, run the database under normal production load, and run "vacuum verbose analyze" instead of vacuum, saving the output to a file; finally, calculate the maximum total number of pages reclaimed between vacuums based on the output, and use that.

Remember, this is a database cluster wide setting. So bump it up enough to cover all databases in your database cluster. Also, each FSM page uses 6 bytes of RAM for administrative overhead, so increasing FSM substantially on systems low on RAM may be counter-productive.

3.4 max _ fsm _ relations

This setting dictates how many number of relations (tables) will be tracked in free space map. Again this is a database cluster-wide setting, so set it accordingly. In version 7.3.3 and later, this parameter should be set correctly as a default. In older versions, bump it up to 300-1000.

3.5 wal_buffers

This setting decides the number of buffers WAL(Write ahead Log) can have. If your database has many write transactions, setting this value bit higher than default could result better usage of disk space. Experiment and decide. A good start would be around 32-64 corresponding to 256-512K memory.

4 Other tips

4.1 Check your file system

On OS like Linux, which offers multiple file systems, one should be careful about choosing the right one from a performance point of view. There is no agreement between PostgreSQL users about which one is best.

As an easy performance boost with no downside, make sure the file system on which your database is kept is mounted "noatime", which turns off the access time bookkeeping.

4.2 Try the Auto Vacuum daemon

There is a little known module in PostgreSQL contrib directory called as pgavd. It works in conjunction with statistics collector. It periodically connects to a database and checks if it has done enough operations since the last check. If yes, it will vacuum the database.

Essentially it will vacuum the database when it needs it. It would get rid of playing with cron settings for vacuum frequency. It should result in better database performance by eliminating overdue vacuum issues.

4.3 Try FreeBSD

Large updates, deletes, and vacuum in PostgreSQL are very disk intensive processes. In particular, since vacuum gobbles up IO bandwidth, the rest of the database activities could be affected adversely when vacuuming very large tables.

OS's from the BSD family, such as FreeBSD, dynamically alter the IO priority of a process. So if you lower the priority of a vacuum process, it should not chew as much bandwidth and will better allow the database to perform normally. Of course this means that vacuum could take longer, which would be problematic for a "vacuum full."

If you are not done with your choice of OS for your server platform, consider BSD for this reason.

5 The CONF Setting Guide

Available here is an Annotated Guide to the PostgreSQL configuration file settings, in both OpenOffice.org and PDF format. This guide expands on the official documentation and may eventually be incorporated into it.

The first column of the chart is the GUC setting in the postgresql.conf file.
The second is the maximum range of the variable; note that the maximum range is often much larger than the practical range. For example, random_page_cost will accept any number between 0 and several billion, but all practical numbers are between 1 and 5.
The third column contains an enumeration of RAM or disk space used by each unit of the parameter.
The fourth column indicates whether or not the variable may be SET from the PSQL terminal during an interactive setting. Most settings marked as "no" may only be changed by restarting PostgreSQL.
The fifth column quotes the official documentation available from the PostgreSQL web site.
The last column is our notes on the setting, how to set it, resources it uses, etc. You'll notice some blank spaces, and should be warned as well that there is still strong disagreement on the value of many settings.

Users of PostgreSQL 7.3 and earlier will notice that the order of the parameters in this guide do not match the order of the parameters in your postgresql.conf file. This is because this document was generated as part of an effort to re-organize the conf parameters and documentation; starting with 7.4, this document, the official documentation, and the postgresql.conf file are all in the same logical order.

As noted in the worksheet, it covers PostgreSQL versions 7.3 and 7.4. If you are using an earlier version, you will not have access to all of these settings, and defaults and effects of some settings will be different.

PostgreSQL 8.0 Performance Checklist

This is a set of rules of thumb for setting up your PostgreSQL 8.0 server. A lot of the below is based on anecdotal evidence or practical scaling tests; there's a lot about database performance that we, and OSDL, are still working out. However, this should get you started. All information below is useful as of January 12, 2005 and will likely be updated later. Discussions of settings below supercede the recommendations I've made on General Bits.

Five Hardware Principles of Setting Up Your PostgreSQL Server

Disks > RAM > CPU
If you're going to throw money at a PostgreSQL server, throw it at high-performance disk arrays and get average processors and adequate RAM. If you have a little more money, get more RAM. PostgreSQL, like other ACID-compliant RDBMSes, is very I/O intensive, and it's a rare application that taxes the CPU more than the SCSI card (some do exist, though). This applies to small servers as well as it applies to eight-ways with NetApps; get the cost-effective CPU if it allows you to buy a high-end RAID card and many disks.
More Spindles == Better
Given multiple disks, PostgreSQL and most operating systems will parallelize read and write requests on the database. This makes an enormous difference on transaction-processing systems, and a significant improvement on any application where the entire database does not fit in RAM. Given today's minimum disk sizes (72GB) you might be tempted to use just one disk, or a single RAID 1 mirrored pair; however, you'll find that using 4, 6, or even 14 disks will yield performance boosts. Oh, and SCSI is still signifiacantly higher DB throughput than IDE, even with Serial ATA.
Seperate the Transaction Log from the Database:
Assuming that you've already forked out the money for a decent-sized array, there are more intelligent options than throwing everything on a single RAID. For one thing, putting the database transaction log (pg_xlog) on its own, dedicated disk resource (an array or plain disk), makes as much as a 12% difference in performance on databases with high write activity. This is especially vital on small systems with slow SCSI or IDE disks: even in a two-disk server, you can put the transaction log onto the operating system disk and reap some benefits.
RAID 1+0/0+1 > RAID 5:
RAID 5 with 3 disks has become an unfortunate standard among large vendor's economy servers. This is possibly the slowest array configuration possible for PostgreSQL; you can expect as little as 50% of the query speed as you would get with a plain SCSI disk. Instead, focus on RAID 1 or 1+0 or 0+1 for any set of 2, 4 or 6 disks. Over 6 disks, RAID 5 starts to perform acceptably again, and the comparison tends to be a lot more on the basis of your individual controller. Perhaps more importantly, a cheap RAID card can be a liability; often it is better to use software RAID than the Adaptec built-in card that came with your server.
Applications must play nice together:
The other big mistake I see a lot of organizations making is putting PostgreSQL together on a server with several other applications which compete for the same resources. Worst among these is putting PostgreSQL and other RDBMSes on the same machine; both database systems will fight over disk bandwith and the OS disk cache, and both will perform poorly. Document servers and security logging programs are almost as bad. PostgreSQL can share a machine with applications which are mainly CPU-and-RAM-intensive, such as Apache, provided that there is enough RAM.

Twelve Settings You'll Want To Adjust in Your PostgreSQL.Conf File

There's a truly frightening amount of new options in the PostgreSQL.conf file. Even once-familiar options from the last 5 versions have changed names and parameter formats. It is intended to give you, the database administrator, more control, but can take some getting used to.

What follows are the settings that most DBAs will want to change, focused more on performance than anything else. There are quite a few "specialty" settings which most users won't touch, but those that use them will find indispensable. For those, you'll have to wait for the book.

Remember: PostgreSQL.conf settings must be uncommented to take effect, but re-commenting them does not necessarily restore the default values!

Connecting

listen_addresses: Replaces both the tcp_ip and virtual_hosts settings from 7.4. Defaults to localhost in most installations, allowing only connections on the console. Many DBAs will want to set this to "*", meaning all available interfaces, after setting proper permissions in the pg_hba.conf file, in order to make PostgreSQL accessable to the network. As an improvment over previous versions, the "localhost" default does permit connections on the "loopback" interface, 127.0.0.1, enabling many server browser-based utilities.

max_connections: exactly like previous versions, this needs to be set to the actual number of simultaneous connections you expect to need. High settings will require more shared memory (shared_buffers). As the per-connection overhead, both from PostgreSQL and the host OS, can be quite high, it's important to use connection pooling if you need to service a large number of users. For example, 150 active connections on a medium-end single-processor 32-bit Linux server will consume significant system resources, and 600 is about the limit on that hardware. Of course, beefier hardware will allow more connections.

Memory

shared_buffers: As a reminder: This figure is NOT the total memory PostgreSQL has to work with. It is the block of dedicated memory PostgreSQL uses for active operations, and should be a minority of your total RAM on the machine, since PostgreSQL uses the OS disk cache as well. Unfortunately, the exact amount of shared buffers required is a complex calculation of total RAM, database size, number of connections, and query complexity. Thus it's better to go with some rules of thumb in allocating, and monitor the server (particuarly pg_statio views) to determine adjustments.
On dedicated servers, useful values seem to be between between 8MB and 400MB (between 1000 and 50,000 for 8K page size). Factors which raise the desired shared buffers are larger active portions of the database, large complex queries, large numbers of simultaneous queries, long-running procedures or transactions, more available RAM, and faster/more CPUs. And, of course, other applications on the machine. Contrary to some expectations, allocating much too much shared_buffers can actually lower peformance, due time required for scanning. Here's some examples based on anecdotes and TPC tests on Linux machines:

Laptop, Celeron processor, 384MB RAM, 25MB database: 12MB/1500
Athlon server, 1GB RAM, 10GB decision-support database: 120MB/15000
Quad PIII server, 4GB RAM, 40GB, 150-connection heavy transaction processing database: 240MB/30000
Quad Xeon server, 8GB RAM, 200GB, 300-connection heavy transaction processing database: 400MB/50000

Please note that increasing shared_buffers, and a few other memory parameters, will require you to modify your operating system's System V memory parameters. See the main PostgreSQL documentation for instructions on this.

work_mem: used to be called sort_mem, but has been re-named since it now covers sorts, aggregates, and a few other operations. This is non-shared memory, which is allocated per-operation (one to several times per query); the setting is here to put a ceiling on the amount of RAM any single operation can grab before being forced to disk. This should be set according to a calculation based on dividing the available RAM (after applications and shared_buffers) by the expected maximum concurrent queries times the average number of memory-using operations per query.
Consideration should also be paid to the amount of work_mem needed by each query; processing large data sets requires more. Web database applications generally set this quite low, as the number of connections is high but queries are simple; 512K to 2048K generally suffices. Contrawise, decision support applications with their 160-line queries and 10 million-row aggregates often need quite a lot, as much as 500MB on a high-memory server. For mixed-use databases, this parameter can be set per connection, at query time, in order to give more RAM to specific queries.

maintenance_work_mem: formerly called vacuum_mem, this is the quantity of RAM PostgreSQL uses for VACUUM, ANALYZE, CREATE INDEX, and adding foriegn keys. You should raise it the larger your database tables are, and the more RAM you have to spare, in order to make these operations as fast as possible. A setting of 50% to 75% of the on-disk size of your largest table or index is a good rule, or 32MB to 256MB where this can't be determined.

Disk and WAL

checkpoint_segments: defines the on-disk cache size of the transaction log for write operations. You can ignore this in mostly-read web database, but for transaction processing databases or reporting databases involving large data loads, raising it is performance-critical. Depening on the volume of data, raise it to between 12 and 256 segments, starting conservatively and raising it if you start to see warning messages in the log. The space required on disk is equal to (checkpoint_segments * 2 + 1) * 16MB, so make sure you have enough disk space (32 means over 1GB).

max_fsm_pages: sizes the register which tracks partially empty data pages for population with new data; if set right, makes VACUUM faster and removes the need for VACUUM FULL or REINDEX. Should be slightly more than the total number of data pages which will be touched by updates and deletes between vacuums. The two ways to determine this number are to run VACUUM VERBOSE ANALYZE, or if using autovacuum (see below) set this according to the -V setting as a percentage of the total data pages used by your database. fsm_pages require very little memory, so it's better to be generous here.

vacuum_cost_delay: If you have large tables and a significant amount of concurrent write activity, you may want to make use of a new feature which lowers the I/O burden of VACUUMs at the cost of making them take longer. As this is a very new feature, it's a complex of 5 dependant settings for which we have only a few performance tests. Increasing vacuum_cost_delay to a non-zero value turns the feature on; use a reasonable delay, somewhere between 50 and 200ms. For fine tuning, increasing vacuum_cost_page_hit and decreasing vacuum_cost_page_limit will soften the impact of vacuums and make them take longer; in Jan Wieck's tests on a transaction processing test, a delay of 200, page_hit of 6 and limit of 100 decreased the impact of vacuum by more than 80% while tripling the execution time.

Query Planner

These settings allow the query planner to make accurate estimates of operation costs, and thus pick the best possible query plan. There are two global settings worth bothering with:

effective_cache_size: tells the query planner the largest possible database object that could be expected to be cached. Generally should be set to about 2/3 of RAM, if on a dedicated server. On a mixed-use server, you'll have to estimate how much of the RAM and OS disk cache other applications will be using and subtract that.

random_page_cost: a variable which estimates the average cost of doing seeks for index-fetched data pages. On faster machines, with faster disks arrays, this should be lowered, to 3.0, 2.5 or even 2.0. However, if the active portion of your database is many times larger than RAM, you will want to raise the factor back towards the default of 4.0. Alternatively, you can base adjustments on query performance. If the planner seems to be unfairly favoring sequential scans over index scans, lower it; if it's using slow indexes when it shouldn't, raise it. Make sure you test a variety of queries. Do not lower it below 2.0; if that seems necessary, you need to adjust in other areas, like planner statistics.

Logging

log_destination: this replaces the unintuitive syslog setting in prior versions. Your choices are to use the OS's administrative log (syslog or eventlog) or to use a seperate PostgreSQL log (stderr). The former is better for system monitoring; the latter, better for database troubleshooting and tuning.

redirect_stderr: If you decide to go with a seperate PostgreSQL log, this setting allows you to log to a file using a native PostgreSQL utility instead of command-line redirection, allowing automated log rotation. Set it to True, and then set log_directory to tell it where to put the logs. The default settings for log_filename, log_rotation_size, and log_rotation_age are good for most people.

Autovacuum and You

As you tumble toward production on 8.0, you're going to want to set up a maintenance plan which includes VACUUMs and ANALYZEs. If your database involves a fairly steady flow of data writes, but does not require massive data loads and deletions or frequent restarts, this should mean setting up pg_autovacuum. It's better than time-scheduled vacuums because:

Tables are vacuumed based on their activity, avoiding vacuuming read-only tables.
The frequency of vacuums grows automatically with increasing database activity.
It's easier to calculate free space map requirements and avoid database bloat.

Setting up autovacuum requires an easy build of the module in the contrib/pg_autovacuum directory of your PostgreSQL source (Windows users should find autovaccuum included in the PGInstaller package). You turn on the stats configuration settings detailed in the README. Then you start autovacuum after PostgreSQL is started as a seperate process; it will shut down automatically when PostgreSQL shuts down.

The default settings for autovacuum are very conservative, though, and are more suitable for a very small database. I generally use something aggressive like:
-D -v 400 -V 0.4 -a 100 -A 0.3
This vacuums tables after 400 rows + 40% of the table has been updated or deleted, and analyzes after 100 rows + 30% of the table has been inserted, updated or deleted. The above configuration also lets me set my max_fsm_pages to 50% of the data pages in the database with confidence that that number won't be overrun, causing database bloat. We are currently testing various settings at OSDL and will have more hard figures on the above soon.

Note that you can also use autovacuum to set the Vacuum Delay options, instead of setting them in PostgreSQL.conf. Vacuum Delay can be vitally important for systems with very large tables or indexes; otherwise an untimely autovacuum call can halt an important operation.

There are, unfortunately, a couple of serious limitations to 8.0's autovacuum which will hopefully be eliminated in future versions:

Has no long-term memory: autovacuum forgets all activity tracking when you restart the database. So if you do regular restarts, you should do a full-database VACUUM ANALYZE right before or after.
Pays no attention to how busy the server is: there were plans to check system load before vacuuming, but that's not a current feature. So if you have extreme load peaks, autovacuum may not be for you.

Performance Tuning PostgreSQL

Introduction

PostgreSQL is the most advanced and flexible Open Source SQL database today. With this power and flexibility comes a problem. How do the PostgreSQL developers tune the default configuration for everyone? Unfortunately the answer is they can't.

The problem is that every database is not only different in its design, but also its requirements. Some systems are used to log mountains of data that is almost never queried. Others have essentially static data that is queried constantly, sometimes feverishly. Most systems however have some, usually unequal, level of reads and writes to the database. Add this little complexity on top of your totally unique table structure, data, and hardware configuration and hopefully you begin to see why tuning can be difficult.

The default configuration PostgreSQL ships with is a very solid configuration aimed at everyone's best guess as to how an "average" database on "average" hardware should be setup. This article aims to help PostgreSQL users of all levels better understand PostgreSQL performance tuning.

Understanding the process

The first step to learning how to tune your PostgreSQL database is to understand the life cycle of a query. Here are the steps of a query:

Transmission of query string to database backend
Parsing of query string
Planning of query to optimize retrieval of data
Retrieval of data from hardware
Transmission of results to client

The first step is the sending of the query string ( the actual SQL command you type in or your application uses ) to the database backend. There isn't much you can tune about this step, however if you have a very large queries that cannot be prepared in advance it may help to put them into the database as a stored procedure and cut the data transfer down to a minimum.

Once the SQL query is inside the database server it is parsed into tokens. This step can also be minimized by using stored procedures.

The planning of the query is where PostgreSQL really starts to do some work. This stage checks to see if the query is already prepared if your version of PostgreSQL and client library support this feature. It also analyzes your SQL to determine what the most efficient way of retrieving your data is. Should we use an index and if so which one? Maybe a hash join on those two tables is appropriate? These are some of the decisions the database makes at this point of the process. This step can be eliminated if the query is previously prepared.

Now that PostgreSQL has a plan of what it believes to be the best way to retrieve the data, it is time to actually get it. While there are some tuning options that help here, this step is mostly effected by your hardware configuration.

And finally the last step is to transmit the results to the client. While there aren't any real tuning options for this step, you should be aware that all of the data that you are returning is pulled from the disk and sent over the wire to your client. Minimizing the number of rows and columns to only those that are necessary can often increase your performance.

General Tuning

There are several postmaster options that can be set that drastically affect performance, below is a list of the most commonly used and how they effect performance:

max_connections = — This option sets the maximum number of database backend to have at any one time. Use this feature to ensure that you do not launch so many backends that you begin swapping to disk and kill the performance of all the children. Depending on your application it may be better to deny the connection entirely rather than degrade the performance of all of the other children.
shared_buffers = — Editing this option is the simplest way to improve the performance of your database server. The default of 1000 is very low for most modern hardware. Each buffer is typically 8192 bytes unless changed at compile time. General wisdom says that this should be set to roughly 10-15% of available RAM on the system. Like most of the options I will outline here you will simply need to try them and see how well it works on your system. I would recommend playing in the 10,000 to 20,000 shared_buffers range. Much more than that and you can actually degrade performance.
work_mem = — This option is used to control the amount of memory using in sort operations and hash tables. While you may need to increase the amount of memory if you do a ton of sorting in your application, care needs to be taken. This isn't a system wide parameter, but a per operation one. So if a complex query has several sort operations in it it will use multiple work_mem units of memory. Not to mention that multiple backends could be doing this at once. This query can often lead your database server to swap if the value is too large. This option was previously called sort_mem in older versions of PostgreSQL.
max_fsm_pages = — This option helps to control the free space map. When something is deleted from a table it isn't removed from the disk immediately, it is simply marked as "free" in the free space map. The space can then be reused for any new INSERTs that you do on the table. If your setup has a high rate of DELETEs and INSERTs it may be necessary increase this value to avoid table bloat.
fsync = — This option determines if all your WAL pages are fsync()'ed to disk before a transactions is committed. Having this on is safer, but can reduce write performance. If fsync is not enabled there is the chance of unrecoverable data corruption. Turn this off at your own risk.
commit_delay = and commit_siblings = — These options are used in concert to help improve performance by writing out multiple transactions that are committing at once. If there are commit_siblings number of backends active at the instant your transaction is committing then the server waiting commit_delay microseconds to try and commit multiple transactions at once.
effective_cache_size = — This value helps PostgreSQL's optimizer to determine how effective the operating system's disk cache is when determining if it should use an index or not. The larger the value increases the likely hood of using an index. You should play with this value based on the amount of disk pages your hardware/operating system combo typically has as disk cache.
random_page_cost = — random_page_cost controls the way PostgreSQL views non-sequential disk reads. A higher value makes it more likely that a sequential scan will be used over an index scan indicating that your server has very fast disks.

Note that many of these options consume shared memory and it will probably be necessary to increase the amount of shared memory allowed on your system to get the most out of these options.

Hardware Issues

Obviously the type and quality of the hardware you use for your database server drastically impacts the performance of your database. Here are a few tips to use when purchasing hardware for your database server:

CPUs — The more CPUs the better, however if your database does not use many complex functions your money is best spent on a better disk subsystem. Also, avoid Intel Xeon processors with PostgreSQL as there is a problem with the context switching in these processors that gives sub-par performance. Opterons are generally accepted as being a superior CPU for PostgreSQL databases.
RAM — The more RAM you have the more disk cache you will have. This greatly impacts performance considering memory I/O is thousands of times faster than disk I/O.
Disk types — Obviously fast Ultra-320 SCSI disks are your best option, however high end SATA drives are also very good. With SATA each disk is substantially cheaper and with that you can afford more spindles than with SCSI on the same budget.
Disk configuration — The optimum configuration is RAID 1+0 with as many disks as possible and with your transaction log (pg_xlog) on a separate disk all by itself. RAID 5 is not a very good option for databases unless you have more than 6 disks in your volume. With newer versions of PostgreSQL you can also use the tablespaces option to put different tables, databases, and indexes on different disks to help optimize performance. Such as putting your often used tables on a fast SCSI disk and the less used ones slower IDE or SATA drives.

In general the more RAM and disk spindles you have in your system the better it will perform. This is because with the extra RAM you will access your disks less. And the extra spindles help spread the reads and writes over multiple disks to increase throughput and to reduce drive head congestion.

Another good idea is to separate your application code and your database server onto different hardware. Not only does this provide more hardware dedicated to the database server, but the operating system's disk cache will contain more PostgreSQL data and not other various application or system data this way.

For example, if you have one web server and one database server you can use a cross-over cable on a separate ethernet interface to handle just the web server to database network traffic to ensure you reduce any possible bottlenecks there. You can also obviously create an entirely different physical network for database traffic if you have multiple servers that access the same database server.

Useful Tuning Tools

The most useful tool in tuning your database is the SQL command EXPLAIN ANALYZE. This allows you to profile each SQL query your application performs and see exactly how the PostgreSQL planner will process the query. Let's look at a short example, below is a simple table structure and query.

CREATE TABLE authors (
   id    int4 PRIMARY KEY,
   name  varchar
);

CREATE TABLE books (
   id          int4 PRIMARY KEY,
   author_id   int4,
   title       varchar
);

If we use the query:


EXPLAIN ANALYZE SELECT authors.name, books.title FROM books, authors
WHERE books.author_id=16 and authors.id = books.author_id ORDER BY books.title;

You will get output similar to the following:

                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1)
  Sort Key: books.title
  ->  Nested Loop  (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1)
        ->  Index Scan using authors_pkey on authors  (cost=0.00..5.82 rows=1 width=36) (actual time=0.029..0.033 rows=1 loops=1)
              Index Cond: (id = 16)
        ->  Seq Scan on books  (cost=0.00..23.75 rows=6 width=36) (actual time=0.026..0.052 rows=7 loops=1)
              Filter: (author_id = 16)
Total runtime: 16.386 ms

You need to read this output from bottom to top when analyzing it. The first thing PostgreSQL does is do a sequence scan on the books table looking at each author_id column for values that equal 16. Then it does an index scan of the authors table, because of the implicit index created by the PRIMARY KEY options. The finally the results our sorted by books.title.

The values you see in parenthesis are the estimated and actual cost of that portion of the query. The closer together the estimate and the actual costs are the better performance you will typically see.

Now, let's change the structure a little bit by adding an index on books.author_id to avoid the sequence scan with this command:

CREATE INDEX books_idx1 on books(author_id);

If you rerun the query again, you won't see any noticeable change in the output. This is because PostgreSQL has not yet re-analyzed the data and determined that the new index may help for this query. This can be solved by running:

ANALYZE books;

However, in this small test case I'm working with the planner still favors the sequence scan because there aren't very many rows in my books table. If a query is going to return a large portion of a table then the planner chooses a sequence scan over an index because it is actually faster. You can also force PostgreSQL to favor index scans over sequential scans by setting the configuration parameter enable_seqscan to off. This doesn't remove all sequence scans, since some tables may not have an index, but it does force the planner's hand into always using an index scan when it is available. This is probably best done by sending the command SET enable_seqscan = off at the start of every connection rather than setting this option database wide. This way you can control via your application code when this is in effect. However, in general disabling sequence scans should only be used in tuning your application and is not really intended for every day use.

Typically the best way to optimize your queries is to use indexes on specific columns and combinations of columns to correspond to often used queries. Unfortunately this is done by trial and error. You should also note that increasing the number of indexes on a table increases the number of write operations that need to be performed for each INSERT and UPDATE. So don't do anything silly and just add indexes for each column in each table.

You can help PostgreSQL do what you want by playing with the level of statistics that are gathered on a table or column with the command:

ALTER TABLE ALTER COLUMN  SET STATISTICS ;

This value can be a number between 0 and 1000 and helps PostgreSQL determine what level of statistics gathering should be performed on that column. This helps you to control the generated query plans without having slow vacuum and analyze operations because of generating large amounts of stats for all tables and columns.

Another useful tool to help determine how to tune your database is to turn on query logging. You can tell PostgreSQL which queries you are interested in logging via the log_statement configuration option. This is very useful in situations where you many users executing ad hoc queries to your system via something like Crystal Reports or via psql directly.

Database Design and Layout

Sometimes the design and layout of your database affects performance. For example, if you have an employee database that looks like this:

CREATE TABLE employees (
   id                int4 PRIMARY KEY,
   active            boolean,
   first_name        varchar,
   middle_name       varchar,
   last_name         varchar,
   ssn               varchar,
   address1          varchar,
   address2          varchar,
   city              varchar,
   state             varchar(2),
   zip               varchar,
   home_phone        varchar,
   work_phone        varchar,
   cell_phone        varchar,
   fax_phone         varchar,
   pager_number      varchar,
   business_email    varchar,
   personal_email    varchar,
   salary            int4,
   vacation_days     int2,
   sick_days         int2,
   employee_number   int4,
   office_addr_1     varchar,
   office_addr_2     varchar,
   office_city       varchar,
   office_state      varchar(2),
   office_zip        varchar,
   department        varchar,
   title             varchar,
   supervisor_id     int4
);

This design is easy to understand, but isn't very good on several levels. While it will depend on your particular application, in most cases you won't need to access all of this data at one time. In portions of your application that deal with HR functions you are probably only interested in their name, salary, vacation time, and sick days. However, if the application displays an organization chart it would only be concerned with the department and supervisor_id portions of the table.

By breaking up this table into smaller tables you can get more efficient queries since PostgreSQL has less to read through, not to mention better functionality. Below is one way to make this structure better:

CREATE TABLE employees (
   id               int4 PRIMARY KEY,
   active           boolean,
   employee_number  int4,
   first_name       varchar,
   middle_name      varchar,
   last_name        varchar,
   department       varchar,
   title            varchar,
   email            varchar
);

CREATE TABLE employee_address (
   id               int4 PRIMARY KEY,
   employee_id      int4,
   personal         boolean,
   address_1        varchar,
   address_2        varchar,
   city             varchar,
   state            varchar(2),
   zip              varchar
);

CREATE TABLE employee_number_type (
   id               int4 PRIMARY KEY,
   type             varchar
);

CREATE TABLE employee_number (
   id               int4 PRIMARY KEY,
   employee_id      int4,
   type_id          int4,
   number           varchar
);

CREATE TABLE employee_hr_info (
   id               int4 PRIMARY KEY,
   employee_id      int4,
   ssn              varchar,
   salary           int4,
   vacation_days    int2,
   sick_days        int2
);

With this table structure the data associated with an employee is broken out into logical groupings. The main table contains the most frequently used information and the other tables store all of the rest of the information. The added benefit of this layout is that you can have any number of phone numbers and addresses associated with a particular employee now.

Another useful tip is to use partial indexes on columns where you typically query a certain value more often than another. Take for example the employee table above. You're probably only displaying active employees throughout the majority of the application, but creating a partial index on that column where the value is true can help speed up the query and may help the planner to choose to use the index in cases where it otherwise would not. You can create a partial index like this:

CREATE INDEX employee_idx2 ON employee(active) WHERE active='t';

Or you may have a situation where a row has a column named 'employee_id' that is null until the row is associated with an employee, maybe in some trouble ticket like system. In that type of application you would probably have a 'View Unassigned Tickets' portion of the application which would benefit from a partial index such as this:

CREATE INDEX tickets_idx1 ON tickets(employee_id) WHERE employee_id IS NULL;

Application Development

There are many different ways to build applications which use a SQL database, but there are two very common themes that I will call stateless and stateful. In the area of performance there are different issues that impact each.

Stateless is typically the access type used by web based applications. Your software connects to the database, issues a couple of queries, returns to results to the user, and disconnects. The next action the users takes restarts this process with a new connect, new set of queries, etc.

Stateful applications are typically non-web based user interfaces where an application initiates a database connection and holds it open for the duration the application is in use.

Stateless Applications

In web based applications each time something is requested by the user , the application initiates a new database connection. While PostgreSQL has a very short connection creation time and in general it is not a very expensive operation, it is best to use some sort of database connection pooling method to get maximum performance.

There are several ways to accomplish database connection pooling, here is a short list of common ones:

Pgpool is a small server that you run on the same server as your clients that will pool database connections to some local or remote server. The application simply points at the pgpool instance instead of the normal postmaster. From the application's perspective nothing has changed as the connection pooling is hidden from it.
In a mod_perl environment you can use Apache::DBI to handle database connection pooling inside of Apache itself.
SQLRelay is another db connection manager that is somewhat database agnostic. It works with with several databases other than PostgreSQL.
You can always write a small bit of code to do this for you yourself, but I would highly recommend using an already developed solution to reduce the amount of debugging you have to do.

It should be noted that in a few bizarre instances I've actually seen database connection pooling reduce the performance of web based applications. At a certain point the cost of handling the pooling is more expensive than simply creating a new connection. I suggest testing it both ways to see which is best for your environment.

Stateful Applications

When building stateful applications you should look into using database cursors via the DECLARE command. A cursor allows you to plan and execute a query, but only pull back the data as you need it, for example one row at a time. This can greatly increase the snappiness of the UI.

General Application Issues

These issues typically effect both stateful and stateless applications in the same fashion. One good technique is to use server side prepared queries for any queries you execute often. This reduces the overall query time by caching the query plan for later use.

It should be noted however if you prepare a query in advance using placeholder values ( such as 'column_name = ?' ) then the planner will not always be able to choose the best plan. For example, your query has a placeholder for the boolean column 'active' and you have a partial index on false values the planner won't use it because it cannot be sure the value passed in on execution will be true or false.

You can also obviously utilize stored procedures here to reduce the transmit, parse, and plan portions of the typical query life cycle. It is best to profile your application and find commonly used queries and data manipulations and put them into a stored procedure.

Other Useful Resources

Here is a short list of other items that may be of help.

PostgreSQL Homepage — The obvious place for all things PostgreSQL.
psql-performance mailing list — This PostgreSQL mailing list is focused on performance related questions and discussions.
General PostgreSQL 7.4 performance tuning tips from Varlena.com. The page isn't for the latest version, but still contains good information.
Annotated .conf file An annotated configuration file and some other useful information.
PostgreSQL 8.0 Performance Checklist

web.debug

What was your create statement?

You can always put a line in like:
web.debug(web.delete('todotable',int(todo_id), _test=True))
and run the resulting query to see what's wrong. The short version
seems to be that you didn't name your column 'id'.

unicode gotchas

Hum, I don't have any problems, and I don't do anything special ...

- I use PostgreSQL, so I created my database using UTF-8 encoding.
- my Python modules start with "# -*- coding: utf-8 -*-".
- all my modules and my templates are utf-8 encoded (I use Vim, so I
use ":set encoding=utf-8", but it should work with any good
text-editor).

The only 'encoding trick' I use is when I want to print an exception,
catched from a bad database query. I need to do something like this :

===============
except Exception, detail:
print "blablabla : %s" % str(detail).decode('latin1')
return
===============

... since the exception message (which is in french) seems to be latin1
encoded.

That's all :)

Jonathan

ps : it should be the same with sqlite

Sunday, September 17, 2006

Remove noatun and install kmix

sudo yum remove kdemultimedia
sudo yum install kdemultimedia-kmix
http://rpmfind.net/linux/rpm2html/search.php?query=kdemultimedia-kmix

Saturday, September 16, 2006

Postgresql DB applications CMU

http://www.cs.cmu.edu/~natassa/courses/15-415/current/slides/

Friday, September 15, 2006

End to End Unicode Web Applications in Python

The following is a brief discussion of creating a web application with Python that uses Unicode. This discussion is not a thorough exploration of Unicode or Python's Unicode support. Rather, it is a purely practical overview that constitutes much of what most Python web developers need to know.

Python Unicode Objects

Unicode is a complex solution to a complex problem of meeting a simple need. The need is to permit software to handle the writing systems of (nearly) all the human languages of the world. The Unicode standard does this remarkably well, and most importantly, does it in such a way that you, the programmer, don't have to worry much about it.

What you do have to understand is that Unicode strings are multi-byte (binary) strings and therefore have some special requirements that ASCII strings do not. The good news is that you're using Python, which has a sensible approach to handling Unicode strings. Let's look at one:

>>> myString = 'This is a string' # this is a standard string
>>> myUnicodeString = u'This is a string' # this is a Unicode string

Python tries to treat Unicode strings as much like ASCII strings as possible. For the most part, if you have a Unicode string in Python, you can work with it exactly like you would an ASCII string. You can even mingle them. For example, if you concatenate the above variables, you'll get a Unicode string that looks like this:

>>> myString + myUnicode
u'This is a stringThis is a string'

Since the one string is Unicode, Python automatically translates the other to Unicode in the process of concatenation and returns a Unicode result. (Be sure to read section 3.1.3 of the Python tutorial for more examples and detail.) The great consequence here is that, internally, your code doesn't have to worry much about what's Unicode: it just works.

Encodings

So far, we've looked at Unicode strings as live objects in Python. They are straightforward enough. The trick is actually getting the Unicode string in the first place, or sending it somewhere else (to storage, for instance) once you're done with it.

Unicode in its native form will not pass through many common interfaces, such as HTTP, because those interfaces are only designed to work with 7- or 8-bit ASCII. Therefore, Unicode data is generally stored or transmitted through network systems in encoded form, as a string of ASCII characters. There are many possible ways to encode thusly. (The various encodings are documented in depth elsewhere.)

Encodings are a significant source of confusion for newcomers to Unicode. The common mistake is to think that an encoded string (of UTF-8, for instance) is the same thing as Unicode, when it's actually one of many possible ways to encode Unicode in ASCII form. There is only one Unicode. (You can play around with the Unicode database through Python's Unicodedata module.) There are many encodings, all of which point back to the one Unicode. Different encodings are more or less useful depending on your application.

In the web development context, there is only one encoding that will likely be of interest to you: UTF-8. For contrast, however, we will also look at UTF-16, another encoding that is particularly affiliated with XML. UTF-8 is the most common encoding in the web environment because it looks a lot like the ASCII equivalent of the text (at least until you start encountering extended characters or any of the thousands of glyphs that aren't part of ASCII). Consequently, UTF-8 is perceived as friendlier than UTF-16 or other encodings. More importantly, UTF-8 is the only Unicode encoding supported by most web browsers, although most web browsers support a large number of legacy non-Unicode encodings. On the other hand, UTF-16 looks like ASCII-encoded binary data. (Which it is.) Let's look at these two encodings.

>>> myUnicodeString.encode('utf-8')
'This is a string'
>>> myUnicodeString.encode('utf-16')
'\xff\xfeT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'

The important thing to note is that the result of calling the encode method is an ASCII string. We've taken a Unicode string and encoded it into ASCII that can be stored or transmitted through any mechanism that handles ASCII, like the Web.

For comparison, let's look at the encoded versions of the following string:

The following are some random Cyrillic characters: БГДЕЖЗЙФЮ

In UTF-8 (note the ASCII equivalents showing through):

The following are some random Cyrillic characters: \xd0\x91\xd0\x93\xd0\x94\xd0\x95\xd0\x96\xd0\x97\xd0\x99\xd0\xa4\xd0\xae

In UTF-16:

'\xff\xfeT\x00h\x00e\x00 \x00f\x00o\x00l\x00l\x00o\x00w\x00i\x00n\x00g\x00 \x00a\x00r\x00e\x00 \x00s\x00o\x00m\x00e\x00 \x00r\x00a\x00n\x00d\x00o\x00m\x00 \x00C\x00y\x00r\x00i\x00l\x00l\x00i\x00c\x00 \x00c\x00h\x00a\x00r\x00a\x00c\x00t\x00e\x00r\x00s\x00:\x00 \x00\x11\x04\x13\x04\x14\x04\x15\x04\x16\x04\x17\x04\x19\x04$\x04.\x04'

Now, let's decode these encoded strings in the python command line:

>>> foo = unicode('\xff\xfeT\x00h\x00e\x00 \x00f\x00o\x00l\x00l\x00o\x00w\x00i\x00n\x00g\x00 \x00a\x00r\x00e\x00 \x00s\x00o\x00m\x00e\x00 \x00r\x00a\x00n\x00d\x00o\x00m\x00 \x00C\x00y\x00r\x00i\x00l\x00l\x00i\x00c\x00 \x00c\x00h\x00a\x00r\x00a\x00c\x00t\x00e\x00r\x00s\x00:\x00 \x00\x11\x04\x13\x04\x14\x04\x15\x04\x16\x04\x17\x04\x19\x04$\x04.\x04','utf-16')
>>> foo
u'The following are some random Cyrillic characters: \u0411\u0413\u0414\u0415\u0416\u0417\u0419\u0424\u042e'

When we decode the string as foo and look at it, we get a Unicode string with Unicode escape characters for non-ASCII characters. The Python console (at least the one I'm using) doesn't implement a Unicode renderer and so it has to display the escape codes for the non-ASCII glyphs. However, if this same original string had been decoded by a web browser or text editor that did implement a Unicode renderer, you'd see all the correct glyphs (provided the necessary fonts were available!)

So, in the process of looking at these examples, we've introduced the one method and one function Python provides for encoding and decoding with Unicode strings:

.encode( [encoding] )	returns an encoded 8-bit string in the specified encoding (codec); if no encoding is specified, this method assumes the encoding in sys.getdefaultencoding()
unicode( [string], [encoding] )	decodes the supplied 8-bit string with the specified encoding (codec) and returns a unicode string; if no encoding is specified, this function assumes the encoding in sys.getdefaultencoding()

In Python 2.2 and later, there's also a symmetric method for decoding (available only for 8-bit strings):

.decode( [encoding] )

if the specified encoding is a unicode encoding, this method returns a unicode string as per the unicode function; if the specified encoding is not a unicode encoding (such as if you specifiy the zlib codec) this method returns another appropriate data type; if no encoding is specified, this function assumes the encoding in sys.getdefaultencoding()

One of the nifty things about Python's encoding and decoding functions is that it's really easy to convert between encodings. For example, if we start with the following UTF-16, we can easily convert it to UTF-8 by decoding the UTF-16 and re-encoding it as UTF-8.

>>> spanishString = unicode('\xff\xfe\xbf\x00Q\x00u\x00\xe9\x00 \x00p\x00a\x00s\x00a\x00?\x00','utf-16') # starting with utf-16
>>> spanishString.encode('utf-8') # translating to UTF-8
'\xc2\xbfQu\xc3\xa9 pasa?'

Your Application and Unicode

Now, let's take a step back and hypothesize a web application that has the following fundamental components:

a back-end database (PostGreSQL, for example)
some Webware servlets that include at least one form
Apache

You want this application to handle multi-lingual text, so you're going to take advantage of Unicode. The first thing you will probably want to do is set up a sitecustomize.py file in the Lib directory of your python installation and designate a Unicode encoding (probably UTF-8) as the default encoding for Python.

import sys
sys.setdefaultencoding("utf-8")

Important: as of Python 2.2, as far as I can tell, you can only call the setdefaultencoding method from within sitecustomize.py. You cannot perform this step from within your application! I don't understand why Guido set it up this way, but I'm sure he had his reasons.

This setting has a profound effect on python execution because your programs will all automatically encode Unicode strings to this encoding whenever:

a Unicode string is printed
a Unicode string is written to a file
a Unicode string is converted with str( )

You can, of course, bypass default encoding by manually encoding the string first with the .encode function, just as in the earlier examples.

If you don't set the default encoding to UTF-8, you will have to be rigorous about manually encoding Unicode data at appropriate times throughout your applications.

Note that the default encoding has little to do with decoding. (It merely serves as the default if you use the unicode function or decode method without specifying a codec.) You still must manually decode all encoded Unicode strings before you can use them. For example, if your servlet receives UTF-8 from a web browser POST, Apache will deliver that information as an ASCII string full of escape sequences, and your code will have to decode it as above with the unicode() function.

As of this writing, Webware does not meddle with decoding: it simply passes the POST through in the request object. If you are using dAlchemy's FormKit to handle web forms for your application, you can have FormKit automatically handle decoding. Otherwise, you need to find an appropriate place in your code to ensure that all incoming encoded Unicode gets decoded into Python Unicode objects before they get used for anything.

Encoding Hell

This brings up an important point that will haunt you as you start working with Unicode. It can be difficult to debug Unicode problems because one's development tools usually do not themselves implement Unicode rendering, or they only do so partially (which can be even worse!) You may not be able to trust what you see. For example, just because it looks "wrong" on the console doesn't mean it will look "wrong" in a web browser, properly decoded.

Now, when we try to print foo (above) in the console, which coerces the Unicode through the default encoding (UTF-8), we get a different kind of jibberish:

>>> print( foo )
The following are some random Cyrillic characters: Ð'Ð"Ð"Ð•Ð–Ð—Ð™Ð¤Ð®

Here, the escape codes in the UTF-8 are being incorrectly interpreted by the console as extended ASCII escape codes. The result is garbage. (Your results may vary depending on the console you're using.) Knowing that my Python console does support extended ASCII (basically Latin-1), I could try encoding it as Latin-1 and printing the result:

>>> print foo.encode('latin-1')
Traceback (most recent call last):
File "", line 1, in ?
UnicodeError: Latin-1 encoding error: ordinal not in range(256)

The encoding attempt fails with an exception because there are no Cyrillic characters in Latin-1! Basically, I'm out of luck.

On the other hand, because in another example from above I'm only using characters that appear in extended ASCII, I can print the following string in the PythonWin console:

>>> print spanishString.encode('latin-1')
¿Qué pasa?

But if I try the exact same thing in a "DOS box" console, which evidently uses a different character set, I get crud:

>>> print spanishString.encode('latin-1')
┐QuΘ pasa?

Browser Behavior

In order for your Unicode web pages you look right, you have to make sure that any information you serve to web browsers goes along with the instruction that they treat it as encoded Unicode (UTF-8 in most cases). There are a couple ways to do this. The best is to configure your web server to specify an encoding in the header it sends along with your page. With Apache, you do this by adding an AddDefaultCharset line to your httpd.conf (see http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset ), such as:

AddDefaultCharset utf-8

You can also embed tags in your pages that are intended to tip off the browser to the nature of the data. Such META tags are theoretically of a lower precedence than the web server's header, but they might prove useful for some browsers or situations.

< meta equiv="Content-Type" content="text/html; charset= utf-8">

You can easily verify whether your encoding directives are working by hitting your pages with a browser and then looking in the drop-down menus of the browser for the encoding option. If the correct encoding is selected (automatically) by your browser, then your header instructions are set properly.

If the browser is expecting the right encoding and your Python's default encoding is set to match, you can confidently write your Unicode string objects as output. For instance, with Webware, you simply use self.write() as normal, and whether your Python strings are ASCII or Unicode, the browser gets UTF-8 and correctly displays the results.

Convention dictates that a well-behaved browser will also return form input in whatever encoding you've specified for the page. That means that if you send a user a form on a UTF-8 page, whatever they type into the boxes will be returned to you in UTF-8. If it doesn't, you're in for an interesting ride, because most web browsers default to ISO-8859-1 (Latin-1) encoding, which is not actually a Unicode encoding, and is in any case incompatible with UTF-8. If you try to decode Latin-1 as UTF-8, you will raise an exception. For example:

>>> es3 = 'This is \xe4 string.' # ISO-8859-1/Latin-1 string
>>> es3.decode('utf-8')
Traceback (most recent call last):
File "", line 1, in ?
UnicodeError: UTF-8 decoding error: invalid data

Luckily, you can use Python's unicode() and .encode methods to translate to and from Latin-1, and you can use Python's try/except structure to prevent crashes. What you have to understand is that it's all left up to you, and that includes trapping any invalid data that tries to enter your program.

Databases

The last detail is the database. Every database has its unique handling of Unicode (or lack thereof.)

In theory, you can always store Unicode in its ASCII-encoded form in any relational database. The downside is that you're storing ASCII gobbledygook, so you will have an awkward time taking advantage of the powerful filtration features of the SQL language. If all you want to do is stash and retrieve data in bulk, this may not be a problem. However, if you ask the database more sophisticated questions, such as for a list of all the names that include "Björn," the database won't find any, unless you ask it to match "Bj\xc3\xb6rn" instead. You can probably work around this issue, but most modern relational databases are now supporting the storage and handling of Unicode transparently.

It happens that PostGreSQL (as of this writing) only supports UTF-8 natively in and out of the database, so that is what I use with it. Microsoft SQL Server – like everything else Microsoft makes – uses an elusive system called MBCS (Multi-byte Character System) which is built (exclusively) into Windows. Other RDBMS will have their own preferences. In my experience, the database itself isn't really much of an issue when it comes to Unicode. The issue is the middleware your application uses to communicate with that database.

With PostGreSQL, I use pyPgSQL as the database interface for my web applications. PyPgSQL does a lot for me with regard to Unicode. When properly configured, I can confidently rely on it to handle any Unicode encoding and decoding between my application and the database. That means I can INSERT and UPDATE data in the database with python Unicode strings and it just works. I can also SELECT from the database and I get back Unicode objects that I don't have to decode myself.

>>> from pyPgSQL import PgSQL
>>> db = PgSQL.connect( dsn=source,user=user,password=password,database=catalog, client_encoding=('utf-8','ignore'), unicode_results=1 )
>>> c = db.cursor()
>>> c.execute('SET CLIENT_ENCODING TO UNICODE')
>>> query = u'UPDATE myTable SET text = '%s' WHERE id=52;' % u'\xbfQu\xe9 pasa?' # copy some spanish into a cell
>>> c.execute(query)

With Microsoft SQL Server, I use ADO as my database interface. ADO performs similarly for SQL Server as pyPgSQL does for PostGreSQL, although ADO is only available for python applications running on win32.

>>> import pythoncom
>>> from win32com.client.dynamic import Dispatch
>>> pythoncom.CoInitialize() # mysterious good stuff
>>> connectionString = "Provider=SQLOLEDB.1;Persyst Security Info=False;User Id=%s;Password=%s;Initial Catalog=%s;Data Source=%s;" % ( user, password, catalog, source )
>>> db = Dispatch('ADODB.Connection')
>>> db.Open(connectionString)
>>> db.Execute(query)

Additional Reading

As far as I know, the definitive practical guide to Unicode is Richard Gillam's "Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard" (Addison Wesley, 2002). This book is fascinating, but only a few pages have any practical relevance to creating Web apps. It's a cool book, though.
The Unicode spec at http://unicode.org
Section 3.1.3 of the Python tutorial (essential reading) at http://python.org/doc/current/tut/tut.html
Another nice (but old) tutorial: http://www.reportlab.com/i18n/python_unicode_tutorial.html
The Python Enhancements Proposal for Unicode support in Python (interesting reading): http://www.python.org/peps/pep-0100.html
Unicode changes in 2.2: http://python.org/doc/current/whatsnew/node8.html