How to disable the system beep in Windows
You can find and configure it under Device Manager|View|Show Hidden Devices|Non Plug and
Play|Beep|Action|Properties|Driver, then set the "Startup Type:" to "Disabled"
Labels: windows
One of the neat things about *nix is the ability to work with many different shells. As with anything else in the *nix world there is bound to be heated debates over which shell is better. Whether you want to use good ol' sh or (my personal favorite) bash you can change your default shell using the commands below.
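For example, a typical sketch (chsh ships with most Linux distributions, and the shell you pick must be listed in /etc/shells):
cat /etc/shells        # list the shells installed on this system
chsh -s /bin/bash      # make bash your login shell (you will be prompted for your password)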
Labels: linux
The traffic within Amazon EC2 and S3 is free, so you can have setups as funky as you wish. Remember what it took to build the Flickr or LiveJournal datacenters? Now you can build similar setups from home (unbelievable) and just let Amazon take care of the networking and hardware. This is so much more ‘WebOS’ than Google’s walled garden.
I applaud Amazon.
Please note that EC2 is not limited to web hosting applications, far from it. It makes even more sense for virtual render farms, simulations, and other tasks that need a lot of computing power but run only once in a while. Say you need 100 instance-hours to complete a computation. You could build your own cluster of 20 machines of similar power (roughly $10,000 for hardware alone) and finish the task in 5 hours, or you could use EC2 to create that virtual cluster, run the computation, shut it down when done, and pay far less: about $10 per job. Or you could spin up a 200-machine virtual cluster, finish the job in half an hour, and pay the same $10. Think about that.
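A quick back-of-the-envelope check of those numbers, assuming EC2's original price of $0.10 per instance-hour:
rate = 0.10               # USD per instance-hour (EC2's original small-instance price)
print 20 * 5 * rate       # 20 machines for 5 hours   -> 10.0 dollars
print 200 * 0.5 * rate    # 200 machines for 30 min   -> 10.0 dollars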
Labels: amazon s3
Completer with history viewer support and more features (History.py)
This module lets the Tab key both indent and complete valid Python identifiers, keywords, and filenames.
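For context, the recipe builds on Python's standard readline/rlcompleter pair; the stock setup (not the recipe's own code) looks like this:
import readline
import rlcompleter
readline.parse_and_bind("tab: complete")   # Tab now completes Python names at the interactive prompt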
Labels: python autocomplete
Download IPython, or for Windows run the installer:
http://ipython.scipy.org/dist/ipython-0.7.2.win32.exe
Download Bitbucket
http://garnaat.org/bitbucket/BitBucket-0.3e.tar.gz
Download 7-Zip and set it up.
Use it to extract the BitBucket tarball.
Right-click My Computer, go to the Advanced tab, and add your Python directory (C:\Python24 or C:\Python25, wherever Python is installed) to the PATH.
Then open a new cmd window, go to where BitBucket was extracted, and run:
python setup.py install
Now you've installed BitBucket; most Python packages install the same way.
Now open IPython and run:
import bitbucket
Type bitbucket.c and press Tab; it autocompletes.
Type it out as:
a = bitbucket.connect('Access', 'Secret')
a is a BitBucket object for interacting with S3.
Try typing a. and pressing Tab to see what it offers.
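Outside IPython, the plain-Python equivalent of that Tab exploration is dir(); a quick sketch with placeholder credentials:
import bitbucket
a = bitbucket.connect('Access', 'Secret')   # substitute your real AWS access key and secret key
print dir(a)                                # lists the methods the BitBucket object exposes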
Labels: amazon s3 ipython bitbucket 7zip
It's a roughly 10-times-faster yum metadata parser that is reported to also use a lot less memory. This might make Fedora suddenly usable on a whole bunch of my machines. I look forward to trying it out.
Should be really easy:
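Presumably a one-liner, assuming the package keeps its upstream name, yum-metadata-parser:
sudo yum install yum-metadata-parser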
Labels: yum fedora
AJAX without XML: compares using XML, JavaScript objects, and JSON.
Labels: JSON
For the technically minded readers out there who want to get a look into the technical issues behind running a wildly popular SNS, here’s a link to a presentation given by the CTO at the MySQL Users Conference this year. I believe Batara Kesuma gave this presentation in Japan as well, as the content of the presentation is familiar and was covered by some of the local IT press. (I doubt they attended the conference in Santa Clara)
Labels: mysql
July 3, 2003 Copyright 2003 Shridhar Daithankar and Josh Berkus.
Authorized for re-distribution only under the PostgreSQL license (see www.postgresql.org/license).
1 Introduction
2 Some basic parameters
2.1 Shared buffers
2.2 Sort memory
2.3 Effective Cache Size
2.4 Fsync and the WAL files
3 Some less known parameters
3.1 random_page_cost
3.2 vacuum_mem
3.3 max_fsm_pages
3.4 max_fsm_relations
3.5 wal_buffers
4 Other tips
4.1 Check your file system
4.2 Try the Auto Vacuum daemon
4.3 Try FreeBSD
5 The CONF Setting Guide
There are two important things for any performance optimization:
If you don't know your expected level of performance, you will end up chasing a carrot that is always a couple of meters ahead of you. Performance tuning measures give diminishing returns after a certain threshold; if you don't set this threshold beforehand, you will end up spending a lot of time for minuscule gains.
This document focuses entirely on tuning postgresql.conf to get the best out of your existing setup. This is not the end of performance tuning. After using this document to extract the maximum reasonable performance from your hardware, you should start optimizing your application for efficient data access, which is beyond the scope of this article.
Databases are very bound to your system's I/O (disk) access and memory usage. As such, selection and configuration of disks, RAID arrays, RAM, operating system, and competition for these resources will have a profound effect on how fast your database is. We hope to have a later article covering this topic.
Your application also needs to be designed to access data efficiently, through careful query writing, planned and tested indexing, good connection management, and avoiding performance pitfalls particular to your version of PostgreSQL. Expect another guide someday helping with this, but really it takes several large books and years of experience to get it right ... or just a lot of time on the mailing lists.
If you are not done with your choice of OS for your server platform, consider BSD for this reason.
As noted in the worksheet, it covers PostgreSQL versions 7.3 and 7.4. If you are using an earlier version, you will not have access to all of these settings, and defaults and effects of some settings will be different.
Labels: postgres
There's a truly frightening number of new options in the PostgreSQL.conf file. Even once-familiar options from the last 5 versions have changed names and parameter formats. All of this is intended to give you, the database administrator, more control, but it can take some getting used to.
What follows are the settings that most DBAs will want to change, focused more on performance than anything else. There are quite a few "specialty" settings which most users won't touch, but those that use them will find indispensable. For those, you'll have to wait for the book.
Remember: PostgreSQL.conf settings must be uncommented to take effect, but re-commenting them does not necessarily restore the default values!
listen_addresses: Replaces both the tcp_ip and virtual_hosts settings from 7.4. Defaults to localhost in most installations, allowing only connections on the console. Many DBAs will want to set this to "*", meaning all available interfaces, after setting proper permissions in the pg_hba.conf file, in order to make PostgreSQL accessible to the network. As an improvement over previous versions, the "localhost" default does permit connections on the "loopback" interface, 127.0.0.1, enabling many server browser-based utilities.
max_connections: exactly like previous versions, this needs to be set to the actual number of simultaneous connections you expect to need. High settings will require more shared memory (shared_buffers). As the per-connection overhead, both from PostgreSQL and the host OS, can be quite high, it's important to use connection pooling if you need to service a large number of users. For example, 150 active connections on a medium-end single-processor 32-bit Linux server will consume significant system resources, and 600 is about the limit on that hardware. Of course, beefier hardware will allow more connections.
work_mem: used to be called sort_mem, but has been re-named since it now covers sorts, aggregates, and a few other operations. This is non-shared memory, which is allocated per-operation (one to several times per query); the setting is here to put a ceiling on the amount of RAM any single operation can grab before being forced to disk. This should be set according to a calculation based on dividing the available RAM (after applications and shared_buffers) by the expected maximum concurrent queries times the average number of memory-using operations per query.
Consideration should also be paid to the amount of work_mem needed by each query; processing large data sets requires more. Web database applications generally set this quite low, as the number of connections is high but queries are simple; 512K to 2048K generally suffices. Contrariwise, decision support applications with their 160-line queries and 10 million-row aggregates often need quite a lot, as much as 500MB on a high-memory server. For mixed-use databases, this parameter can be set per connection, at query time, in order to give more RAM to specific queries.
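As a rough worked example of the calculation described above (all figures are made up for illustration):
available_kb = 1024 * 1024        # assume ~1 GB of RAM left after the OS, applications and shared_buffers
max_concurrent_queries = 50
ops_per_query = 4                 # average memory-using operations (sorts, hashes) per query
print available_kb / (max_concurrent_queries * ops_per_query)   # ~5242 KB, so a work_mem of 4-6 MB is in the right ballpark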
maintenance_work_mem: formerly called vacuum_mem, this is the quantity of RAM PostgreSQL uses for VACUUM, ANALYZE, CREATE INDEX, and adding foreign keys. Raise it the larger your database tables are and the more RAM you have to spare, in order to make these operations as fast as possible. A setting of 50% to 75% of the on-disk size of your largest table or index is a good rule, or 32MB to 256MB where this can't be determined.
checkpoint_segments: defines the on-disk cache size of the transaction log for write operations. You can ignore this in a mostly-read web database, but for transaction processing databases or reporting databases involving large data loads, raising it is performance-critical. Depending on the volume of data, raise it to between 12 and 256 segments, starting conservatively and raising it if you start to see warning messages in the log. The space required on disk is equal to (checkpoint_segments * 2 + 1) * 16MB, so make sure you have enough disk space (32 means over 1GB).
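A quick sanity check of that disk-space formula:
checkpoint_segments = 32
print (checkpoint_segments * 2 + 1) * 16    # 1040 MB of WAL files, i.e. just over 1 GB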
max_fsm_pages: sizes the register which tracks partially empty data pages for population with new data; if set right, makes VACUUM faster and removes the need for VACUUM FULL or REINDEX. Should be slightly more than the total number of data pages which will be touched by updates and deletes between vacuums. The two ways to determine this number are to run VACUUM VERBOSE ANALYZE, or if using autovacuum (see below) set this according to the -V setting as a percentage of the total data pages used by your database. fsm_pages require very little memory, so it's better to be generous here.
vacuum_cost_delay: If you have large tables and a significant amount of concurrent write activity, you may want to make use of a new feature which lowers the I/O burden of VACUUMs at the cost of making them take longer. As this is a very new feature, it's a complex of 5 dependent settings for which we have only a few performance tests. Increasing vacuum_cost_delay to a non-zero value turns the feature on; use a reasonable delay, somewhere between 50 and 200ms. For fine tuning, increasing vacuum_cost_page_hit and decreasing vacuum_cost_page_limit will soften the impact of vacuums and make them take longer; in Jan Wieck's tests on a transaction processing test, a delay of 200, page_hit of 6 and limit of 100 decreased the impact of vacuum by more than 80% while tripling the execution time.
These settings allow the query planner to make accurate estimates of operation costs, and thus pick the best possible query plan. There are two global settings worth bothering with:
effective_cache_size: tells the query planner the largest possible database object that could be expected to be cached. Generally should be set to about 2/3 of RAM, if on a dedicated server. On a mixed-use server, you'll have to estimate how much of the RAM and OS disk cache other applications will be using and subtract that.
random_page_cost: a variable which estimates the average cost of doing seeks for index-fetched data pages. On faster machines with faster disk arrays, this should be lowered, to 3.0, 2.5 or even 2.0. However, if the active portion of your database is many times larger than RAM, you will want to raise the factor back towards the default of 4.0. Alternatively, you can base adjustments on query performance. If the planner seems to be unfairly favoring sequential scans over index scans, lower it; if it's using slow indexes when it shouldn't, raise it. Make sure you test a variety of queries. Do not lower it below 2.0; if that seems necessary, you need to adjust in other areas, like planner statistics.
log_destination: this replaces the unintuitive syslog setting in prior versions. Your choices are to use the OS's administrative log (syslog or eventlog) or to use a separate PostgreSQL log (stderr). The former is better for system monitoring; the latter, better for database troubleshooting and tuning.
redirect_stderr: If you decide to go with a separate PostgreSQL log, this setting allows you to log to a file using a native PostgreSQL utility instead of command-line redirection, allowing automated log rotation. Set it to True, and then set log_directory to tell it where to put the logs. The default settings for log_filename, log_rotation_size, and log_rotation_age are good for most people.
As you tumble toward production on 8.0, you're going to want to set up a maintenance plan which includes VACUUMs and ANALYZEs. If your database involves a fairly steady flow of data writes, but does not require massive data loads and deletions or frequent restarts, this should mean setting up pg_autovacuum. It's better than time-scheduled vacuums because:
Setting up autovacuum requires an easy build of the module in the contrib/pg_autovacuum directory of your PostgreSQL source (Windows users should find autovacuum included in the PGInstaller package). You turn on the stats configuration settings detailed in the README. Then you start autovacuum after PostgreSQL is started, as a separate process; it will shut down automatically when PostgreSQL shuts down.
The default settings for autovacuum are very conservative, though, and are more suitable for a very small database. I generally use something aggressive like:
-D -v 400 -V 0.4 -a 100 -A 0.3
This vacuums tables after 400 rows + 40% of the table has been updated or deleted, and analyzes after 100 rows + 30% of the table has been inserted, updated or deleted. The above configuration also lets me set my max_fsm_pages to 50% of the data pages in the database with confidence that that number won't be overrun, causing database bloat. We are currently testing various settings at OSDL and will have more hard figures on the above soon.
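To make those thresholds concrete, a small worked example, assuming pg_autovacuum's formula of base value plus scale factor times the table's row count:
rows = 100000                   # rows currently in the table
print 400 + 0.4 * rows          # vacuum kicks in after roughly 40400 updates/deletes
print 100 + 0.3 * rows          # analyze kicks in after roughly 30100 inserts/updates/deletes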
Note that you can also use autovacuum to set the Vacuum Delay options, instead of setting them in PostgreSQL.conf. Vacuum Delay can be vitally important for systems with very large tables or indexes; otherwise an untimely autovacuum call can halt an important operation.
There are, unfortunately, a couple of serious limitations to 8.0's autovacuum which will hopefully be eliminated in future versions:
Labels: postgres
PostgreSQL is the most advanced and flexible Open Source SQL database today. With this power and flexibility comes a problem. How do the PostgreSQL developers tune the default configuration for everyone? Unfortunately the answer is they can't.
The problem is that every database is not only different in its design, but also its requirements. Some systems are used to log mountains of data that is almost never queried. Others have essentially static data that is queried constantly, sometimes feverishly. Most systems however have some, usually unequal, level of reads and writes to the database. Add this little complexity on top of your totally unique table structure, data, and hardware configuration and hopefully you begin to see why tuning can be difficult.
The default configuration PostgreSQL ships with is a very solid configuration aimed at everyone's best guess as to how an "average" database on "average" hardware should be set up. This article aims to help PostgreSQL users of all levels better understand PostgreSQL performance tuning.
The first step to learning how to tune your PostgreSQL database is to understand the life cycle of a query. Here are the steps of a query:
The first step is the sending of the query string (the actual SQL command you type in or your application uses) to the database backend. There isn't much you can tune about this step; however, if you have very large queries that cannot be prepared in advance, it may help to put them into the database as a stored procedure and cut the data transfer down to a minimum.
Once the SQL query is inside the database server it is parsed into tokens. This step can also be minimized by using stored procedures.
The planning of the query is where PostgreSQL really starts to do some work. This stage checks to see if the query is already prepared if your version of PostgreSQL and client library support this feature. It also analyzes your SQL to determine what the most efficient way of retrieving your data is. Should we use an index and if so which one? Maybe a hash join on those two tables is appropriate? These are some of the decisions the database makes at this point of the process. This step can be eliminated if the query is previously prepared.
Now that PostgreSQL has a plan of what it believes to be the best way to retrieve the data, it is time to actually get it. While there are some tuning options that help here, this step is mostly affected by your hardware configuration.
And finally the last step is to transmit the results to the client. While there aren't any real tuning options for this step, you should be aware that all of the data that you are returning is pulled from the disk and sent over the wire to your client. Minimizing the number of rows and columns to only those that are necessary can often increase your performance.
There are several postmaster options that can be set that drastically affect performance. Below is a list of the most commonly used and how they affect performance:
Note that many of these options consume shared memory and it will probably be necessary to increase the amount of shared memory allowed on your system to get the most out of these options.
Obviously the type and quality of the hardware you use for your database server drastically impacts the performance of your database. Here are a few tips to use when purchasing hardware for your database server:
In general the more RAM and disk spindles you have in your system the better it will perform. This is because with the extra RAM you will access your disks less. And the extra spindles help spread the reads and writes over multiple disks to increase throughput and to reduce drive head congestion.
Another good idea is to separate your application code and your database server onto different hardware. Not only does this provide more hardware dedicated to the database server, but the operating system's disk cache will contain more PostgreSQL data and not other various application or system data this way.
For example, if you have one web server and one database server you can use a cross-over cable on a separate ethernet interface to handle just the web server to database network traffic to ensure you reduce any possible bottlenecks there. You can also obviously create an entirely different physical network for database traffic if you have multiple servers that access the same database server.
The most useful tool in tuning your database is the SQL command EXPLAIN ANALYZE. This allows you to profile each SQL query your application performs and see exactly how the PostgreSQL planner will process the query. Let's look at a short example, below is a simple table structure and query.
CREATE TABLE authors (
id int4 PRIMARY KEY,
name varchar
);
CREATE TABLE books (
id int4 PRIMARY KEY,
author_id int4,
title varchar
);
If we use the query:
EXPLAIN ANALYZE SELECT authors.name, books.title FROM books, authors
WHERE books.author_id=16 and authors.id = books.author_id ORDER BY books.title;
You will get output similar to the following:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1)
Sort Key: books.title
-> Nested Loop (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1)
-> Index Scan using authors_pkey on authors (cost=0.00..5.82 rows=1 width=36) (actual time=0.029..0.033 rows=1 loops=1)
Index Cond: (id = 16)
-> Seq Scan on books (cost=0.00..23.75 rows=6 width=36) (actual time=0.026..0.052 rows=7 loops=1)
Filter: (author_id = 16)
Total runtime: 16.386 ms
You need to read this output from bottom to top when analyzing it. The first thing PostgreSQL does is a sequence scan on the books table, looking at each author_id column for values that equal 16. Then it does an index scan of the authors table, using the implicit index created by the PRIMARY KEY option. Finally, the results are sorted by books.title.
The values you see in parentheses are the estimated and actual costs of that portion of the query. The closer the estimated and actual costs are to each other, the better performance you will typically see.
Now, let's change the structure a little bit by adding an index on books.author_id to avoid the sequence scan with this command:
CREATE INDEX books_idx1 on books(author_id);
If you rerun the query again, you won't see any noticeable change in the output. This is because PostgreSQL has not yet re-analyzed the data and determined that the new index may help for this query. This can be solved by running:
ANALYZE books;
However, in the small test case I'm working with, the planner still favors the sequence scan because there aren't very many rows in my books table. If a query is going to return a large portion of a table, then the planner chooses a sequence scan over an index because it is actually faster. You can also force PostgreSQL to favor index scans over sequential scans by setting the configuration parameter enable_seqscan to off. This doesn't remove all sequence scans, since some tables may not have an index, but it does force the planner's hand into always using an index scan when one is available. This is probably best done by sending the command SET enable_seqscan = off at the start of every connection rather than setting this option database-wide. This way you can control via your application code when it is in effect. However, in general, disabling sequence scans should only be used while tuning your application and is not really intended for everyday use.
Typically the best way to optimize your queries is to use indexes on specific columns and combinations of columns to correspond to often used queries. Unfortunately this is done by trial and error. You should also note that increasing the number of indexes on a table increases the number of write operations that need to be performed for each INSERT and UPDATE. So don't do anything silly and just add indexes for each column in each table.
You can help PostgreSQL do what you want by playing with the level of statistics that are gathered on a table or column with the command:
ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <number>;
This value can be a number between 0 and 1000 and helps PostgreSQL determine what level of statistics gathering should be performed on that column. This helps you to control the generated query plans without having slow vacuum and analyze operations because of generating large amounts of stats for all tables and columns.
Another useful tool to help determine how to tune your database is to turn on query logging. You can tell PostgreSQL which queries you are interested in logging via the log_statement configuration option. This is very useful in situations where you have many users executing ad hoc queries against your system, via something like Crystal Reports or via psql directly.
Sometimes the design and layout of your database affects performance. For example, if you have an employee database that looks like this:
CREATE TABLE employees (
id int4 PRIMARY KEY,
active boolean,
first_name varchar,
middle_name varchar,
last_name varchar,
ssn varchar,
address1 varchar,
address2 varchar,
city varchar,
state varchar(2),
zip varchar,
home_phone varchar,
work_phone varchar,
cell_phone varchar,
fax_phone varchar,
pager_number varchar,
business_email varchar,
personal_email varchar,
salary int4,
vacation_days int2,
sick_days int2,
employee_number int4,
office_addr_1 varchar,
office_addr_2 varchar,
office_city varchar,
office_state varchar(2),
office_zip varchar,
department varchar,
title varchar,
supervisor_id int4
);
This design is easy to understand, but isn't very good on several levels. While it will depend on your particular application, in most cases you won't need to access all of this data at one time. In portions of your application that deal with HR functions you are probably only interested in their name, salary, vacation time, and sick days. However, if the application displays an organization chart it would only be concerned with the department and supervisor_id portions of the table.
By breaking up this table into smaller tables you can get more efficient queries since PostgreSQL has less to read through, not to mention better functionality. Below is one way to make this structure better:
CREATE TABLE employees (
id int4 PRIMARY KEY,
active boolean,
employee_number int4,
first_name varchar,
middle_name varchar,
last_name varchar,
department varchar,
title varchar,
email varchar
);
CREATE TABLE employee_address (
id int4 PRIMARY KEY,
employee_id int4,
personal boolean,
address_1 varchar,
address_2 varchar,
city varchar,
state varchar(2),
zip varchar
);
CREATE TABLE employee_number_type (
id int4 PRIMARY KEY,
type varchar
);
CREATE TABLE employee_number (
id int4 PRIMARY KEY,
employee_id int4,
type_id int4,
number varchar
);
CREATE TABLE employee_hr_info (
id int4 PRIMARY KEY,
employee_id int4,
ssn varchar,
salary int4,
vacation_days int2,
sick_days int2
);
With this table structure the data associated with an employee is broken out into logical groupings. The main table contains the most frequently used information and the other tables store all of the rest of the information. The added benefit of this layout is that you can have any number of phone numbers and addresses associated with a particular employee now.
Another useful tip is to use partial indexes on columns where you typically query a certain value more often than another. Take for example the employee table above. You're probably only displaying active employees throughout the majority of the application, but creating a partial index on that column where the value is true can help speed up the query and may help the planner to choose to use the index in cases where it otherwise would not. You can create a partial index like this:
CREATE INDEX employee_idx2 ON employees(active) WHERE active='t';
Or you may have a situation where a row has a column named 'employee_id' that is null until the row is associated with an employee, maybe in some trouble-ticket-like system. In that type of application you would probably have a 'View Unassigned Tickets' portion of the application, which would benefit from a partial index such as this:
CREATE INDEX tickets_idx1 ON tickets(employee_id) WHERE employee_id IS NULL;
There are many different ways to build applications which use a SQL database, but there are two very common themes that I will call stateless and stateful. In the area of performance there are different issues that impact each.
Stateless is typically the access type used by web-based applications. Your software connects to the database, issues a couple of queries, returns the results to the user, and disconnects. The next action the user takes restarts this process with a new connection, a new set of queries, etc.
Stateful applications are typically non-web based user interfaces where an application initiates a database connection and holds it open for the duration the application is in use.
In web-based applications, each time something is requested by the user, the application initiates a new database connection. While PostgreSQL has a very short connection creation time and in general it is not a very expensive operation, it is best to use some sort of database connection pooling method to get maximum performance.
There are several ways to accomplish database connection pooling, here is a short list of common ones:
It should be noted that in a few bizarre instances I've actually seen database connection pooling reduce the performance of web based applications. At a certain point the cost of handling the pooling is more expensive than simply creating a new connection. I suggest testing it both ways to see which is best for your environment.
When building stateful applications you should look into using database cursors via the DECLARE command. A cursor allows you to plan and execute a query, but only pull back the data as you need it, for example one row at a time. This can greatly increase the snappiness of the UI.
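A minimal sketch of that pattern through a DB-API driver (psycopg is assumed here, but any PostgreSQL driver that lets you send raw SQL works the same way), using the employees table from earlier:
import psycopg
conn = psycopg.connect('dbname=mydb')       # hypothetical connection string
cur = conn.cursor()
# DECLARE plans the query without fetching anything; FETCH pulls rows only as the UI needs them
cur.execute("DECLARE emp_cur CURSOR FOR SELECT id, last_name FROM employees ORDER BY last_name")
cur.execute("FETCH 20 FROM emp_cur")        # first screenful
first_page = cur.fetchall()
cur.execute("FETCH 20 FROM emp_cur")        # next screenful, on demand
second_page = cur.fetchall()
cur.execute("CLOSE emp_cur")
conn.commit()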
These issues typically affect both stateful and stateless applications in the same fashion. One good technique is to use server-side prepared queries for any queries you execute often. This reduces the overall query time by caching the query plan for later use.
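For example, a sketch using PostgreSQL's PREPARE/EXECUTE statements (available since 7.3) against the books table from earlier; the plan name is arbitrary and cur is a DB-API cursor as in the sketch above:
cur.execute("PREPARE books_by_author (int4) AS "
            "SELECT title FROM books WHERE author_id = $1 ORDER BY title")
# later, possibly many times per session:
cur.execute("EXECUTE books_by_author (16)")
print cur.fetchall()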
It should be noted, however, that if you prepare a query in advance using placeholder values (such as 'column_name = ?'), the planner will not always be able to choose the best plan. For example, if your query has a placeholder for the boolean column 'active' and you have a partial index on false values, the planner won't use it, because it cannot be sure whether the value passed in at execution time will be true or false.
You can also obviously utilize stored procedures here to reduce the transmit, parse, and plan portions of the typical query life cycle. It is best to profile your application and find commonly used queries and data manipulations and put them into a stored procedure.
Here is a short list of other items that may be of help.
Labels: postgres
What was your create statement?
You can always put a line in like:
web.debug(web.delete('todotable',int(todo_id), _test=True))
and run the resulting query to see what's wrong. The short version
seems to be that you didn't name your column 'id'.
Hum, I don't have any problems, and I don't do anything special ...
- I use PostgreSQL, so I created my database using UTF-8 encoding.
- my Python modules start with "# -*- coding: utf-8 -*-".
- all my modules and my templates are utf-8 encoded (I use Vim, so I
use ":set encoding=utf-8", but it should work with any good
text-editor).
The only 'encoding trick' I use is when I want to print an exception,
caught from a bad database query. I need to do something like this:
===============
except Exception, detail:
print "blablabla : %s" % str(detail).decode('latin1')
return
===============
... since the exception message (which is in french) seems to be latin1
encoded.
That's all :)
Jonathan
ps : it should be the same with sqlite
sudo yum remove kdemultimedia
sudo yum install kdemultimedia-kmix
http://rpmfind.net/linux/rpm2html/search.php?query=kdemultimedia-kmix
Unicode is a complex solution to a complex problem of meeting a simple need. The need is to permit software to handle the writing systems of (nearly) all the human languages of the world. The Unicode standard does this remarkably well, and most importantly, does it in such a way that you, the programmer, don't have to worry much about it.
What you do have to understand is that Unicode strings are multi-byte (binary) strings and therefore have some special requirements that ASCII strings do not. The good news is that you're using Python, which has a sensible approach to handling Unicode strings. Let's look at one:
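As a stand-in example (assume these two variables for the rest of the discussion; they are not the article's original snippet), here is one ordinary ASCII string and one Unicode string; the u prefix and the escape for the non-ASCII letter are the only visible differences:
greeting = "Hello, "         # a plain 8-bit ASCII string
name = u"Bj\xf6rn"           # a Unicode string; \xf6 is the o-umlaut, U+00F6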
Python tries to treat Unicode strings as much like ASCII strings as possible. For the most part, if you have a Unicode string in Python, you can work with it exactly like you would an ASCII string. You can even mingle them. For example, if you concatenate the above variables, you'll get a Unicode string that looks like this:
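Continuing the stand-in example:
both = greeting + name       # -> u'Hello, Bj\xf6rn'
type(both)                   # <type 'unicode'>: the ASCII half was promoted to Unicode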
Since the one string is Unicode, Python automatically translates the other to Unicode in the process of concatenation and returns a Unicode result. (Be sure to read section 3.1.3 of the Python tutorial for more examples and detail.) The great consequence here is that, internally, your code doesn't have to worry much about what's Unicode: it just works.
So far, we've looked at Unicode strings as live objects in Python. They are straightforward enough. The trick is actually getting the Unicode string in the first place, or sending it somewhere else (to storage, for instance) once you're done with it.
Unicode in its native form will not pass through many common interfaces, such as HTTP, because those interfaces are only designed to work with 7- or 8-bit ASCII. Therefore, Unicode data is generally stored or transmitted through network systems in encoded form, as a string of ASCII characters. There are many possible ways to encode thusly. (The various encodings are documented in depth elsewhere.)
Encodings are a significant source of confusion for newcomers to Unicode. The common mistake is to think that an encoded string (of UTF-8, for instance) is the same thing as Unicode, when it's actually one of many possible ways to encode Unicode in ASCII form. There is only one Unicode. (You can play around with the Unicode database through Python's Unicodedata module.) There are many encodings, all of which point back to the one Unicode. Different encodings are more or less useful depending on your application.
In the web development context, there is only one encoding that will likely be of interest to you: UTF-8. For contrast, however, we will also look at UTF-16, another encoding that is particularly affiliated with XML. UTF-8 is the most common encoding in the web environment because it looks a lot like the ASCII equivalent of the text (at least until you start encountering extended characters or any of the thousands of glyphs that aren't part of ASCII). Consequently, UTF-8 is perceived as friendlier than UTF-16 or other encodings. More importantly, UTF-8 is the only Unicode encoding supported by most web browsers, although most web browsers support a large number of legacy non-Unicode encodings. On the other hand, UTF-16 looks like ASCII-encoded binary data. (Which it is.) Let's look at these two encodings.
The important thing to note is that the result of calling the encode method is an ASCII string. We've taken a Unicode string and encoded it into ASCII that can be stored or transmitted through any mechanism that handles ASCII, like the Web.
For comparison, let's look at the encoded versions of a second string, this time one that mixes ASCII with non-ASCII characters, first in UTF-8 (note the ASCII equivalents showing through) and then in UTF-16; see the sketch below.
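A stand-in string with both ASCII and Cyrillic characters makes the contrast clear (again, this is an illustration, not the article's original example):
s = u"Hello, \u041c\u0438\u0440"     # "Hello, Mir": ASCII plus three Cyrillic letters
utf8 = s.encode('utf-8')             # 'Hello, \xd0\x9c\xd0\xb8\xd1\x80' -- the ASCII part shows through
utf16 = s.encode('utf-16')           # '\xff\xfe...' -- a BOM plus two bytes per character; looks like binary data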
Now, let's decode these encoded strings in the python command line:
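Decoding the stand-in byte strings from above (the variable name foo matches the article's later references):
foo = unicode(utf8, 'utf-8')          # utf8 holds 'Hello, \xd0\x9c\xd0\xb8\xd1\x80' from the sketch above
foo                                   # u'Hello, \u041c\u0438\u0440' -- escape codes stand in for the Cyrillic glyphs
bar = unicode(utf16, 'utf-16')
foo == bar                            # True: two different encodings, one and the same Unicode string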
When we decode the string as foo and look at it, we get a Unicode string with Unicode escape characters for non-ASCII characters. The Python console (at least the one I'm using) doesn't implement a Unicode renderer and so it has to display the escape codes for the non-ASCII glyphs. However, if this same original string had been decoded by a web browser or text editor that did implement a Unicode renderer, you'd see all the correct glyphs (provided the necessary fonts were available!)
So, in the process of looking at these examples, we've introduced the one method and one function Python provides for encoding and decoding with Unicode strings:
.encode([encoding]): returns an encoded 8-bit string in the specified encoding (codec); if no encoding is specified, this method assumes the encoding in sys.getdefaultencoding()
unicode(string, [encoding]): decodes the supplied 8-bit string with the specified encoding (codec) and returns a Unicode string; if no encoding is specified, this function assumes the encoding in sys.getdefaultencoding()
In Python 2.2 and later, there's also a symmetric method for decoding (available only for 8-bit strings):
.decode([encoding]): if the specified encoding is a Unicode encoding, this method returns a Unicode string just as the unicode() function does; if the specified encoding is not a Unicode encoding (such as the zlib codec), it returns another appropriate data type; if no encoding is specified, it assumes the encoding in sys.getdefaultencoding()
One of the nifty things about Python's encoding and decoding functions is that it's really easy to convert between encodings. For example, if we start with the following UTF-16, we can easily convert it to UTF-8 by decoding the UTF-16 and re-encoding it as UTF-8.
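Continuing with the stand-in bytes from the earlier sketch, a one-liner does the round trip:
utf8_again = unicode(utf16, 'utf-16').encode('utf-8')   # utf16 is the UTF-16 byte string from above
utf8_again == utf8                                      # True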
Now, let's take a step back and hypothesize a web application that has the following fundamental components:
You want this application to handle multi-lingual text, so you're going to take advantage of Unicode. The first thing you will probably want to do is set up a sitecustomize.py file in the Lib directory of your python installation and designate a Unicode encoding (probably UTF-8) as the default encoding for Python.
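A minimal sitecustomize.py for that purpose (Python 2.x; site.py removes setdefaultencoding from sys after startup, which is why it has to happen here):
# sitecustomize.py -- drop this into the Lib directory of your Python installation
import sys
sys.setdefaultencoding('utf-8')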
Important: as of Python 2.2, as far as I can tell, you can only call the setdefaultencoding method from within sitecustomize.py. You cannot perform this step from within your application! I don't understand why Guido set it up this way, but I'm sure he had his reasons.
This setting has a profound effect on python execution because your programs will all automatically encode Unicode strings to this encoding whenever:
You can, of course, bypass default encoding by manually encoding the string first with the .encode function, just as in the earlier examples.
If you don't set the default encoding to UTF-8, you will have to be rigorous about manually encoding Unicode data at appropriate times throughout your applications.
Note that the default encoding has little to do with decoding. (It merely serves as the default if you use the unicode function or decode method without specifying a codec.) You still must manually decode all encoded Unicode strings before you can use them. For example, if your servlet receives UTF-8 from a web browser POST, Apache will deliver that information as an ASCII string full of escape sequences, and your code will have to decode it as above with the unicode() function.
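In other words, something along these lines, where raw_value is a stand-in for whatever your framework hands you for a form field:
raw_value = 'Bj\xc3\xb6rn'               # stand-in for the UTF-8 bytes of a POSTed form field
comment = unicode(raw_value, 'utf-8')    # now a real Unicode object, safe to use everywhere internally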
As of this writing, Webware does not meddle with decoding: it simply passes the POST through in the request object. If you are using dAlchemy's FormKit to handle web forms for your application, you can have FormKit automatically handle decoding. Otherwise, you need to find an appropriate place in your code to ensure that all incoming encoded Unicode gets decoded into Python Unicode objects before they get used for anything.
This brings up an important point that will haunt you as you start working with Unicode. It can be difficult to debug Unicode problems because one's development tools usually do not themselves implement Unicode rendering, or they only do so partially (which can be even worse!) You may not be able to trust what you see. For example, just because it looks "wrong" on the console doesn't mean it will look "wrong" in a web browser, properly decoded.
Now, when we try to print foo (above) in the console, which coerces the Unicode through the default encoding (UTF-8), we get a different kind of gibberish:
Here, the escape codes in the UTF-8 are being incorrectly interpreted by the console as extended ASCII escape codes. The result is garbage. (Your results may vary depending on the console you're using.) Knowing that my Python console does support extended ASCII (basically Latin-1), I could try encoding it as Latin-1 and printing the result:
The encoding attempt fails with an exception because there are no Cyrillic characters in Latin-1! Basically, I'm out of luck.
On the other hand, because in another example from above I'm only using characters that appear in extended ASCII, I can print the following string in the PythonWin console:
But if I try the exact same thing in a "DOS box" console, which evidently uses a different character set, I get crud:
In order for your Unicode web pages to look right, you have to make sure that any information you serve to web browsers goes along with the instruction to treat it as encoded Unicode (UTF-8 in most cases). There are a couple of ways to do this. The best is to configure your web server to specify an encoding in the header it sends along with your page. With Apache, you do this by adding an AddDefaultCharset line to your httpd.conf (see http://httpd.apache.org/docs-2
You can also embed tags in your pages that are intended to tip off the browser to the nature of the data. Such META tags are theoretically of a lower precedence than the web server's header, but they might prove useful for some browsers or situations.
You can easily verify whether your encoding directives are working by hitting your pages with a browser and then looking in the drop-down menus of the browser for the encoding option. If the correct encoding is selected (automatically) by your browser, then your header instructions are set properly.
If the browser is expecting the right encoding and your Python's default encoding is set to match, you can confidently write your Unicode string objects as output. For instance, with Webware, you simply use self.write() as normal, and whether your Python strings are ASCII or Unicode, the browser gets UTF-8 and correctly displays the results.
Convention dictates that a well-behaved browser will also return form input in whatever encoding you've specified for the page. That means that if you send a user a form on a UTF-8 page, whatever they type into the boxes will be returned to you in UTF-8. If it doesn't, you're in for an interesting ride, because most web browsers default to ISO-8859-1 (Latin-1) encoding, which is not actually a Unicode encoding, and is in any case incompatible with UTF-8. If you try to decode Latin-1 as UTF-8, you will raise an exception. For example:
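A minimal demonstration, reusing the name from later in the article (the exception class is UnicodeDecodeError on Python 2.3+, plain UnicodeError on older versions):
latin1_bytes = u'Bj\xf6rn'.encode('latin-1')   # 'Bj\xf6rn' -- what a Latin-1 browser would send
unicode(latin1_bytes, 'utf-8')                 # raises an exception: 0xf6 is not valid UTF-8 here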
Luckily, you can use Python's unicode() and .encode methods to translate to and from Latin-1, and you can use Python's try/except structure to prevent crashes. What you have to understand is that it's all left up to you, and that includes trapping any invalid data that tries to enter your program.
The last detail is the database. Every database has its unique handling of Unicode (or lack thereof.)
In theory, you can always store Unicode in its ASCII-encoded form in any relational database. The downside is that you're storing ASCII gobbledygook, so you will have an awkward time taking advantage of the powerful filtration features of the SQL language. If all you want to do is stash and retrieve data in bulk, this may not be a problem. However, if you ask the database more sophisticated questions, such as for a list of all the names that include "Björn," the database won't find any, unless you ask it to match "Bj\xc3\xb6rn" instead. You can probably work around this issue, but most modern relational databases are now supporting the storage and handling of Unicode transparently.
It happens that PostGreSQL (as of this writing) only supports UTF-8 natively in and out of the database, so that is what I use with it. Microsoft SQL Server – like everything else Microsoft makes – uses an elusive system called MBCS (Multi-byte Character System) which is built (exclusively) into Windows. Other RDBMS will have their own preferences. In my experience, the database itself isn't really much of an issue when it comes to Unicode. The issue is the middleware your application uses to communicate with that database.
With PostGreSQL, I use pyPgSQL as the database interface for my web applications. PyPgSQL does a lot for me with regard to Unicode. When properly configured, I can confidently rely on it to handle any Unicode encoding and decoding between my application and the database. That means I can INSERT and UPDATE data in the database with python Unicode strings and it just works. I can also SELECT from the database and I get back Unicode objects that I don't have to decode myself.
With Microsoft SQL Server, I use ADO as my database interface. ADO performs similarly for SQL Server as pyPgSQL does for PostGreSQL, although ADO is only available for python applications running on win32.