Friday, September 15, 2006

End to End Unicode Web Applications in Python

The following is a brief discussion of creating a web application with Python that uses Unicode. This discussion is not a thorough exploration of Unicode or Python's Unicode support. Rather, it is a purely practical overview that covers much of what most Python web developers need to know.

Python Unicode Objects

Unicode is a complex solution to a complex problem of meeting a simple need. The need is to permit software to handle the writing systems of (nearly) all the human languages of the world. The Unicode standard does this remarkably well, and most importantly, does it in such a way that you, the programmer, don't have to worry much about it.

What you do have to understand is that Unicode strings are multi-byte (binary) strings and therefore have some special requirements that ASCII strings do not. The good news is that you're using Python, which has a sensible approach to handling Unicode strings. Let's look at one:

>>> myString = 'This is a string' # this is a standard string
>>> myUnicodeString = u'This is a string' # this is a Unicode string

Python tries to treat Unicode strings as much like ASCII strings as possible. For the most part, if you have a Unicode string in Python, you can work with it exactly like you would an ASCII string. You can even mingle them. For example, if you concatenate the above variables, you'll get a Unicode string that looks like this:

>>> myString + myUnicodeString
u'This is a stringThis is a string'

Since one of the strings is Unicode, Python automatically converts the other to Unicode (decoding it with the default encoding) in the process of concatenation and returns a Unicode result. (Be sure to read section 3.1.3 of the Python tutorial for more examples and detail.) The great consequence here is that, internally, your code doesn't have to worry much about what's Unicode: it just works.
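The flip side of this automatic conversion is that it only works when the byte string is valid in the default encoding (plain ASCII, unless you change it as described later). Mixing a non-ASCII byte string with a Unicode string raises an exception; the exact message varies by Python version, but it looks something like this:

>>> 'caf\xe9' + u'!' # 0xe9 is not valid ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)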

Encodings

So far, we've looked at Unicode strings as live objects in Python. They are straightforward enough. The trick is actually getting the Unicode string in the first place, or sending it somewhere else (to storage, for instance) once you're done with it.

Unicode in its native form will not pass through many common interfaces, such as HTTP, because those interfaces are designed to carry streams of 8-bit bytes. Therefore, Unicode data is generally stored or transmitted through network systems in encoded form, as a plain byte string. There are many possible encodings to choose from. (The various encodings are documented in depth elsewhere.)

Encodings are a significant source of confusion for newcomers to Unicode. The common mistake is to think that an encoded string (UTF-8, for instance) is the same thing as Unicode, when it's actually just one of many possible ways to serialize Unicode into bytes. There is only one Unicode. (You can play around with the Unicode database through Python's unicodedata module.) There are many encodings, all of which point back to the one Unicode. Different encodings are more or less useful depending on your application.
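For instance, the unicodedata module will tell you the canonical name of any character, and look a character up by name:

>>> import unicodedata
>>> unicodedata.name(u'\u0411')
'CYRILLIC CAPITAL LETTER BE'
>>> unicodedata.lookup('CYRILLIC CAPITAL LETTER BE')
u'\u0411'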

In the web development context, there is only one encoding that will likely be of interest to you: UTF-8. For contrast, however, we will also look at UTF-16, another encoding that is particularly affiliated with XML. UTF-8 is the most common encoding in the web environment because ASCII characters pass through it unchanged, so encoded text looks a lot like the original (at least until you start encountering extended characters or any of the thousands of glyphs that aren't part of ASCII). Consequently, UTF-8 is perceived as friendlier than UTF-16 or other encodings. More importantly, UTF-8 is the Unicode encoding best supported by web browsers, which otherwise tend to support a large number of legacy non-Unicode encodings. UTF-16, on the other hand, looks like binary data. (Which it is.) Let's look at these two encodings.

>>> myUnicodeString.encode('utf-8')
'This is a string'
>>> myUnicodeString.encode('utf-16')
'\xff\xfeT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'

The important thing to note is that the result of calling the encode method is an ordinary 8-bit string (Python type str). We've taken a Unicode string and encoded it into bytes that can be stored or transmitted through any mechanism that handles byte streams, like the Web.
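You can watch the types change hands at the prompt:

>>> type(myUnicodeString)
<type 'unicode'>
>>> type(myUnicodeString.encode('utf-8'))
<type 'str'>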

For comparison, let's look at the encoded versions of the following string:

The following are some random Cyrillic characters: БГДЕЖЗЙФЮ

In UTF-8 (note the ASCII equivalents showing through):

'The following are some random Cyrillic characters: \xd0\x91\xd0\x93\xd0\x94\xd0\x95\xd0\x96\xd0\x97\xd0\x99\xd0\xa4\xd0\xae'

In UTF-16:

'\xff\xfeT\x00h\x00e\x00 \x00f\x00o\x00l\x00l\x00o\x00w\x00i\x00n\x00g\x00 \x00a\x00r\x00e\x00 \x00s\x00o\x00m\x00e\x00 \x00r\x00a\x00n\x00d\x00o\x00m\x00 \x00C\x00y\x00r\x00i\x00l\x00l\x00i\x00c\x00 \x00c\x00h\x00a\x00r\x00a\x00c\x00t\x00e\x00r\x00s\x00:\x00 \x00\x11\x04\x13\x04\x14\x04\x15\x04\x16\x04\x17\x04\x19\x04$\x04.\x04'

Now, let's decode these encoded strings at the Python command line:

>>> foo = unicode('\xff\xfeT\x00h\x00e\x00 \x00f\x00o\x00l\x00l\x00o\x00w\x00i\x00n\x00g\x00 \x00a\x00r\x00e\x00 \x00s\x00o\x00m\x00e\x00 \x00r\x00a\x00n\x00d\x00o\x00m\x00 \x00C\x00y\x00r\x00i\x00l\x00l\x00i\x00c\x00 \x00c\x00h\x00a\x00r\x00a\x00c\x00t\x00e\x00r\x00s\x00:\x00 \x00\x11\x04\x13\x04\x14\x04\x15\x04\x16\x04\x17\x04\x19\x04$\x04.\x04','utf-16')
>>> foo
u'The following are some random Cyrillic characters: \u0411\u0413\u0414\u0415\u0416\u0417\u0419\u0424\u042e'

When we decode the string as foo and look at it, we get a Unicode string with \u escape sequences standing in for the non-ASCII characters. The Python console (at least the one I'm using) doesn't implement a Unicode renderer, so it has to display escape codes for the non-ASCII glyphs. However, if this same original string had been decoded by a web browser or text editor that did implement a Unicode renderer, you'd see all the correct glyphs (provided the necessary fonts were available!)

So, in the process of looking at these examples, we've introduced the one method and one function Python provides for encoding and decoding with Unicode strings:

.encode( [encoding] ) returns an encoded 8-bit string in the specified encoding (codec); if no encoding is specified, this method assumes the encoding in sys.getdefaultencoding()
unicode( string [, encoding] ) decodes the supplied 8-bit string with the specified encoding (codec) and returns a Unicode string; if no encoding is specified, this function assumes the encoding in sys.getdefaultencoding()

In Python 2.2 and later, there's also a symmetric method for decoding (available only for 8-bit strings):

.decode( [encoding] ) if the specified encoding is a Unicode encoding, this method returns a Unicode string, just like the unicode function; if the specified encoding is not a Unicode encoding (if you specify the zlib codec, for instance), this method returns another appropriate data type; if no encoding is specified, this method assumes the encoding in sys.getdefaultencoding()
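Encoding and decoding are inverses, so a round trip through any Unicode codec gets you back exactly where you started:

>>> 'caf\xc3\xa9'.decode('utf-8')
u'caf\xe9'
>>> u'caf\xe9'.encode('utf-8').decode('utf-8') == u'caf\xe9'
True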

One of the nifty things about Python's encoding and decoding functions is that it's really easy to convert between encodings. For example, if we start with the following UTF-16, we can easily convert it to UTF-8 by decoding the UTF-16 and re-encoding it as UTF-8.

>>> spanishString = unicode('\xff\xfe\xbf\x00Q\x00u\x00\xe9\x00 \x00p\x00a\x00s\x00a\x00?\x00','utf-16') # starting with utf-16
>>> spanishString.encode('utf-8') # translating to UTF-8
'\xc2\xbfQu\xc3\xa9 pasa?'

Your Application and Unicode

Now, let's take a step back and hypothesize a web application that has the following fundamental components:

  1. a back-end database (PostgreSQL, for example)
  2. some Webware servlets that include at least one form
  3. Apache

You want this application to handle multi-lingual text, so you're going to take advantage of Unicode. The first thing you will probably want to do is set up a sitecustomize.py file in the Lib directory of your Python installation and designate a Unicode encoding (probably UTF-8) as the default encoding for Python:

# sitecustomize.py
import sys
sys.setdefaultencoding("utf-8")

Important: as of Python 2.2, you can only call the setdefaultencoding method from within sitecustomize.py. You cannot perform this step from within your application! The reason: Python's site.py startup script deletes setdefaultencoding from the sys module once initialization is complete, precisely so that a running program can't change the default encoding out from under strings that already exist.
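With the file in place, any new interpreter session will reflect the change:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'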

This setting has a profound effect on python execution because your programs will all automatically encode Unicode strings to this encoding whenever:

  1. a Unicode string is printed
  2. a Unicode string is written to a file
  3. a Unicode string is converted with str( )
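For example, with the default encoding set to UTF-8 as above, str() quietly produces UTF-8 bytes instead of raising an exception on non-ASCII characters:

>>> str(u'\xbfQu\xe9 pasa?')
'\xc2\xbfQu\xc3\xa9 pasa?'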

You can, of course, bypass the default encoding by manually encoding the string first with the .encode method, just as in the earlier examples.

If you don't set the default encoding to UTF-8, you will have to be rigorous about manually encoding Unicode data at appropriate times throughout your applications.

Note that the default encoding has little to do with decoding. (It merely serves as the default if you use the unicode function or decode method without specifying a codec.) You still must manually decode all encoded Unicode strings before you can use them. For example, if your servlet receives UTF-8 from a web browser POST, Apache will deliver that information as a plain 8-bit string of UTF-8 bytes, and your code will have to decode it as above with the unicode() function.

As of this writing, Webware does not meddle with decoding: it simply passes the POST through in the request object. If you are using dAlchemy's FormKit to handle web forms for your application, you can have FormKit automatically handle decoding. Otherwise, you need to find an appropriate place in your code to ensure that all incoming encoded Unicode gets decoded into Python Unicode objects before they get used for anything.
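Using the Spanish string from earlier as stand-in POST data, that decoding step looks like this:

>>> raw = '\xc2\xbfQu\xc3\xa9 pasa?' # UTF-8 bytes, as delivered in the POST
>>> unicode(raw, 'utf-8')
u'\xbfQu\xe9 pasa?'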

Encoding Hell

This brings up an important point that will haunt you as you start working with Unicode. It can be difficult to debug Unicode problems because one's development tools usually do not themselves implement Unicode rendering, or they only do so partially (which can be even worse!) You may not be able to trust what you see. For example, just because it looks "wrong" on the console doesn't mean it will look "wrong" in a web browser, properly decoded.

Now, when we try to print foo (above) in the console, which coerces the Unicode through the default encoding (UTF-8), we get a different kind of gibberish:

>>> print foo
The following are some random Cyrillic characters: Ð'Ð"Ð"Ð•Ð–Ð—Ð™Ð¤Ð®

Here, the UTF-8 bytes are being incorrectly interpreted by the console as extended ASCII characters. The result is garbage. (Your results may vary depending on the console you're using.) Knowing that my Python console does support extended ASCII (basically Latin-1), I could try encoding the string as Latin-1 and printing the result:

>>> print foo.encode('latin-1')
Traceback (most recent call last):
File "", line 1, in ?
UnicodeError: Latin-1 encoding error: ordinal not in range(256)

The encoding attempt fails with an exception because there are no Cyrillic characters in Latin-1! Basically, I'm out of luck.
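Well, almost. If a lossy result is better than a crash, the encode method takes an optional second argument that tells the codec what to do with characters it can't represent:

>>> print foo.encode('latin-1', 'replace') # unencodable characters become '?'
The following are some random Cyrillic characters: ?????????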

On the other hand, another example from above uses only characters that do appear in extended ASCII, so I can print that string in the PythonWin console:

>>> print spanishString.encode('latin-1')
¿Qué pasa?

But if I try the exact same thing in a "DOS box" console, which evidently uses a different character set, I get crud:

>>> print spanishString.encode('latin-1')
┐QuΘ pasa?
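Python ships codecs for the old IBM PC code pages too, so once you know what the console actually speaks, you can target it directly. Assuming the DOS box is using code page 437 (the classic U.S. default, which the box-drawing characters above suggest):

>>> print spanishString.encode('cp437')
¿Qué pasa?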

Browser Behavior

In order for your Unicode web pages to look right, you have to make sure that any information you serve to web browsers goes along with the instruction to treat it as encoded Unicode (UTF-8 in most cases). There are a couple of ways to do this. The best is to configure your web server to specify an encoding in the header it sends along with your page. With Apache, you do this by adding an AddDefaultCharset line to your httpd.conf (see http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset ), such as:

AddDefaultCharset utf-8

You can also embed tags in your pages that are intended to tip off the browser to the nature of the data. Such META tags are theoretically of a lower precedence than the web server's header, but they might prove useful for some browsers or situations.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

You can easily verify whether your encoding directives are working by hitting your pages with a browser and then looking in the drop-down menus of the browser for the encoding option. If the correct encoding is selected (automatically) by your browser, then your header instructions are set properly.

If the browser is expecting the right encoding and your Python's default encoding is set to match, you can confidently write your Unicode string objects as output. For instance, with Webware, you simply use self.write() as normal, and whether your Python strings are ASCII or Unicode, the browser gets UTF-8 and correctly displays the results.
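A minimal sketch of that, assuming Webware's usual Page/writeContent idiom (the class name and markup here are hypothetical):

from WebKit.Page import Page

class HolaPage(Page):
    def writeContent(self):
        # a Unicode object; with the default encoding set to UTF-8,
        # it is encoded automatically on its way to the browser
        self.write(u'<p>\xbfQu\xe9 pasa?</p>')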

Convention dictates that a well-behaved browser will also return form input in whatever encoding you've specified for the page. That means that if you send a user a form on a UTF-8 page, whatever they type into the boxes will be returned to you in UTF-8. If the browser doesn't honor that convention, you're in for an interesting ride, because most web browsers default to ISO-8859-1 (Latin-1) encoding, which is not actually a Unicode encoding, and is in any case incompatible with UTF-8. If you try to decode Latin-1 as UTF-8, you will raise an exception. For example:

>>> es3 = 'This is \xe4 string.' # ISO-8859-1/Latin-1 string
>>> es3.decode('utf-8')
Traceback (most recent call last):
File "", line 1, in ?
UnicodeError: UTF-8 decoding error: invalid data

Luckily, you can use Python's unicode() function and the .encode()/.decode() methods to translate to and from Latin-1, and you can use Python's try/except structure to prevent crashes. What you have to understand is that it's all left up to you, and that includes trapping any invalid data that tries to enter your program.
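Here's a hypothetical helper along those lines, trying UTF-8 first and falling back to Latin-1 (which can never fail, since every possible byte is a valid Latin-1 character):

def safeDecode(raw):
    # Decode browser input, tolerating clients that sent Latin-1
    # instead of the UTF-8 we asked for. (Hypothetical helper.)
    try:
        return unicode(raw, 'utf-8')
    except UnicodeError:
        # every byte value maps to a Latin-1 character, so this can't raise
        return unicode(raw, 'latin-1')

It shrugs off the bad input from above:

>>> safeDecode('This is \xe4 string.')
u'This is \xe4 string.'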

Databases

The last detail is the database. Every database has its unique handling of Unicode (or lack thereof.)

In theory, you can always store Unicode in its encoded form in any relational database. The downside is that you're storing gobbledygook bytes, so you will have an awkward time taking advantage of SQL's filtering features. If all you want to do is stash and retrieve data in bulk, this may not be a problem. However, if you ask the database more sophisticated questions, such as for a list of all the names that include "Björn," the database won't find any, unless you ask it to match "Bj\xc3\xb6rn" instead. You can probably work around this issue, but most modern relational databases now support the storage and handling of Unicode transparently.
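The mismatch is easy to see at the prompt:

>>> u'Bj\xf6rn'.encode('utf-8') # what actually sits in the table
'Bj\xc3\xb6rn'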

It happens that PostgreSQL (as of this writing) only supports UTF-8 natively in and out of the database, so that is what I use with it. Microsoft SQL Server, like everything else Microsoft makes, uses an elusive system called MBCS (Multi-Byte Character Set), which is built (exclusively) into Windows. Other RDBMSs will have their own preferences. In my experience, the database itself isn't really much of an issue when it comes to Unicode. The issue is the middleware your application uses to communicate with that database.

With PostgreSQL, I use pyPgSQL as the database interface for my web applications. pyPgSQL does a lot for me with regard to Unicode. When properly configured, I can confidently rely on it to handle any Unicode encoding and decoding between my application and the database. That means I can INSERT and UPDATE data in the database with Python Unicode strings and it just works. I can also SELECT from the database and get back Unicode objects that I don't have to decode myself.

>>> from pyPgSQL import PgSQL
>>> db = PgSQL.connect( dsn=source, user=user, password=password, database=catalog, client_encoding=('utf-8','ignore'), unicode_results=1 )
>>> c = db.cursor()
>>> c.execute('SET CLIENT_ENCODING TO UNICODE')
>>> query = u"UPDATE myTable SET text = '%s' WHERE id=52;" % u'\xbfQu\xe9 pasa?' # copy some Spanish into a cell
>>> c.execute(query)
>>> db.commit() # pyPgSQL follows the DB-API, so the change isn't permanent until committed
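Reading it back is just as painless; with unicode_results set as above, the value should come back as a Unicode object without any decoding on my part (a sketch continuing the same session):

>>> c.execute('SELECT text FROM myTable WHERE id=52')
>>> c.fetchone()[0]
u'\xbfQu\xe9 pasa?'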

With Microsoft SQL Server, I use ADO as my database interface. ADO performs for SQL Server much as pyPgSQL does for PostgreSQL, although ADO is only available to Python applications running on win32.

>>> import pythoncom
>>> from win32com.client.dynamic import Dispatch
>>> pythoncom.CoInitialize() # initialize COM for this thread
>>> connectionString = "Provider=SQLOLEDB.1;Persist Security Info=False;User Id=%s;Password=%s;Initial Catalog=%s;Data Source=%s;" % ( user, password, catalog, source )
>>> db = Dispatch('ADODB.Connection')
>>> db.Open(connectionString)
>>> db.Execute(query)
