Monday, April 09, 2007

python spell checker

How to Write a Spelling Corrector
In the past week, two friends (Dean and Bill) independently told me they were amazed at how Google does spelling correction so well and quickly. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. What surprised me is that I thought Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about statistical language processing problems such as spelling correction. But they didn't, and come to think of it, there's no reason they should. I figured they and many others could benefit from an explanation, and so on the plane back from my trip I wrote a toy spelling corrector, which I now share.

Let's get right to it. I figured that in less than a plane flight, and in less than a page of code, I could write a spelling corrector that achieves 80 or 90% accuracy at a rate of at least 10 words per second. And in fact, here, in 20 lines of Python 2.5 code, is the complete spelling corrector:

import re, string, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model

NWORDS = train(words(file('Documents/holmes.txt').read()))

def edits1(word):
n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + ## deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + ## transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] + ## alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in string.lowercase]) ## insertion

def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
return max(known([word]) or known(edits1(word)) or known_edits2(word) or [word],
key=lambda w: NWORDS[w])

This defines the function correct, which takes a word as input and returns a likely correction of that word. For example:

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'

1 comment:

Biz said...

Thank you for sharing this piece of code.
I may be trying to integrate GUI with this piece.