To a entry-level NLP learner, an industrial-strength spell corrector are quite complex, but writing a toy spelling corrector from scratch that achieves 80% or 90% accuracy at a processing speed of tens of words per second in a few dozens of lines of code is possible.
defP(word, N=sum(WORDS.values())): "Probability of `word`." return WORDS[word] / N
defcorrection(word): "Most probable spelling correction for word." return max(candidates(word), key=P)
defcandidates(word): "Generate possible spelling corrections for word." return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
defknown(words): "The subset of `words` that appear in the dictionary of WORDS." return set(w for w in words if w in WORDS)
defedits1(word): "All edits that are one edit away from `word`." letters = 'abcdefghijklmnopqrstuvwxyz' splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [L + R[1:] for L, R in splits if R] transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1] replaces = [L + c + R[1:] for L, R in splits if R for c in letters] inserts = [L + c + R for L, R in splits for c in letters] return set(deletes + transposes + replaces + inserts)
defedits2(word): "All edits that are two edits away from `word`." return (e2 for e1 in edits1(word) for e2 in edits1(e1))
The book.txt dataset could be any English e-book.
The function correction(word) returns a likely spelling correction:
1 2 3 4 5
>>> correction('speling') 'spelling'
>>> correction('korrectud') 'corrected'
How does it work?
The above function uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Those words that are found more often in the frequency list are more likely the correct results.
The correction(A) function tries to choose the most likely spelling correction for A. There is no way to know for sure (for example, should “lates” be corrected to “late” or “latest” or “lattes” or …?), which suggests we use probabilities. We are trying to find the correction B, out of all possible candidate corrections, that maximizes the probability that B is the intended correction, given the original word A.