🌑

Stephen's Blog

Spelling Corrector from Scratch

 

Stephen Cheng

Intro

To a entry-level NLP learner, an industrial-strength spell corrector are quite complex, but writing a toy spelling corrector from scratch that achieves 80% or 90% accuracy at a processing speed of tens of words per second in a few dozens of lines of code is possible.

Here is the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('book.txt').read()))

def P(word, N=sum(WORDS.values())):
"Probability of `word`."
return WORDS[word] / N

def correction(word):
"Most probable spelling correction for word."
return max(candidates(word), key=P)

def candidates(word):
"Generate possible spelling corrections for word."
return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
"The subset of `words` that appear in the dictionary of WORDS."
return set(w for w in words if w in WORDS)

def edits1(word):
"All edits that are one edit away from `word`."
letters = 'abcdefghijklmnopqrstuvwxyz'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [L + R[1:] for L, R in splits if R]
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
inserts = [L + c + R for L, R in splits for c in letters]
return set(deletes + transposes + replaces + inserts)

def edits2(word):
"All edits that are two edits away from `word`."
return (e2 for e1 in edits1(word) for e2 in edits1(e1))

The book.txt dataset could be any English e-book.

The function correction(word) returns a likely spelling correction:

1
2
3
4
5
>>> correction('speling')
'spelling'

>>> correction('korrectud')
'corrected'

How does it work?

The above function uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Those words that are found more often in the frequency list are more likely the correct results.

The correction(A) function tries to choose the most likely spelling correction for A. There is no way to know for sure (for example, should “lates” be corrected to “late” or “latest” or “lattes” or …?), which suggests we use probabilities. We are trying to find the correction B, out of all possible candidate corrections, that maximizes the probability that B is the intended correction, given the original word A.

Reference

Peter Norvig’s blog

, — Jun 19, 2020

Search

    Made with ❤️ and ☀️ on Earth.