Soft-Masked BERT is a novel neural architecture that addresses the aforementioned issue. It consists of a network for error detection and a network for error correction based on BERT, with the former connected to the latter through what the authors call a soft-masking technique. The method is general: it can be employed in other language detection-correction problems, not just the CSC (Chinese Spelling Correction) domain for which it was proposed in the original paper.
Soft-Masked BERT is composed of a detection network based on Bi-GRU and a correction network based on BERT. The detection network predicts the probabilities of errors and the correction network predicts the probabilities of error corrections, while the former passes its prediction results to the latter using soft masking.
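As a rough sketch of what the detection network computes, here is a toy NumPy version of a bidirectional GRU followed by a per-character sigmoid. All dimensions, weights, and names are made up for illustration; the real model is trained, not randomly initialized:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, seq_len = 8, 6, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, bz, Wr, Ur, br, Wn, Un, bn):
    z = sigmoid(Wz @ x + Uz @ h + bz)        # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)        # reset gate
    n = np.tanh(Wn @ x + Un @ (r * h) + bn)  # candidate state
    return (1 - z) * n + z * h

def make_params():
    # weights for the three gates: (W, U, b) x 3, small random values
    return [rng.normal(scale=0.1, size=s)
            for s in [(d_hid, d_emb), (d_hid, d_hid), (d_hid,)] * 3]

fwd, bwd = make_params(), make_params()
w_out = rng.normal(scale=0.1, size=2 * d_hid)
b_out = 0.0

def detect(embeddings):
    """Return a per-character error probability p_i for the sequence."""
    T = len(embeddings)
    h_f, h_b = np.zeros((T, d_hid)), np.zeros((T, d_hid))
    h = np.zeros(d_hid)
    for t in range(T):                       # left-to-right pass
        h = gru_cell(embeddings[t], h, *fwd)
        h_f[t] = h
    h = np.zeros(d_hid)
    for t in reversed(range(T)):             # right-to-left pass
        h = gru_cell(embeddings[t], h, *bwd)
        h_b[t] = h
    states = np.concatenate([h_f, h_b], axis=1)
    return sigmoid(states @ w_out + b_out)   # probability of error per position

probs = detect(rng.normal(size=(seq_len, d_emb)))
```

Each position's probability combines the forward and backward hidden states, so the detector can use context on both sides of a character when judging whether it is an error.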
The model first creates an embedding for each character in the input sentence, referred to as the input embedding. Next, the detection network takes the sequence of embeddings as input and outputs the probability of error for each character. The model then computes, for each position, the weighted sum of the input embedding and the [MASK] embedding, weighted by the error probability. These computed embeddings mask the likely errors in the sequence in a soft way. The correction network then takes the sequence of soft-masked embeddings as input and outputs the probabilities of error corrections; it is a BERT model whose final layer consists of a softmax function over all characters. There is also a residual connection between the input embeddings and the embeddings at the final layer.
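The soft-masking step above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up dimensions and random values, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_emb = 5, 8

input_embs = rng.normal(size=(seq_len, d_emb))  # e_i for each character
mask_emb = rng.normal(size=d_emb)               # embedding of the [MASK] token
p_err = rng.uniform(size=seq_len)               # error probabilities from the detector

# soft-masked embedding: e'_i = p_i * e_mask + (1 - p_i) * e_i
soft_masked = p_err[:, None] * mask_emb + (1 - p_err[:, None]) * input_embs
```

A character with p_i close to 1 is almost fully replaced by the [MASK] embedding, while one with p_i close to 0 passes through essentially unchanged; this is what makes the masking "soft" rather than a hard replacement.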
Unlike the original Soft-Masked BERT paper, which runs the model on a Chinese dataset, here we modify the code slightly and apply it to an English dataset.
The data for this project consists of 20 popular books from Project Gutenberg.
```
pip install -r requirements.txt
```
The length of each sentence is between 4 and 200, so we set:

```
max_len = 32
```
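A minimal sketch of how such length constraints might be applied during preprocessing; the function names and thresholds here are illustrative, not the repository's actual code:

```python
MIN_LEN, MAX_LEN = 4, 200  # sentence-length bounds used when filtering the corpus
max_len = 32               # maximum token sequence length fed to the model

def keep_sentence(sentence: str) -> bool:
    """Keep only sentences whose length falls within the corpus bounds."""
    return MIN_LEN <= len(sentence) <= MAX_LEN

def truncate_tokens(tokens: list) -> list:
    """Clip a token sequence to the model's maximum input length."""
    return tokens[:max_len]

sentences = ["Hi.", "It was the best of times, it was the worst of times."]
kept = [s for s in sentences if keep_sentence(s)]
```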
You can find the code on GitHub.
```
python data_prepare.py
python data_process.py
python train.py
python test.py
```
BERT, Spelling-Correction — Sep 6, 2020