To download the dataset, click here.
When using this dataset, please cite our LREC 2020 paper; the citation is available here.
This dataset was created by introducing artificial typographical errors into existing datasets. It’s primarily intended for training supervised machine learning models for context-aware spelling correction and typographical error detection.
The dataset contains sentences from two existing datasets, each corrupted with artificially induced typographical errors. In addition to the clean and corrupted sentences, it provides two versions of corrections produced by a primitive spell corrector.
The typographical errors in our datasets are generated by a noise model based on real-world keystroke analysis. The noise model is derived from statistics computed over an existing corpus of about 39k pairs of typos and their corrections. This approach lets us generate a large number of training samples for typo correction or context-aware spelling correction without annotating a large number of sentences by hand.
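To make the idea concrete, here is a minimal sketch of how a character-level noise model could be estimated from typo-corrected pairs and then used to corrupt clean text. It is an illustration under simplifying assumptions, not the noise model from our paper: it handles only same-length substitutions, and the names `TOY_PAIRS`, `learn_substitutions`, and `corrupt` as well as the toy pairs themselves are invented for this example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy pairs (typo, correction); the real corpus has about 39k pairs.
TOY_PAIRS = [
    ("hwllo", "hello"),
    ("wprld", "world"),
    ("hello", "hello"),
    ("tezt", "test"),
]


def learn_substitutions(pairs):
    """Count, for each intended character, which characters were actually typed."""
    counts = defaultdict(Counter)
    for typo, correct in pairs:
        if len(typo) != len(correct):
            continue  # assumption: this sketch ignores insertions and deletions
        for typed, intended in zip(typo, correct):
            # Correctly typed characters are counted too, so the error rate
            # is learned from the data rather than set by hand.
            counts[intended][typed] += 1
    return counts


def corrupt(sentence, counts, rng):
    """Resample every character from the distribution observed for it."""
    out = []
    for ch in sentence:
        seen = counts.get(ch)
        if not seen:
            out.append(ch)  # character never observed: leave it unchanged
            continue
        typed_chars = list(seen.keys())
        weights = list(seen.values())
        out.append(rng.choices(typed_chars, weights=weights, k=1)[0])
    return "".join(out)


counts = learn_substitutions(TOY_PAIRS)
rng = random.Random(0)  # fixed seed so the corruption is reproducible
print(corrupt("hello world", counts, rng))
```

With so few pairs the learned error rate is unrealistically high; statistics from a large corpus would leave most characters untouched. Pairing each corrupted sentence with its clean source yields a supervised training example without manual annotation.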
For details on how the errors are generated, read our paper here.