Dealing with spelling mistakes

I’m working on a long-text problem. I’m using Longformer, transformers & PyTorch.

The texts contain many spelling, punctuation and grammatical errors. I have generated a CSV that could be used as a dictionary; I have attached a screenshot of its head. You will note that words containing errors have a vector of 0, and after they are corrected they have a vector of 1. For example, on line 21, “all” is incorrectly captured as “.all.”. Prior to correction its count is 4; after correction, the count increases to 24,099.

Five operations were required to “clean up” all the errors.
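For context, here is a minimal sketch of how I apply the CSV as a correction dictionary before training. The column names `error` and `correction` are assumptions — substitute whatever the actual header row uses:

```python
import csv
import io

def load_corrections(csv_file):
    """Build an {error: correction} map from the CSV.
    Column names 'error' and 'correction' are assumed --
    adjust them to match the real header row."""
    reader = csv.DictReader(csv_file)
    return {row["error"]: row["correction"] for row in reader}

def apply_corrections(text, corrections):
    """Replace each erroneous string with its correction.
    Longer keys are replaced first so that '.all.' is handled
    before any shorter key it might contain."""
    for bad in sorted(corrections, key=len, reverse=True):
        text = text.replace(bad, corrections[bad])
    return text

# toy example mirroring the '.all.' -> 'all' case above
sample_csv = io.StringIO("error,correction\n.all.,all\n")
corrections = load_corrections(sample_csv)
print(apply_corrections("we fixed .all. of them", corrections))
# -> we fixed all of them
```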

I would like to use this to augment the pretrained tokenizer. My instinct was to use tokenizer.add_special_tokens(), but I have not been able to make it work.

I feel that this is clearly an issue, and that dealing with it will improve performance, but I am unclear on how best to do so. What is the best way to deal with spelling, punctuation and grammatical mistakes? Am I on the right track, or am I barking up the wrong tree? Any advice or guidance would be greatly appreciated.

Thanks in advance for answering my dumb newbie question :slight_smile: