Dealing with spelling mistakes

Hi,
I’m working on a long-text problem. I’m using Longformer, Transformers, and PyTorch.

There are many spelling, punctuation, and grammatical errors in the texts. I have generated a CSV that could be used as a dictionary; I have attached a screenshot of its head. You will note that words containing errors have vector 0, and after they are corrected they have vector 1. For example, on line 21, “all” is incorrectly captured as “.all.”. Prior to correction its count is 4; after correction, the count increases to 24,099.

Five operations were required to “clean up” all the errors.
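
To make the idea concrete, each row of the CSV maps an error form to its corrected form, so it could be applied as a plain replacement pass over the text before tokenization. A rough sketch (the file name and column names here are placeholders, not the real ones from the screenshot):

```python
import re

import pandas as pd

# Placeholder file and column names; the real ones are in the screenshot.
corrections = pd.read_csv("corrections.csv")
mapping = dict(zip(corrections["error"], corrections["corrected"]))

# One compiled pattern matching any known error form; longer forms first
# so that overlapping shorter forms don't shadow them.
pattern = re.compile(
    "|".join(re.escape(err) for err in sorted(mapping, key=len, reverse=True))
)

def clean(text: str) -> str:
    # Replace every error form with its correction in a single pass.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

# Assuming the CSV contains the pair ".all." -> "all":
print(clean("this captures .all. the text"))  # this captures all the text
```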

I would like to use this dictionary to augment the pretrained tokenizer. My instinct is to use tokenizer.add_special_tokens(), but I have not been able to make it work.
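
In case it helps to see it, here is roughly what I have been attempting (a minimal sketch; the token string is a placeholder, and the checkpoint is the standard allenai one):

```python
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# add_special_tokens() takes a dict; arbitrary extra tokens go under the
# "additional_special_tokens" key.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<placeholder_token>"]}
)

# The embedding matrix has to grow to match the new vocabulary size,
# otherwise the new token ids index past the end of the embeddings.
model.resize_token_embeddings(len(tokenizer))
```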

I feel that this is clearly an issue, and dealing with it will improve performance, but I am unclear on how best to do so. What is the best way to deal with spelling, punctuation, and grammatical mistakes? Am I on the right track, or am I barking up the wrong tree? Any advice or guidance would be greatly appreciated.

Thanks in advance for answering my dumb newbie question :slight_smile: