I am looking for code examples on how to prepare text for a pretrained transformer backbone in both directions: tokenizing for the forward pass, and reversing the process (detokenizing) back to the original text.
The problem is that all of my datasets (including the validation set) are pre-labelled for scoring, and I believe it would help to clean the text so it is closer to what my pretrained transformers were trained on (Wikipedia text, I believe). That means expanding contractions, among other things.
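To make the cleaning step concrete, here is a minimal sketch of the kind of normalization I mean. The contraction table and the `expand_contractions` helper are hypothetical names I made up for illustration; a real pipeline would use a much fuller mapping (or a dedicated library):

```python
import re

# Hypothetical, deliberately tiny contraction table -- a real pipeline
# would use a comprehensive list. Keys are lowercase on purpose.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "I am",
}

# Match any known contraction, case-insensitively.
_pattern = re.compile(
    "|".join(re.escape(k) for k in CONTRACTIONS), flags=re.IGNORECASE
)

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
```

For example, `expand_contractions("I'm sure it's fine")` yields `"I am sure it is fine"`. Note that every replacement like this shifts character offsets, which is exactly what makes the reverse trip below tricky.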
The biggest challenge, of course, is the return trip: what techniques can I use to tokenize in such a way that predictions can be mapped back to regenerate the original labels? The text is very messy and requires quite a bit of cleansing.
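One approach I have been considering is to record a character-level alignment while cleaning, so that label spans on the original text can be projected onto the cleaned text (and back). This is only a sketch under my own assumptions, with whitespace collapsing standing in for the real cleaning; the function names are hypothetical:

```python
def clean_with_alignment(text: str):
    """Collapse runs of whitespace, keeping a map from each cleaned-text
    character back to its index in the original text."""
    cleaned_chars, new_to_old = [], []
    prev_space = False
    for i, ch in enumerate(text):
        if ch.isspace():
            # Drop leading whitespace and all but the first space in a run.
            if prev_space or not cleaned_chars:
                continue
            ch, prev_space = " ", True
        else:
            prev_space = False
        cleaned_chars.append(ch)
        new_to_old.append(i)
    return "".join(cleaned_chars), new_to_old

def project_span(old_start: int, old_end: int, new_to_old):
    """Map a character span on the original text to the cleaned text,
    returning None if the span was entirely removed by cleaning."""
    new_idx = [j for j, i in enumerate(new_to_old) if old_start <= i < old_end]
    return (new_idx[0], new_idx[-1] + 1) if new_idx else None
```

Once label spans live on the cleaned text, a fast Hugging Face tokenizer called with `return_offsets_mapping=True` gives character offsets per token, which closes the loop from tokens back to labels. I would welcome pointers on whether this two-stage alignment is a reasonable design or whether there is a standard tool for it.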
Any pointers would be greatly appreciated! Cheers