Forward and reverse detokenizing

I am looking for code examples showing how to train with backbone transformers while both tokenizing in the forward direction and reversing (detokenizing) afterwards.

The problem is that all my datasets (including validation) are pre-labelled for scoring, and I believe it would help if I made the text cleaner and more similar to what my pretrained transformers were trained on (Wikipedia text, I believe). This means expanding contractions, among other things.
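
For the cleaning step, this is the kind of thing I have in mind (a minimal sketch; the contraction map is deliberately tiny, and a dedicated library would cover far more cases):

```python
import re

# Hand-rolled contraction map -- illustrative only, nowhere near exhaustive.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}

def expand_contractions(text: str) -> str:
    # Replace longer patterns first so "can't" isn't split into "ca" + " not".
    for pattern in sorted(CONTRACTIONS, key=len, reverse=True):
        text = re.sub(re.escape(pattern), CONTRACTIONS[pattern], text, flags=re.IGNORECASE)
    return text

print(expand_contractions("I can't believe they're passing"))
# -> "I cannot believe they are passing"
```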

The biggest challenge, of course, is the reverse trip: what techniques can be used to tokenize in such a way that we can map predictions back and regenerate the original labels? The text is very messy and requires quite a bit of cleaning.
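
To make the round trip concrete, here is roughly what I'm picturing, assuming a fast tokenizer so that `return_offsets_mapping` is available: record a character-level alignment while cleaning, then invert each token's offsets through it back to the original text. `clean_with_alignment` is a hypothetical helper that only handles "n't" for brevity:

```python
from transformers import AutoTokenizer

def clean_with_alignment(text):
    """Expand "n't" contractions while recording, for every character of the
    cleaned text, the index of the original character it came from."""
    cleaned, align = [], []
    i = 0
    while i < len(text):
        if text[i:i + 3].lower() == "n't":
            # "do" + "n't" -> "do" + " not"; each cleaned character points
            # back into the original "n't" span so offsets can be inverted.
            for ch, j in zip(" not", (i, i, i + 1, i + 2)):
                cleaned.append(ch)
                align.append(j)
            i += 3
        else:
            cleaned.append(text[i])
            align.append(i)
            i += 1
    return "".join(cleaned), align

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer

original = "They don't match the labels"
cleaned, align = clean_with_alignment(original)

enc = tokenizer(cleaned, return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    if start == end:  # special tokens such as [CLS]/[SEP] have empty spans
        continue
    # Project the token's span in the cleaned text back onto the original,
    # so character-indexed labels on the raw text line up with predictions.
    print(token, "->", repr(original[align[start]:align[end - 1] + 1]))
```

With the original character spans recovered this way, the labels attached to the raw text can be matched to token-level predictions directly, which is the "regenerate the original labels" part I'm after.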

Any pointers would be greatly appreciated! Cheers

Another approach is augmentation, e.g. the recipe in “What’s in the Dataset object — datasets 1.11.0 documentation”.

The idea there is that I train/predict on augmented copies of the dataset so that I retain the originals. Is this a typical approach?
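
Concretely, I was imagining something like the following with `Dataset.map`, writing the cleaned text to a new column so the original rows are retained (the inline `.replace` calls are just a stand-in for real cleaning):

```python
from datasets import Dataset

ds = Dataset.from_dict({
    "text": ["They don't match", "We can't score this"],
    "label": [0, 1],
})

# Add the cleaned text as a *new* column so the original rows survive;
# train/predict on `clean_text` while `text` and `label` stay untouched.
augmented = ds.map(
    lambda ex: {"clean_text": ex["text"].replace("can't", "cannot").replace("n't", " not")}
)

print(augmented[0])
# {'text': "They don't match", 'label': 0, 'clean_text': 'They do not match'}
```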

Any feedback appreciated :slight_smile: