Forward and reverse detokenizing

I am looking for code examples showing how to train with backbone transformers while both tokenizing in the forward direction and reversing (detokenizing) afterwards.

The problem is that all my datasets (including validation) are pre-labelled for scoring, and I believe it would help if I made the text cleaner and more similar to what my pretrained transformers were trained on (Wikipedia text, I believe). This means expanding contractions, among other things.
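
For the cleaning step, this is the kind of thing I have in mind (a minimal sketch; the contraction map is deliberately tiny, and a dedicated library would cover far more cases):

```python
import re

# Hand-rolled contraction map -- illustrative only, nowhere near exhaustive.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}

def expand_contractions(text: str) -> str:
    # Replace longer patterns first so "can't" isn't split into "ca" + " not".
    for pattern in sorted(CONTRACTIONS, key=len, reverse=True):
        text = re.sub(re.escape(pattern), CONTRACTIONS[pattern], text, flags=re.IGNORECASE)
    return text

print(expand_contractions("I can't believe they're passing"))
# -> "I cannot believe they are passing"
```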

The biggest challenge, of course, is the reverse trip: what techniques can be used to tokenize in such a way that we can map predictions back and regenerate the original labels? The text is very messy and requires quite a bit of cleaning.
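
To make the round trip concrete, here is roughly what I'm picturing, assuming a fast tokenizer so that `return_offsets_mapping` is available: record a character-level alignment while cleaning, then invert each token's offsets through it back to the original text. `clean_with_alignment` is a hypothetical helper that only handles "n't" for brevity:

```python
from transformers import AutoTokenizer

def clean_with_alignment(text):
    """Expand "n't" contractions while recording, for every character of the
    cleaned text, the index of the original character it came from."""
    cleaned, align = [], []
    i = 0
    while i < len(text):
        if text[i:i + 3].lower() == "n't":
            # "do" + "n't" -> "do" + " not"; each cleaned character points
            # back into the original "n't" span so offsets can be inverted.
            for ch, j in zip(" not", (i, i, i + 1, i + 2)):
                cleaned.append(ch)
                align.append(j)
            i += 3
        else:
            cleaned.append(text[i])
            align.append(i)
            i += 1
    return "".join(cleaned), align

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer

original = "They don't match the labels"
cleaned, align = clean_with_alignment(original)

enc = tokenizer(cleaned, return_offsets_mapping=True)
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    if start == end:  # special tokens such as [CLS]/[SEP] have empty spans
        continue
    # Project the token's span in the cleaned text back onto the original,
    # so character-indexed labels on the raw text line up with predictions.
    print(token, "->", repr(original[align[start]:align[end - 1] + 1]))
```

With the original character spans recovered this way, the labels attached to the raw text can be matched to token-level predictions directly, which is the "regenerate the original labels" part I'm after.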

Any pointers would be greatly appreciated! Cheers

Another approach is augmentation, e.g. the recipe in “What’s in the Dataset object — datasets 1.11.0 documentation”.

The idea there is that I train/predict on augmented copies of the dataset so that I retain the originals. Is this a typical approach?
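
Concretely, I was imagining something like the following with `Dataset.map`, writing the cleaned text to a new column so the original rows are retained (the inline `.replace` calls are just a stand-in for real cleaning):

```python
from datasets import Dataset

ds = Dataset.from_dict({
    "text": ["They don't match", "We can't score this"],
    "label": [0, 1],
})

# Add the cleaned text as a *new* column so the original rows survive;
# train/predict on `clean_text` while `text` and `label` stay untouched.
augmented = ds.map(
    lambda ex: {"clean_text": ex["text"].replace("can't", "cannot").replace("n't", " not")}
)

print(augmented[0])
# {'text': "They don't match", 'label': 0, 'clean_text': 'They do not match'}
```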

Any feedback appreciated :slight_smile: