Custom Dataset with Custom Tokenizer

I trained a BPE tokenizer on WikiText and now I'm trying to use this tokenizer on a custom dataset loaded from a CSV file. What I want is to add the tokenizer's outputs as feature columns in my dataset, but dataset.map is giving an error.

You should just use the tokenizer's __call__ method: tokenizer(example["text"]).

When I train the tokenizer following this Quicktour — tokenizers documentation,
the resulting Tokenizer object doesn't implement __call__.

Oh, you should wrap your tokenizer in a PreTrainedTokenizerFast from the Transformers library (you can just pass your tokenizer with the tokenizer_object keyword argument).
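Concretely, something along these lines (a sketch: the training corpus is in-memory here instead of the wiki-text files, and the special tokens are assumptions):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Train a small BPE tokenizer, as in the tokenizers quicktour
bpe = Tokenizer(BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"])
bpe.train_from_iterator(["some training text", "more training text"], trainer=trainer)

# Wrap it: PreTrainedTokenizerFast provides __call__, padding, truncation, etc.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=bpe,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

encoded = tokenizer("some more text", truncation=True)
print(encoded["input_ids"])
```

If you already saved the trained tokenizer to disk, you can load it with Tokenizer.from_file("tokenizer.json") instead of retraining before wrapping it.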
