Is_pretokenized argument for tokenizer doesn't work?

Or maybe I’m misunderstanding how it’s supposed to be used. I’m tokenizing my whole text and then dividing the tokenized text into chunks of at most 126 tokens (all but the last chunk contain exactly 126). Then I pass the list of chunks to the tokenizer:

batch_encoding = tokenizer(text_chunks, add_special_tokens=True, padding='max_length', max_length=max_seq_len, is_pretokenized=True, return_attention_mask=True)
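For context, the chunks come from something roughly like this (simplified sketch; bert-base-uncased is just an example checkpoint and long_text stands in for my document):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
max_seq_len = 128
long_text = "..."  # the full document text

# Tokenize the whole text into wordpieces, then slice into chunks of at most
# 126 tokens so that [CLS] and [SEP] still fit within max_seq_len.
tokens = tokenizer.tokenize(long_text)
text_chunks = [tokens[i:i + 126] for i in range(0, len(tokens), 126)]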

What I want it to do is simply add the special tokens [CLS], [SEP], and [PAD] (if needed), and then turn the wordpieces into input_ids. Instead, it tokenizes the wordpieces again:

['ka', '##vana', '##ugh']

is turned into

['ka', '#', '#', 'van', '##a', '#', '#', 'u', '##gh'].

Hi there! The argument is_pretokenized is for when your inputs have been pre-tokenized, that is, split into words (since the tokenizers of transformers are all subword tokenizers). For instance, you should pass your tokenizer ["I", "am", "talking", "about", "Kavanaugh"] with is_pretokenized=True, not ["I", "am", "talk", "##ing", "about", "Ka", "##vana", "##ugh"].
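Something along these lines is the intended usage (a quick sketch, using bert-base-uncased as an example checkpoint):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Inputs split into words (not wordpieces); the tokenizer applies the
# wordpiece algorithm to each word itself.
words = ["I", "am", "talking", "about", "Kavanaugh"]
encoding = tokenizer(words, is_pretokenized=True, add_special_tokens=True,
                     padding='max_length', max_length=128,
                     return_attention_mask=True)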

If you just want to add the special tokens, I believe the method you want is tokenizer.prepare_for_model.
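Roughly like this (untested sketch; the exact padding arguments may differ a bit depending on your version):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Convert the existing wordpieces straight to ids, then let prepare_for_model
# add [CLS]/[SEP] and pad, without re-tokenizing anything.
ids = tokenizer.convert_tokens_to_ids(['ka', '##vana', '##ugh'])
encoded = tokenizer.prepare_for_model(ids, add_special_tokens=True,
                                      max_length=128, padding='max_length')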

I could make this clearer in the docs, but you’re far from the first user to be confused by it, so I think we’re going to rename this argument to something less confusing.
