Is_pretokenized argument for tokenizer doesn't work?

Or maybe I’m misunderstanding how it’s supposed to be used. I’m tokenizing my whole text and then dividing the tokenized text into chunks of at most 126 tokens (all but the last chunk contain exactly 126). Then I pass the list of chunks to the tokenizer:

batch_encoding = tokenizer(text_chunks, add_special_tokens=True, padding='max_length', max_length=max_seq_len, is_pretokenized=True, return_attention_mask=True)
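For context, the chunks come from something roughly like this (simplified sketch; bert-base-uncased is just an example checkpoint and long_text stands in for my document):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
max_seq_len = 128
long_text = "..."  # the full document text

# Tokenize the whole text into wordpieces, then slice into chunks of at most
# 126 tokens so that [CLS] and [SEP] still fit within max_seq_len.
tokens = tokenizer.tokenize(long_text)
text_chunks = [tokens[i:i + 126] for i in range(0, len(tokens), 126)]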

What I want it to do is simply add the special tokens [CLS], [SEP], and [PAD] (if needed), and then turn the wordpieces into input_ids. Instead, it tokenizes the wordpieces again:

['ka', '##vana', '##ugh']

is turned into

['ka', '#', '#', 'van', '##a', '#', '#', 'u', '##gh'].

Hi there! The argument is_pretokenized is for when your inputs have been pre-tokenized, that is, split into words (since the tokenizers of transformers are all subword tokenizers). For instance, you should pass your tokenizer ["I", "am", "talking", "about", "Kavanaugh"] with is_pretokenized=True, not ["I", "am", "talk", "##ing", "about", "Ka", "##vana", "##ugh"].
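Something along these lines is the intended usage (a quick sketch, using bert-base-uncased as an example checkpoint):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Inputs split into words (not wordpieces); the tokenizer applies the
# wordpiece algorithm to each word itself.
words = ["I", "am", "talking", "about", "Kavanaugh"]
encoding = tokenizer(words, is_pretokenized=True, add_special_tokens=True,
                     padding='max_length', max_length=128,
                     return_attention_mask=True)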

If you just want to add the special tokens, I believe the method you want is tokenizer.prepare_for_model.
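Roughly like this (untested sketch; the exact padding arguments may differ a bit depending on your version):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Convert the existing wordpieces straight to ids, then let prepare_for_model
# add [CLS]/[SEP] and pad, without re-tokenizing anything.
ids = tokenizer.convert_tokens_to_ids(['ka', '##vana', '##ugh'])
encoded = tokenizer.prepare_for_model(ids, add_special_tokens=True,
                                      max_length=128, padding='max_length')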

I could make this clearer in the docs, but you’re far from the first user to be confused by it, so I think we’re going to rename this argument to something less confusing.
