I am trying to add a custom token to the tokenizer.
I found this code in the source:
Args:
    new_tokens (:obj:`List[str]` or :obj:`List[tokenizers.AddedToken]`):
        Token(s) to add in vocabulary. A token is only added if it's not already in the vocabulary (tested by
        checking if the tokenizer assign the index of the ``unk_token`` to them).
    special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
        Whether or not the tokens should be added as special tokens.

Returns:
    :obj:`int`: The number of tokens actually added to the vocabulary.

Examples::

    # Let's see how to increase the vocabulary of Bert model and tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
    print('We have added', num_added_toks, 'tokens')
    # Note: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
    model.resize_token_embeddings(len(tokenizer))
"""
new_tokens = [str(tok) for tok in new_tokens]
but after I run

tokenizer.add_tokens('ss##e', special_tokens=True)

there is no change in the special tokens. I have tried several times, and the result seems to be the same whether special_tokens is True or False.
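For reference, this is roughly how I am checking whether anything changed (just a rough sketch with bert-base-uncased; 'ss##e' is only my test token, and I am only looking at the standard tokenizer attributes):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# add the custom token, asking for it to be treated as a special token
num_added = tokenizer.add_tokens('ss##e', special_tokens=True)
print(num_added)                        # 1 the first time it is added

# the declared special tokens look the same as before in both cases
print(tokenizer.special_tokens_map)     # no 'ss##e' here
print(tokenizer.all_special_tokens)     # no 'ss##e' here either

# the token is in the added vocabulary and is kept as one piece when tokenizing
print(tokenizer.get_added_vocab())      # e.g. {'ss##e': 30522}
print(tokenizer.tokenize('this is ss##e here'))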
I noticed that there is a special note about ALBERT:
# Make sure we don't split on any special tokens (even they were already in the vocab before e.g. for Albert)
if special_tokens:
    self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(new_tokens)))
else:
    # Or on the newly added tokens
    self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))
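If I read this right, both branches extend unique_no_split_tokens (with special_tokens=True it also covers tokens that were already in the vocab, which is the Albert case), so the only place I could think to look for a difference is that attribute itself. This is my rough check, assuming the slow BertTokenizer where unique_no_split_tokens exists:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens('ss##e', special_tokens=True)

# the new token shows up here whether special_tokens is True or False,
# since both branches union it into unique_no_split_tokens
print(tokenizer.unique_no_split_tokens)
print(tokenizer.tokenize('ss##e should stay in one piece'))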
Is there a difference between BERT and ALBERT that I don't know about, or is something wrong?