Hi!
I am working with sberbank-ai/rugpt3large_based_on_gpt2, a model trained on a Russian-language corpus.
I need to implement the function:
def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    loss = model(tensor_input, labels=tensor_input)[0]
    return math.exp(loss)
(taken from #473)
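For reference, this is how I load the checkpoint and call score (a minimal sketch, assuming the default from_pretrained settings):

import math
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = 'sberbank-ai/rugpt3large_based_on_gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# a multi-token sentence scores fine
print(score('привет, мир'))
# a single token is where I need to prepend <|endoftext|>:
# print(score('<|endoftext|>' + 'вот'))  # this is the call that fails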
For this function to work correctly on a single-token input, I need to prepend the special token <|endoftext|>.
But when I pass the string "<|endoftext|>token" to the tokenizer, it raises "ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object".
The same error also occurs when the input to the tokenizer contains multiple tokens.
The token <|endoftext|> is absent from the tokenizer's vocabulary, but it is present in the special tokens map.
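As a workaround I tried looking up the id of the special token directly instead of encoding the literal string (a sketch; I am not sure this is the intended usage, but it suggests where the None comes from):

# look up the id of <|endoftext|> directly
eos_id = tokenizer.convert_tokens_to_ids('<|endoftext|>')
print(eos_id)                  # I suspect this prints None, which would explain the ValueError
print(tokenizer.bos_token_id)  # same lookup via the special-tokens attribute

# if eos_id were a real id, the input tensor could be built manually:
# tensor_input = torch.tensor([[eos_id] + tokenizer.encode(single_token)])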
My questions:
- What am I doing wrong?
- How can I solve this problem?
- Why is the <|endoftext|> token absent from the dictionary?
This short test demonstrates my problem:
#!pip install transformers
import torch
from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel

# load the tokenizer for the checkpoint named above
tokenizer = GPT2Tokenizer.from_pretrained('sberbank-ai/rugpt3large_based_on_gpt2')

with torch.no_grad():
    # view of special tokens in the dictionary
    print('#' * 20, ' view of special tokens in dictionary ', '#' * 20)
    items = tokenizer.get_vocab().items()
    for item in items:
        if item[0].startswith('<') and item[0].endswith('>'):
            print(item)
    # view of the special tokens map
    print('#' * 20, ' map of special_tokens ', '#' * 20)
    print(tokenizer.special_tokens_map)
    # try to get the ids with <|endoftext|>
    print('#' * 20, " try to get the id's with <|endoftext|> ", '#' * 20)
    single_token = 'вот'
    single_token_with_eos = '<|endoftext|>' + single_token
    # error here!
    ids = tokenizer.encode(single_token_with_eos, return_tensors='pt')
    print('single_token_with_eos_id', tokenizer.encode(single_token_with_eos))
Output:
#################### view of special tokens in dictionary ####################
('<pad>', 0)
('<s>', 1)
('</s>', 2)
('<unk>', 3)
('<mask>', 4)
#################### map of special_tokens ####################
{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
#################### try to get the id's with <|endoftext|> ####################
Error here!
...