Hi!
I am working with sberbank-ai/rugpt3large_based_on_gpt2, a model trained on a Russian-language corpus.
I need to implement the function:
def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    loss = model(tensor_input, labels=tensor_input)[0]
    return math.exp(loss)
(taken from #473)
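For reference, this is how I load the checkpoint and call score (a minimal sketch, assuming the default from_pretrained settings):

import math
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = 'sberbank-ai/rugpt3large_based_on_gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# a multi-token sentence scores fine
print(score('привет, мир'))
# a single token is where I need to prepend <|endoftext|>:
# print(score('<|endoftext|>' + 'вот'))  # this is the call that fails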
For this function to work correctly on a single-token input, I need to prepend the special token <|endoftext|>.
But when I pass the string "<|endoftext|>token" to the tokenizer, it raises "ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object".
The same error also occurs when the input to the tokenizer contains multiple tokens.
The token <|endoftext|> is absent from the tokenizer's vocabulary, but it is present in the special tokens map.
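As a workaround I tried looking up the id of the special token directly instead of encoding the literal string (a sketch; I am not sure this is the intended usage, but it suggests where the None comes from):

# look up the id of <|endoftext|> directly
eos_id = tokenizer.convert_tokens_to_ids('<|endoftext|>')
print(eos_id)                  # I suspect this prints None, which would explain the ValueError
print(tokenizer.bos_token_id)  # same lookup via the special-tokens attribute

# if eos_id were a real id, the input tensor could be built manually:
# tensor_input = torch.tensor([[eos_id] + tokenizer.encode(single_token)])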
My questions:
- What am I doing wrong?
- How can I solve this problem?
- Why is the <|endoftext|> token absent from the dictionary?
This short test demonstrates my problem:
#!pip install transformers
import torch
from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel

# load the tokenizer for the checkpoint named above
tokenizer = GPT2Tokenizer.from_pretrained('sberbank-ai/rugpt3large_based_on_gpt2')

with torch.no_grad():
    # view of special tokens in the dictionary
    print('#' * 20, ' view of special tokens in dictionary ', '#' * 20)
    items = tokenizer.get_vocab().items()
    for item in items:
        if item[0].startswith('<') and item[0].endswith('>'):
            print(item)
    # view of the special tokens map
    print('#' * 20, ' map of special_tokens ', '#' * 20)
    print(tokenizer.special_tokens_map)
    # try to get the ids with <|endoftext|>
    print('#' * 20, " try to get the id's with <|endoftext|> ", '#' * 20)
    single_token = 'вот'
    single_token_with_eos = '<|endoftext|>' + single_token
    # error here!
    ids = tokenizer.encode(single_token_with_eos, return_tensors='pt')
    print('single_token_with_eos_id', tokenizer.encode(single_token_with_eos))
Output:
#################### view of special tokens in dictionary ####################
('<pad>', 0)
('<s>', 1)
('</s>', 2)
('<unk>', 3)
('<mask>', 4)
#################### map of special_tokens ####################
{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
#################### try to get the id's with <|endoftext|> ####################
Error here!
...