I tried to use codet5-large for masked token prediction, but the tokenizer cannot be loaded. How can I do this with CodeT5 (not CodeT5+)?
from transformers import T5ForConditionalGeneration, AutoTokenizer, RobertaTokenizer, T5Tokenizer
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
path = "/hdd/codet5-large/"  # local copy cloned from Hugging Face; I can load from this path as well
tokenizer = T5Tokenizer.from_pretrained('Salesforce/codet5-large')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-large').to(device)
text = "public static void main(String[] args) {\n int <extra_id_0> = 42;\n System.out.println(<extra_id_1>);\n}"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
masked_index_1 = torch.where(input_ids == tokenizer.mask_token_id)[1]
with torch.no_grad():
    outputs = model.generate(input_ids)
# Convert the predicted token IDs back to tokens
predicted_tokens_1 = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
print("Predictions for <mask>:", predicted_tokens_1)
The error is the following:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'.
The class this function is called from is 'T5Tokenizer'.
Traceback (most recent call last):
  File "/hdd/token_augmentation/run_m.py", line 11, in <module>
    tokenizer = T5Tokenizer.from_pretrained('Salesforce/codet5-large')
  File "/hdd/venv_3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1841, in from_pretrained
    return cls._from_pretrained(
  File "/hdd/venv_3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2004, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/hdd/venv_3.10/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py", line 190, in __init__
    self.sp_model.Load(vocab_file)
  File "/hdd/venv_3.10/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/hdd/venv_3.10/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
I installed sentencepiece, but the error is still the same. I also tried cloning the model folder with git lfs.
Can you help, please?
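For reference, here is a minimal sketch of what the warning seems to point at: the checkpoint declares `RobertaTokenizer`, so loading it through `AutoTokenizer` (or `RobertaTokenizer` directly) instead of `T5Tokenizer` should avoid the SentencePiece path entirely. The `TypeError` is probably raised because `T5Tokenizer` looks for a SentencePiece vocabulary file that this checkpoint does not provide, but that is a guess from the traceback. The helper names `split_sentinel_output` and `predict_spans` are hypothetical, `max_new_tokens=32` is an arbitrary choice, and decoding with `skip_special_tokens=False` assumes the `<extra_id_N>` sentinels would otherwise be stripped as special tokens.

```python
import re


def split_sentinel_output(decoded: str) -> dict:
    """Map each <extra_id_N> sentinel in a decoded string to the text that follows it."""
    # re.split with a capturing group keeps the sentinels in the result, e.g.
    # ["", "<extra_id_0>", " x ", "<extra_id_1>", " x"]
    parts = re.split(r"(<extra_id_\d+>)", decoded)
    spans = {}
    for sentinel, fill in zip(parts[1::2], parts[2::2]):
        spans[sentinel] = fill.strip()
    return spans


def predict_spans(text: str, checkpoint: str = "Salesforce/codet5-large") -> dict:
    """Fill <extra_id_N> placeholders in `text` (hypothetical helper).

    Downloads the checkpoint on first use unless `checkpoint` is a local path.
    """
    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # AutoTokenizer selects the tokenizer class declared by the checkpoint
    # (RobertaTokenizer, according to the warning) rather than T5Tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        outputs = model.generate(input_ids, max_new_tokens=32)

    # Assumption: the sentinels count as special tokens, so keep them while
    # decoding and remove padding/BOS/EOS markers by hand afterwards.
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=False)
    for tok in (tokenizer.pad_token, tokenizer.bos_token, tokenizer.eos_token):
        if tok:
            decoded = decoded.replace(tok, "")
    return split_sentinel_output(decoded)
```

Calling `predict_spans(text)` with the `text` from the script above would return a dict keyed by sentinel; passing the local clone path as `checkpoint` should work the same way if the folder contains the tokenizer files.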