OSError: Unable to load vocabulary from file

Hi guys,
sorry, but I would be so thankful if someone could take a look at my problem.
I already read the other discussions and didn't find my problem…

I'm a beginner with Hugging Face, so please be nice. I have already installed different models, and most of them work fine. But not this one: "google/mt5-base".

I'm loading it with this:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModel.from_pretrained("google/mt5-base")

and get this error:
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.


OSError                                   Traceback (most recent call last)
c:\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
   1957         try:
-> 1958             tokenizer = cls(*init_inputs, **init_kwargs)
   1959         except OSError:

c:\Anaconda3\lib\site-packages\transformers\models\t5\tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
    153         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
->  154         self.sp_model.Load(vocab_file)
    155

c:\Anaconda3\lib\site-packages\sentencepiece\__init__.py in Load(self, model_file, model_proto)
    366             return self.LoadFromSerializedProto(model_proto)
->  367         return self.LoadFromFile(model_file)
    368

c:\Anaconda3\lib\site-packages\sentencepiece\__init__.py in LoadFromFile(self, arg)
    170     def LoadFromFile(self, arg):
->  171         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    172

OSError: Not found: "C:\Users\clööd/.cache\huggingface\hub\models--google--mt5-small\snapshots\38f23af8ec210eb6c376d40e9c56bd25a80f195d\spiece.model": No such file or directory Error #2

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3400\1414250741.py in <module>
----> 1 tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
      2 model = AutoModel.from_pretrained("google/mt5-small")

c:\Anaconda3\lib\site-packages\transformers\models\auto\tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    677                 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    678             )
->  679         return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    680
    681     # Otherwise we have to be creative.

c:\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1802                 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
   1803
-> 1804         return cls._from_pretrained(
   1805             resolved_vocab_files,
   1806             pretrained_model_name_or_path,

c:\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
   1832         has_tokenizer_file = resolved_vocab_files.get("tokenizer_file", None) is not None
   1833         if (from_slow or not has_tokenizer_file) and cls.slow_tokenizer_class is not None:
-> 1834             slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
   1835                 copy.deepcopy(resolved_vocab_files),
   1836                 pretrained_model_name_or_path,

c:\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
   1958             tokenizer = cls(*init_inputs, **init_kwargs)
   1959         except OSError:
-> 1960             raise OSError(
   1961                 "Unable to load vocabulary from file. "
   1962                 "Please check that the provided vocabulary is accessible and not corrupted."

OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

I have tried so much already:

  • deleting the cache folder of Hugging Face,
  • downloading the model files manually from the repo,
  • downgrading Python to 3.8.5,
  • uninstalling & reinstalling transformers, huggingface-hub, tensorflow, torch
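For reference, the cache wipe from the first bullet can be sketched like this, assuming the default cache location (`~/.cache/huggingface/hub`) and the standard `models--<org>--<name>` folder naming; it removes only the cached mt5 repos so that `from_pretrained()` re-downloads them on the next run:

```python
import shutil
from pathlib import Path

# Default Hugging Face hub cache; adjust if HF_HOME / HF_HUB_CACHE is set.
cache_dir = Path.home() / ".cache" / "huggingface" / "hub"

# Cached repos live in folders named models--<org>--<name>; delete only
# the mt5 ones and leave every other cached model untouched.
for repo in cache_dir.glob("models--google--mt5-*"):
    shutil.rmtree(repo)
```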

I don't understand why the installation works for other models, but not for this mt5.
Thanks in advance for any(!) help…

infos:

  • transformers version: 4.27.3
  • Platform: Windows-10-10.0.19044-SP0
  • Python version: 3.9.13
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 2.0.0+cpu (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No?
  • Using distributed or parallel set-up in script?: No?
  • sentencepiece: 0.1.97

Additional:
Somehow I think the problem is related to the tokenizer,
because when I try to load a model that does work, e.g. "google/flan-t5-base",
it only works with AutoTokenizer, not with T5Tokenizer…

Without being too sure here (I tried reading the sentencepiece GitHub, but it's all C++ extension stuff): take a look at that path. Isn't it strange that there is one PosixPath separator (/.cache) while the rest is Windows-style? Maybe that's why the file cannot be found.
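To check that hypothesis, here is how `pathlib` parses a mixed-separator path like the one in the traceback (the username below is just a placeholder):

```python
from pathlib import PureWindowsPath

# PureWindowsPath treats "/" and "\" the same way Windows itself does,
# so both separators split into the same path components.
p = PureWindowsPath(r"C:\Users\someone/.cache\huggingface\hub")
print(p.parts)  # ('C:\\', 'Users', 'someone', '.cache', 'huggingface', 'hub')
```

If the components come out right like this, the stray forward slash is probably cosmetic, and the file lookup is failing for some other reason.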

Yes, it is strange, but I don't set the path; it's chosen automatically.
Also, the other models work from that same path? I don't get it…