Hi there,
I want to fine-tune a question-answering model using the code from the Hugging Face tutorial. I downloaded the model "vicuna-7b-v1.5.Q4_K_M.gguf" from "TheBloke/vicuna-7B-v1.5-GGUF" on Hugging Face, including the config.json, into the local folder "models/vicuna_7b", and ran this code:
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b")
and got this error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
c:\Users\xxx\Repositories\finetuning\finetuning.ipynb Cell 5 line 3
1 from transformers import AutoTokenizer
----> 3 tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b")
File c:\Users\xxx\Repositories\finetuning\.finetuning_venv\Lib\site-packages\transformers\models\auto\tokenization_auto.py:754, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
752 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
753 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 754 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
755 else:
756 if tokenizer_class_py is not None:
File c:\Users\xxx\Repositories\finetuning\.finetuning_venv\Lib\site-packages\transformers\tokenization_utils_base.py:1838, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
1832 logger.info(
1833 f"Can't load following files from cache: {unresolved_files} and cannot check if these "
1834 "files are necessary for the tokenizer to operate."
1835 )
1837 if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
-> 1838 raise EnvironmentError(
1839 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
1840 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
1841 f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
1842 f"containing all relevant files for a {cls.__name__} tokenizer."
1843 )
1845 for file_id, file_path in vocab_files.items():
1846 if file_id not in resolved_vocab_files:
OSError: Can't load tokenizer for 'models/vicuna_7b'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'models/vicuna_7b' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
I presume this is due to the AutoTokenizer. Since it is a Llama model, I also tried LlamaTokenizer, which resulted in the same error. Since I am not yet sure which model to use, I also want to try fine-tuning "TheBloke/em_german_leo_mistral-GGUF" from Hugging Face.
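For context, in case it matters for the answer: as far as I understand, a .gguf file is a single-file llama.cpp binary, not a transformers checkpoint folder, so my local directory may simply be missing the tokenizer files (tokenizer.model / tokenizer.json) that from_pretrained looks for. A minimal check of the file's magic bytes (just a sketch; the b"GGUF" magic comes from the GGUF format spec, and is_gguf is a helper name I made up):

```python
def is_gguf(path):
    """Return True if the file starts with the GGUF magic bytes (b"GGUF")."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example with my local download (adjust the path as needed):
# is_gguf("models/vicuna_7b/vicuna-7b-v1.5.Q4_K_M.gguf")
```

On my machine this returns True for the downloaded file, which is why I suspect the format rather than the tokenizer class.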
Which tokenizers are the right ones for these models?
Is there a general rule for which tokenizer to use with which model?
Regards