Tokenizers for QA fine-tuning: Llama models vicuna-7B-v1.5-GGUF and em_german_leo_mistral-GGUF

Hi there,

I want to fine-tune a question-answering model using the code from the Hugging Face question-answering tutorial. I downloaded the model “vicuna-7b-v1.5.Q4_K_M.gguf” from “TheBloke/vicuna-7B-v1.5-GGUF · Hugging Face”, together with its config.json, into the local folder “models/vicuna_7b” and ran the following code

from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")

squad = squad.train_test_split(test_size=0.2)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b")

and got this error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
c:\Users\xxx\Repositories\finetuning\finetuning.ipynb Cell 5 line 3
      1 from transformers import AutoTokenizer
----> 3 tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b")

File c:\Users\xxx\Repositories\finetuning\.finetuning_venv\Lib\site-packages\transformers\models\auto\tokenization_auto.py:754, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    752 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    753 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 754     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    755 else:
    756     if tokenizer_class_py is not None:

File c:\Users\xxx\Repositories\finetuning\.finetuning_venv\Lib\site-packages\transformers\tokenization_utils_base.py:1838, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1832     logger.info(
   1833         f"Can't load following files from cache: {unresolved_files} and cannot check if these "
   1834         "files are necessary for the tokenizer to operate."
   1835     )
   1837 if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
-> 1838     raise EnvironmentError(
   1839         f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
   1840         "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
   1841         f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
   1842         f"containing all relevant files for a {cls.__name__} tokenizer."
   1843     )
   1845 for file_id, file_path in vocab_files.items():
   1846     if file_id not in resolved_vocab_files:

OSError: Can't load tokenizer for 'models/vicuna_7b'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'models/vicuna_7b' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

I presume this is due to the AutoTokenizer. Since it is a Llama model, I also tried LlamaTokenizer (see the snippet below), which resulted in the same error. Since I am not yet sure about the model, I also want to try fine-tuning “TheBloke/em_german_leo_mistral-GGUF · Hugging Face”.
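For completeness, this is roughly what I tried with LlamaTokenizer (same local folder as above; the comment is only my guess at the cause):

from transformers import LlamaTokenizer

# This raises the same OSError as AutoTokenizer, presumably because the local
# folder only contains vicuna-7b-v1.5.Q4_K_M.gguf and config.json, so there is
# no tokenizer.model / tokenizer.json for the tokenizer class to load.
tokenizer = LlamaTokenizer.from_pretrained("models/vicuna_7b")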

What are the right tokenizers for these models?
Is there a rule for which tokenizer to use with which model?

Regards