Hi there,
I want to fine-tune a question-answering model using the code from the Hugging Face tutorial. I downloaded the model "vicuna-7b-v1.5.Q4_K_M.gguf" from "TheBloke/vicuna-7B-v1.5-GGUF" on Hugging Face, including the config.json, into the local folder "models/vicuna_7b", and ran this code:
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b")
and got this error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
c:\Users\xxx\Repositories\finetuning\finetuning.ipynb Cell 5 line 3
1 from transformers import AutoTokenizer
----> 3 tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b")
File c:\Users\xxx\Repositories\finetuning\.finetuning_venv\Lib\site-packages\transformers\models\auto\tokenization_auto.py:754, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
752 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
753 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 754 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
755 else:
756 if tokenizer_class_py is not None:
File c:\Users\xxx\Repositories\finetuning\.finetuning_venv\Lib\site-packages\transformers\tokenization_utils_base.py:1838, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
1832 logger.info(
1833 f"Can't load following files from cache: {unresolved_files} and cannot check if these "
1834 "files are necessary for the tokenizer to operate."
1835 )
1837 if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
-> 1838 raise EnvironmentError(
1839 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
1840 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
1841 f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
1842 f"containing all relevant files for a {cls.__name__} tokenizer."
1843 )
1845 for file_id, file_path in vocab_files.items():
1846 if file_id not in resolved_vocab_files:
OSError: Can't load tokenizer for 'models/vicuna_7b'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'models/vicuna_7b' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
I presume this is due to the AutoTokenizer. Since it is a Llama model, I also tried LlamaTokenizer, which resulted in the same error. Since I am not yet sure which model to use, I also want to try fine-tuning "TheBloke/em_german_leo_mistral-GGUF" from Hugging Face.
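For context, in case it matters for the answer: as far as I understand, a .gguf file is a single-file llama.cpp binary, not a transformers checkpoint folder, so my local directory may simply be missing the tokenizer files (tokenizer.model / tokenizer.json) that from_pretrained looks for. A minimal check of the file's magic bytes (just a sketch; the b"GGUF" magic comes from the GGUF format spec, and is_gguf is a helper name I made up):

```python
def is_gguf(path):
    """Return True if the file starts with the GGUF magic bytes (b"GGUF")."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example with my local download (adjust the path as needed):
# is_gguf("models/vicuna_7b/vicuna-7b-v1.5.Q4_K_M.gguf")
```

On my machine this returns True for the downloaded file, which is why I suspect the format rather than the tokenizer class.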
Which tokenizers are the right ones for these models?
Is there a general rule for which tokenizer to use with which model?
Regards