First of all, I am unable to load this model (or a pipeline built on it) using “gpssohi/distilbart-qgen-6-6”; I get the message:
OSError: Can't load config for 'gpssohi/distilbart-qgen-6-6'. Make sure that:
- 'gpssohi/distilbart-qgen-6-6' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'gpssohi/distilbart-qgen-6-6' is the correct path to a directory containing a config.json file
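(For reference, the failing call was essentially the sketch below; the direct from_pretrained variant dies with the same OSError. The task name here is just my illustration, since this is a seq2seq model:)

from transformers import pipeline

# Representative sketch only: both the direct from_pretrained calls and a
# pipeline like this one raise the OSError quoted above
qgen = pipeline("summarization", model="gpssohi/distilbart-qgen-6-6")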
This is despite following the instructions on the model card:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("gpssohi/distilbart-qgen-6-6")
model = AutoModelForSeq2SeqLM.from_pretrained("gpssohi/distilbart-qgen-6-6")
So I downloaded the model files locally and ran:
from transformers import BartTokenizer
tokenizer = BartTokenizer.from_pretrained("/pub/models/gpssohi/distilbart-qgen-6-6")
which produces the error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-4-9811fff8faaa>", line 2, in <module>
tokenizer = BartTokenizer.from_pretrained("/pub/models/gpssohi/distilbart-qgen-6-6")
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1428, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1575, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_roberta.py", line 174, in __init__
**kwargs,
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_gpt2.py", line 169, in __init__
super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 116, in __init__
super().__init__(**kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1314, in __init__
super().__init__(**kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 658, in __init__
"special token {} has to be either str or AddedToken but got: {}".format(key, type(value))
TypeError: special token bos_token has to be either str or AddedToken but got: <class 'dict'>
I did some spelunking through the code and found that bos_token (and its siblings) are loaded from the file tokenizer_config.json, which contains:
{"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "model_max_length": 1024, "special_tokens_map_file": null, "name_or_path": "sshleifer/distilbart-cnn-6-6", "tokenizer_class": "BartTokenizer"}
This file is loaded via json.load, so the value of each special token ends up being (yup!) a dictionary! Now, the "__type": "AddedToken" entry on each token suggests that these are serialized AddedToken values which, if properly reconstituted, would let this run without error.
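In fact, reconstituting them by hand appears to get past the constructor. Here is a rough sketch of what I mean (it assumes AddedToken is importable from transformers.tokenization_utils_base, which matches the traceback above, that the remaining keys in each dict match AddedToken's constructor, and that keyword arguments passed to from_pretrained override the values read from tokenizer_config.json):

import json
from transformers import BartTokenizer
from transformers.tokenization_utils_base import AddedToken  # import location may vary by version

path = "/pub/models/gpssohi/distilbart-qgen-6-6"
with open(path + "/tokenizer_config.json") as f:
    config = json.load(f)

# Rebuild each serialized {"__type": "AddedToken", ...} dict into an actual
# AddedToken object and pass it as an explicit kwarg so that it overrides
# the plain dict loaded from tokenizer_config.json
overrides = {}
for key, value in config.items():
    if isinstance(value, dict) and value.get("__type") == "AddedToken":
        fields = {k: v for k, v in value.items() if k != "__type"}
        overrides[key] = AddedToken(**fields)

tokenizer = BartTokenizer.from_pretrained(path, **overrides)

A cruder alternative would be to edit tokenizer_config.json and replace each dict with its plain "content" string, though that drops flags such as the "lstrip": true on mask_token.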
Is this a known bug? If so, is there a fix or patch? Failing that, is there some alternative usage that would get me past this without having to “hack” the Transformers code?