Issue with sentencepiece tokenizer

Hi, I am using deepset's Haystack for question generation and get the following error.

```
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/haystack/nodes/question_generator/question_generator.py", line 59, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 597, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1783, in from_pretrained
    return cls._from_pretrained(
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1928, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5_fast.py", line 134, in __init__
    super().__init__(
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 113, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1077, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 426, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 92, in <module>
    _descriptor.EnumValueDescriptor(
File "/Users/kishoregarimella/Coding/workBrowDjango/env/lib/python3.10/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
```

Please let me know how to resolve this.

It looks like you need a different version of protobuf. Try running `pip install "protobuf==3.20.*"` in your Python environment and then rerun your code.
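If you can't downgrade protobuf, the second workaround listed in the error message can also be applied from inside your script, provided the environment variable is set before `transformers` (or anything else that imports `google.protobuf`) is imported. A minimal sketch:

```python
import os

# Force the pure-Python protobuf parser instead of the C++ upb backend.
# This avoids the "Descriptors cannot not be created directly" TypeError,
# at the cost of noticeably slower protobuf parsing.
# Must run before importing transformers / google.protobuf.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Only now import the libraries that pull in protobuf:
# from haystack.nodes import QuestionGenerator
```

Pinning `protobuf==3.20.*` is still the cleaner fix; this is only a stopgap for environments where the pin conflicts with other dependencies.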


Thanks, it is generating the questions now, but it throws the following error:

```
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:997)>
```
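This looks like the common macOS issue where the python.org Python install is not wired to a CA certificate bundle, so `nltk.download` cannot verify SSL certificates. The usual fix is to run the `Install Certificates.command` script that ships with the python.org installer. If that isn't available, a less secure workaround (assumed here, not from the thread) is to disable certificate verification just for the download:

```python
import ssl

# WORKAROUND (reduces security): make urlopen-based downloads, such as
# nltk.download(), skip SSL certificate verification. Prefer running the
# "Install Certificates.command" script bundled with python.org installers.
ssl._create_default_https_context = ssl._create_unverified_context

# With verification disabled, the punkt download should then succeed:
# import nltk
# nltk.download("punkt")
```

Note that `ssl._create_unverified_context` is a private CPython helper, so this is a quick hack rather than a supported API; fixing the certificate store is the proper solution.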