Wav2vec2CTCTokenizer and vocab.json

picheny · October 28, 2022, 3:45pm

I am using Wav2Vec2CTCTokenizer.from_pretrained to load the tokenizer for the Facebook base LibriSpeech model:

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained('facebook/wav2vec2-base-960h')

I am seeing behavior I do not understand. If a vocab.json file already exists in the directory from which I run the command above, the tokenizer appears to ignore the vocab.json in the base model and use the one in my directory instead. Is this correct, and if so, where does it happen in the source code? I cannot find it.

lianghsun · October 28, 2022, 11:00pm

github.com

huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1570

        `tokenizer.get_vocab()[token]` is equivalent to `tokenizer.convert_tokens_to_ids(token)` when `token` is in the
        vocab.

        Returns:
            `Dict[str, int]`: The vocabulary.
        """
        raise NotImplementedError()

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], *init_inputs, **kwargs):
        r"""
        Instantiate a [`~tokenization_utils_base.PreTrainedTokenizerBase`] (or a derived class) from a predefined
        tokenizer.

        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
                Can be either:

                - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
                  Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a

picheny · October 29, 2022, 7:29pm

I am sorry, could you elaborate? I still don’t see why it does not default to the base model's vocab file. Where and how does it end up taking something from my directory?
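
One way to check whether the local file is really being picked up is to compare the loaded vocabulary against the vocab.json in the current directory. The following is a minimal sketch and not a confirmed explanation of the behavior: the helper names `local_vocab_is_subset` and `check_pretrained` are hypothetical and introduced only for illustration, while `from_pretrained` and `get_vocab()` are the calls already mentioned in this thread. It assumes the local vocab.json is a flat token-to-id mapping.

```python
import json
from pathlib import Path


def local_vocab_is_subset(loaded_vocab, local_path="vocab.json"):
    """Check whether every (token, id) pair in the local file is in loaded_vocab.

    Returns None if no file exists at local_path, otherwise True or False.
    A subset test is used rather than equality because a tokenizer may add
    special tokens on top of whatever vocab file it read.
    """
    path = Path(local_path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        local_vocab = json.load(f)
    return all(loaded_vocab.get(token) == index for token, index in local_vocab.items())


def check_pretrained(model_id="facebook/wav2vec2-base-960h", local_path="vocab.json"):
    """Load the tokenizer and compare its vocabulary with the local file.

    Needs the transformers package plus network access or a populated cache,
    so the import is kept inside the function.
    """
    from transformers import Wav2Vec2CTCTokenizer

    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(model_id)
    return local_vocab_is_subset(tokenizer.get_vocab(), local_path)
```

If `check_pretrained()` returns True while the local vocab.json differs from the one in the model repo, that suggests the local file was used. A True result is inconclusive when the two files happen to be identical. Running the same check from an empty working directory (where it returns None) and inspecting `tokenizer.get_vocab()` there would show whether the result depends on the current directory.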
