Loading pretrained SentencePiece tokenizer from Fairseq

Hello. I have a RoBERTa model pretrained with fairseq; it consists of dict.txt, model.pt, and sentencepiece.bpe.model.

I have found a way to convert a fairseq checkpoint to huggingface format in https://github.com/huggingface/transformers/blob/master/src/transformers/convert_roberta_original_pytorch_checkpoint_to_pytorch.py

However, I couldn’t find a similar method to convert the fairseq tokenizer (sentencepiece.bpe.model) to huggingface’s format.
Is there any existing solution? Or do I have to convert it myself?


@proxyht were you able to convert the sentencepiece model to a huggingface tokenizer? I am facing a similar issue.

A colleague of mine has figured out a way to work around this issue. Although both Huggingface and Fairseq use spm from Google, the tokenizer in Fairseq maps the ids from spm to the token ids in the dict.txt file, while Huggingface’s does not.

We will have to write a custom Tokenizer in Huggingface to simulate the behavior of Fairseq.
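
To make the id remapping concrete, here is a minimal sketch of the lookup such a custom tokenizer would need. It is not the colleague's actual code. It assumes fairseq's usual dictionary layout, where the special symbols <s>, <pad>, </s>, <unk> take ids 0 to 3 and the dict.txt entries follow in file order. The helper names load_fairseq_vocab and pieces_to_ids and the toy data are hypothetical.

```python
# Sketch only: mimics how a fairseq dictionary assigns ids, without importing fairseq.
# Assumption: the four special symbols occupy ids 0-3 and dict.txt entries follow in file order.
FAIRSEQ_SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]


def load_fairseq_vocab(dict_lines):
    """Build a piece -> id map from the lines of a dict.txt file ("<piece> <count>" per line)."""
    vocab = {tok: i for i, tok in enumerate(FAIRSEQ_SPECIALS)}
    for line in dict_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The count is the last space-separated field; everything before it is the piece.
        piece = line.rsplit(" ", 1)[0]
        if piece not in vocab:
            vocab[piece] = len(vocab)
    return vocab


def pieces_to_ids(pieces, vocab, add_special_tokens=True):
    """Look up SentencePiece pieces (strings) in the dict.txt vocabulary, not in spm's own ids."""
    unk = vocab["<unk>"]
    ids = [vocab.get(p, unk) for p in pieces]
    if add_special_tokens:
        # RoBERTa-style wrapping: <s> ... </s>
        ids = [vocab["<s>"]] + ids + [vocab["</s>"]]
    return ids


# Hypothetical toy data standing in for a real dict.txt and real SentencePiece output.
toy_dict = ["▁Hello 120", "▁world 95", "! 80"]
vocab = load_fairseq_vocab(toy_dict)
ids = pieces_to_ids(["▁Hello", "▁world", "?"], vocab)
print(ids)  # [0, 4, 5, 3, 2]
```

With a real model, the pieces would come from the sentencepiece package, for example sentencepiece.SentencePieceProcessor(model_file="sentencepiece.bpe.model").encode(text, out_type=str), and dict_lines from open("dict.txt", encoding="utf-8"). It is worth comparing the resulting ids against fairseq's own output on a few sentences before relying on this.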

Thanks. I will look into creating a custom tokenizer that uses the sentencepiece model, and tokenizing the data with it before passing it to training.

@proxyht it seems the new version released today added support for loading sentencepiece models. Details are in this PR: https://github.com/huggingface/tokenizers/pull/292 I haven’t tested it myself yet; I plan to test in a week or so. Let me know how it goes if you test before then.

For reference, tokenizers version 0.9.2 has a function to load an spm model. You need to run these two commands first to install the dependencies:

pip install protobuf
wget https://raw.githubusercontent.com/google/sentencepiece/master/python/sentencepiece_model_pb2.py

Then you can instantiate a huggingface tokenizer from the pretrained sentencepiece model as follows:

import tokenizers
tok = tokenizers.SentencePieceUnigramTokenizer.from_spm("spm.model")