Loading pretrained SentencePiece tokenizer from Fairseq

Hello. I have a pretrained RoBERTa model on fairseq, which contains dict.txt, model.pt, sentencepiece.bpe.model.

I have found a way to convert a fairseq checkpoint to huggingface format in https://github.com/huggingface/transformers/blob/master/src/transformers/convert_roberta_original_pytorch_checkpoint_to_pytorch.py

However, I couldn’t find a similar method to convert the fairseq tokenizer (sentencepiece.bpe.model) to huggingface’s format.
Is there any existing solution? Or do I have to convert it by myself?

Thanks.

@proxyht were you able to convert the sentencepiece model to a huggingface tokenizer? I am facing the same issue.

A colleague of mine has figured out a way to work around this issue. Although both Huggingface and Fairseq use spm from Google, the tokenizer in Fairseq maps the IDs from spm to the token IDs in the dict.txt file, while Huggingface’s does not.

We will have to write a custom tokenizer in Huggingface to simulate Fairseq’s behavior.
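The remapping idea above can be sketched as follows. This is a minimal illustration, not the actual custom tokenizer: it assumes the standard fairseq dict.txt format (one "token count" pair per line, with IDs assigned in file order starting after the four special tokens <s>=0, <pad>=1, </s>=2, <unk>=3), and the toy pieces stand in for the output of a real spm model’s encode_as_pieces.

```python
# Sketch of remapping SentencePiece pieces to fairseq token IDs.
# Assumes the usual fairseq dict.txt layout: each line is "token count",
# and IDs are assigned in file order starting at 4, after the special
# tokens <s>=0, <pad>=1, </s>=2, <unk>=3.

def load_fairseq_dict(lines):
    """Build a token -> id map from the lines of a fairseq dict.txt."""
    specials = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
    vocab = dict(specials)
    for offset, line in enumerate(lines):
        token = line.split()[0]  # each line is "token count"
        vocab[token] = len(specials) + offset
    return vocab

def pieces_to_fairseq_ids(pieces, vocab):
    """Map SentencePiece pieces (e.g. from spm.encode_as_pieces)
    to fairseq IDs, falling back to <unk> for unseen pieces."""
    return [vocab.get(p, vocab["<unk>"]) for p in pieces]

# Toy data standing in for a real dict.txt and real spm output.
dict_lines = ["\u2581hello 120", "\u2581world 95", "ing 80"]
vocab = load_fairseq_dict(dict_lines)
print(pieces_to_fairseq_ids(["\u2581hello", "\u2581world", "xyz"], vocab))  # [4, 5, 3]
```

A custom Huggingface tokenizer would wrap this mapping around the spm model’s output, so that the IDs fed to the converted model match what Fairseq produced during training.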

Thanks. I will look into creating a custom tokenizer that uses the sentencepiece model and tokenizes the data before passing it to training.

@proxyht seems like the new version released today added support for loading sentencepiece models. Details in this PR: https://github.com/huggingface/tokenizers/pull/292 I haven’t tested it myself yet; I plan to test in a week or so. Let me know how it goes if you test it before then.

For reference, in tokenizers version 0.9.2 there is a function to load an spm model. You need to run these two commands first to install the dependencies:

pip install protobuf
wget https://raw.githubusercontent.com/google/sentencepiece/master/python/sentencepiece_model_pb2.py

Then you can instantiate a huggingface tokenizer from a pretrained sentencepiece model as:

tok = tokenizers.SentencePieceUnigramTokenizer.from_spm("spm.model")
