Hi! I would like to train a SentencePiece tokenizer from scratch, but I'm a bit lost in the documentation and don't know where to start. There are already examples on how to train a BPE tokenizer on the Hugging Face website, but I don't know if I can simply transfer them one-to-one. Also, I don't even know where to find the trainable class for SentencePiece.
Have you already trained a SentencePiece tokenizer?
@Johncwok check this page: Using tokenizers from 🤗 Tokenizers — transformers 4.7.0 documentation
You can train a SentencePiece tokenizer like this:
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    text,  # any iterable of raw training strings
    vocab_size=30_000,
    min_frequency=5,
    show_progress=True,
    limit_alphabet=500,
)
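In case it's unclear what text is here: it's just an iterable of raw strings. A minimal, made-up example:
text = [
    "The quick brown fox jumped over the lazy dog.",
    "And then it did so again.",
]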
and then just wrap it with a PreTrainedTokenizerFast:
from transformers import PreTrainedTokenizerFast

transformer_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer
)
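Once wrapped, it behaves like any other fast tokenizer, so a quick sanity check could be:
enc = transformer_tokenizer("This is a test sentence.")  # arbitrary sample text
print(enc.input_ids)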
The documentation is not quite clear about this.
Sorry to jump on this old question, but I'm wondering how to properly load a tokenizer that I've trained this way from file. I've been using the same approach you indicate here, but saving it with tokenizer.save('/path/to/tokenizer.json') and then loading it with the tokenizer_file option. It loads, but I'm noticing that all the special tokens come back set to None.
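For reference, this is roughly what I mean (the path is a placeholder):
from transformers import PreTrainedTokenizerFast

tokenizer.save("/path/to/tokenizer.json")

# later:
loaded = PreTrainedTokenizerFast(tokenizer_file="/path/to/tokenizer.json")
print(loaded.bos_token)  # None, since the json alone doesn't say which tokens are special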
Also, although using from transformers import PreTrainedTokenizerFast does seem to actually work, my IDE suggests that it's supposed to come from transformers.utils.dummy_tokenizers_objects instead… but that doesn't actually seem to work.
It’s all a bit muddy. Any help appreciated.
Okay, what seems to give me a fully configured, functioning Transformers tokenizer is:
import transformers
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<cls>", "<sep>", "<mask>"]

tk_tokenizer = SentencePieceBPETokenizer()
tk_tokenizer.train_from_iterator(
    text,
    vocab_size=4000,
    min_frequency=2,
    show_progress=True,
    special_tokens=special_tokens,
)
tk_tokenizer.save(tokenizer_path)

# convert to a Transformers tokenizer (the special tokens are registered explicitly below)
tokenizer = transformers.PreTrainedTokenizerFast(
    tokenizer_object=tk_tokenizer,
    model_max_length=model_length,
)
tokenizer.bos_token = "<s>"
tokenizer.bos_token_id = tk_tokenizer.token_to_id("<s>")
tokenizer.pad_token = "<pad>"
tokenizer.pad_token_id = tk_tokenizer.token_to_id("<pad>")
tokenizer.eos_token = "</s>"
tokenizer.eos_token_id = tk_tokenizer.token_to_id("</s>")
tokenizer.unk_token = "<unk>"
tokenizer.unk_token_id = tk_tokenizer.token_to_id("<unk>")
tokenizer.cls_token = "<cls>"
tokenizer.cls_token_id = tk_tokenizer.token_to_id("<cls>")
tokenizer.sep_token = "<sep>"
tokenizer.sep_token_id = tk_tokenizer.token_to_id("<sep>")
tokenizer.mask_token = "<mask>"
tokenizer.mask_token_id = tk_tokenizer.token_to_id("<mask>")

# and save for later!
tokenizer.save_pretrained("./path/to/transformers/version/")
This seems to load properly using AutoTokenizer.from_pretrained(), which should make life a lot easier.
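For completeness, loading it back looks like this (same path as above):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./path/to/transformers/version/")
print(tokenizer.special_tokens_map)  # the special tokens should now be populated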
Thanks for this code snippet! How did you format the dataset (text) here? One sentence per line? I was also wondering whether I could train it iteratively for multiple chunks of text input?
You’re welcome! I know it always helps to see code.
In my case it was just one line per sentence in a single flat text file.
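As for multiple chunks: as far as I know, each train_from_iterator call trains from scratch rather than updating an existing model, but you can feed several sources in one pass by chaining their iterators (the file names here are made up):
from itertools import chain

def lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

tk_tokenizer.train_from_iterator(
    chain(lines("chunk1.txt"), lines("chunk2.txt")),
    vocab_size=4000,
    min_frequency=2,
)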
Hi everyone,
I found this thread when I was looking for information on how best to use a custom SentencePiece model with the Hugging Face models / tokenizers. I found a different solution that works for me:
- Train a Sentencepiece model with the Sentencepiece library
- Load it one time into the tokenizer that I want
- Save that tokenizer with .save_pretrained()
After that it can be loaded with .from_pretrained(). Here are the steps with a little more detail.
First, training the Sentencepiece model:
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input=<FILE_NAME>,
    model_prefix='spModel',
    vocab_size=1000,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece='[PAD]',
    unk_piece='[UNK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols='[MASK]',
    model_type='unigram',
)
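A quick way to sanity-check the trained model before wrapping it (the sample sentence is arbitrary):
sp = spm.SentencePieceProcessor(model_file='spModel.model')
print(sp.encode('a quick test sentence', out_type=str))  # prints the subword pieces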
Now load it into a tokenizer, here a DebertaV2Tokenizer (why this needs to be the vocab file, I am not sure, but it works; I also specify the maximum length here):
from transformers import DebertaV2Tokenizer

tokenizer_deberta = DebertaV2Tokenizer(
    vocab_file="spModel.model",
    model_max_length=512,
)
Now save as a pretrained tokenizer:
tokenizer_deberta.save_pretrained( PATH )
And from that point on you can load it as any pretrained tokenizer:
tokenizer_loaded = DebertaV2Tokenizer.from_pretrained(PATH)
When I print that guy, it looks to me like all special tokens and the sequence length are correct:
DebertaV2Tokenizer(name_or_path=PATH, vocab_size=1000, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
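And a quick round-trip check (the sentence is arbitrary):
ids = tokenizer_loaded("this is a test").input_ids
print(tokenizer_loaded.convert_ids_to_tokens(ids))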
Maybe this still helps. I think the nice thing about this approach is that you can get a custom unigram tokenizer.
How did you format the dataset (text) here?
I couldn’t figure out how to pass an iterator to this function. Do you know how to do that? Otherwise one always has to create a file first.
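Edit: it looks like the sentencepiece Python API can take an iterator directly via its sentence_iterator argument, writing the trained model to an in-memory object instead of files. Roughly (sentences stands in for your own corpus iterator):
import io
import sentencepiece as spm

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(sentences),
    model_writer=model,
    vocab_size=1000,
)

# use the trained model directly from memory...
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

# ...or write it to disk so the DebertaV2Tokenizer recipe above still applies
with open('spModel.model', 'wb') as f:
    f.write(model.getvalue())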