Training SentencePiece from scratch?

Hi! I would like to train a SentencePiece tokenizer from scratch, but I’m a bit lost in the documentation and don’t know where to start. There are already examples on the Hugging Face website of how to train a BPE tokenizer, but I don’t know if I can simply transfer them one-to-one. I also don’t know where to find the trainable class for SentencePiece.
Has anyone here already trained a SentencePiece tokenizer?


@Johncwok check this page: Using tokenizers from 🤗 Tokenizers — transformers 4.7.0 documentation

You can train a SentencePiece tokenizer like this:

from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    text,
    vocab_size=30_000,
    min_frequency=5,
    show_progress=True,
    limit_alphabet=500,
)
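
Here, text can be any iterable of strings. A hypothetical toy example, just to show the expected shape:

# hypothetical toy corpus; in practice this would be your full dataset,
# since a vocab_size of 30,000 needs far more text than this
text = [
    "The quick brown fox jumps over the lazy dog.",
    "SentencePiece-style BPE works on raw, untokenized text.",
]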

and then just wrap it with a PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast

transformer_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer
)
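
A quick sanity check on the wrapped tokenizer (a sketch, assuming the snippets above have run):

enc = transformer_tokenizer("Hello world!")
print(enc.input_ids)  # token ids
print(transformer_tokenizer.convert_ids_to_tokens(enc.input_ids))  # pieces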

The documentation is not quite clear about this.


Sorry to jump on this old question, but I’m wondering how to properly load a tokenizer that I’ve trained this way from file. I’ve been using the same approach you indicate here, but saving it with tokenizer.save('/path/to/tokenizer.json'), then loading it with the tokenizer_file option. It loads, but I’m noticing that all the special tokens come back set to None.
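
To be concrete, this is the loading step I mean (a sketch; the path is a placeholder):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="/path/to/tokenizer.json")
print(tokenizer.pad_token)  # None, unless the special tokens are passed to the constructor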

Also, although using from transformers import PreTrainedTokenizerFast does seem to actually work, my IDE suggests that it’s supposed to come from transformers.utils.dummy_tokenizers_objects instead… but that doesn’t actually seem to work.

It’s all a bit muddy. Any help appreciated.

Okay, what seems to give me a fully configured, functioning Transformers tokenizer is:

import transformers
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<cls>", "<sep>", "<mask>"]
tk_tokenizer = SentencePieceBPETokenizer()
tk_tokenizer.train_from_iterator(
    text,
    vocab_size=4000,
    min_frequency=2,
    show_progress=True,
    special_tokens=special_tokens
)
tk_tokenizer.save(tokenizer_path)  # tokenizer_path: where to write the raw tokenizers JSON
# convert to a Transformers tokenizer
# (model_length is the maximum sequence length you want, e.g. 512)
tokenizer = transformers.PreTrainedTokenizerFast(
    tokenizer_object=tk_tokenizer,
    model_max_length=model_length,
    special_tokens=special_tokens,
)
tokenizer.bos_token = "<s>"
tokenizer.bos_token_id = tk_tokenizer.token_to_id("<s>")
tokenizer.pad_token = "<pad>"
tokenizer.pad_token_id = tk_tokenizer.token_to_id("<pad>")
tokenizer.eos_token = "</s>"
tokenizer.eos_token_id = tk_tokenizer.token_to_id("</s>")
tokenizer.unk_token = "<unk>"
tokenizer.unk_token_id = tk_tokenizer.token_to_id("<unk>")
tokenizer.cls_token = "<cls>"
tokenizer.cls_token_id = tk_tokenizer.token_to_id("<cls>")
tokenizer.sep_token = "<sep>"
tokenizer.sep_token_id = tk_tokenizer.token_to_id("<sep>")
tokenizer.mask_token = "<mask>"
tokenizer.mask_token_id = tk_tokenizer.token_to_id("<mask>")
# and save for later!
tokenizer.save_pretrained("./path/to/transformers/version/")

This seems to load properly using AutoTokenizer.from_pretrained(), which should make life a lot easier.
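
A quick check that the round trip worked (a sketch, using the same path as above):

from transformers import AutoTokenizer

loaded = AutoTokenizer.from_pretrained("./path/to/transformers/version/")
print(loaded.special_tokens_map)              # all seven special tokens should be set
print(loaded.pad_token, loaded.pad_token_id)  # e.g. <pad> and its vocabulary id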


Thanks for this code snippet! How did you format the dataset (text) here? One sentence per line? I was also wondering whether I could train it iteratively on multiple chunks of text input?

You’re welcome! I know it always helps to see code.

In my case it was just one line per sentence in a single flat text file.
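
If the corpus is large, you also don’t have to load it into memory: train_from_iterator accepts a generator. A minimal sketch, assuming a flat file named corpus.txt with one sentence per line:

def line_iterator(path):
    # stream one sentence per line instead of reading the whole file
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

tk_tokenizer.train_from_iterator(line_iterator("corpus.txt"), vocab_size=4000)

As far as I know, though, each call to train_from_iterator trains a fresh model in a single pass over the iterator; it does not incrementally update an existing vocabulary across calls, so streaming chunks through one generator is the way to go.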


Hi everyone,

I found this thread when I was looking for information on how to best use a custom SentencePiece model with the Hugging Face models / tokenizers. I found a different solution that works for me:

  1. Train a SentencePiece model with the sentencepiece library
  2. Load it once into the tokenizer that I want
  3. Save that tokenizer with .save_pretrained()

After that it can be loaded with .from_pretrained(). Here are the steps with a little more detail.

First, train the SentencePiece model:

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="<FILE_NAME>",  # placeholder: path to your training text file
    model_prefix='spModel',
    vocab_size=1000,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece='[PAD]',
    unk_piece='[UNK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols='[MASK]',
    model_type='unigram'
)
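
As an optional sanity check (a sketch), you can load the trained model back with the sentencepiece library and confirm that the ids line up with the arguments above:

sp = spm.SentencePieceProcessor(model_file="spModel.model")
print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())  # expect: 0 1 2 3
print(sp.id_to_piece(4))  # '[MASK]' should come next, as a user-defined symbol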

Now load it into a tokenizer, here a DebertaV2Tokenizer (why this needs to be the vocab file, I am not sure, but it works; I also specify the maximum length here):

from transformers import DebertaV2Tokenizer

tokenizer_deberta = DebertaV2Tokenizer(
    vocab_file="spModel.model",
    model_max_length=512,  # max_len is a deprecated alias for this in recent transformers versions
)

Now save as a pretrained tokenizer:

tokenizer_deberta.save_pretrained(PATH)

And from that point on you can load it as any pretrained tokenizer:

tokenizer_loaded = DebertaV2Tokenizer.from_pretrained(PATH)

When I print that guy, it looks to me like all special tokens and the sequence length are correct:

DebertaV2Tokenizer(name_or_path=PATH, vocab_size=1000, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Maybe this still helps. I think the nice thing about this approach is that you can get a custom unigram tokenizer.


How did you format the dataset (text) here?

I couldn’t figure out how to pass an iterator to this function. Do you know how to do that? Otherwise one always has to create a file first.
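
In case it helps: recent versions of the sentencepiece library can train straight from an iterator via the sentence_iterator argument, writing the model through a model_writer object, so no intermediate file is needed. A sketch based on the sentencepiece Python docs (toy values for illustration; vocab_size has to fit whatever corpus you actually pass):

import io
import sentencepiece as spm

sentences = iter(["first sentence", "second sentence"])  # any iterator of str
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=sentences,
    model_writer=model,
    vocab_size=16,  # toy value to match the toy corpus
)
# persist the trained model for later use
with open("spModel.model", "wb") as f:
    f.write(model.getvalue())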