Training SentencePiece from scratch?

Hi! I would like to train a SentencePiece tokenizer from scratch, but I’m a bit lost in the documentation and don’t know where to start. There are already examples on the Hugging Face website of how to train a BPE tokenizer, but I don’t know if I can simply transfer them one-to-one. I also don’t know where to find the trainable class for SentencePiece.
Has anyone here already trained a SentencePiece tokenizer?


@Johncwok check this page: Using tokenizers from 🤗 Tokenizers — transformers 4.7.0 documentation

You can train a SentencePiece tokenizer like this:

from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    text,
    vocab_size=30_000,
    min_frequency=5,
    show_progress=True,
    limit_alphabet=500,
)
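
Here, text can be any iterable of strings. A hypothetical toy example, just to show the expected shape:

# hypothetical toy corpus; in practice this would be your full dataset,
# since a vocab_size of 30,000 needs far more text than this
text = [
    "The quick brown fox jumps over the lazy dog.",
    "SentencePiece-style BPE works on raw, untokenized text.",
]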

and then just wrap it with a PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast

transformer_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer
)
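
A quick sanity check on the wrapped tokenizer (a sketch, assuming the snippets above have run):

enc = transformer_tokenizer("Hello world!")
print(enc.input_ids)  # token ids
print(transformer_tokenizer.convert_ids_to_tokens(enc.input_ids))  # pieces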

The documentation is not quite clear about this.


Sorry to jump on this old question, but I’m wondering how to properly load a tokenizer that I’ve trained this way from file. I’ve been using the same approach you indicate here, but saving it with tokenizer.save('/path/to/tokenizer.json'), then loading it with the tokenizer_file option. It loads, but I’m noticing that all the special tokens come back set to None.
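
To be concrete, this is the loading step I mean (a sketch; the path is a placeholder):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="/path/to/tokenizer.json")
print(tokenizer.pad_token)  # None, unless the special tokens are passed to the constructor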

Also, although using from transformers import PreTrainedTokenizerFast does seem to actually work, my IDE suggests that it’s supposed to come from transformers.utils.dummy_tokenizers_objects instead… but that doesn’t actually seem to work.

It’s all a bit muddy. Any help appreciated.

Okay, what seems to give me a fully configured, functioning Transformers tokenizer is:

import transformers
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<cls>", "<sep>", "<mask>"]
tk_tokenizer = SentencePieceBPETokenizer()
tk_tokenizer.train_from_iterator(
    text,
    vocab_size=4000,
    min_frequency=2,
    show_progress=True,
    special_tokens=special_tokens
)
tk_tokenizer.save(tokenizer_path)  # tokenizer_path: where to write the raw tokenizers JSON
# convert to a Transformers tokenizer
# (model_length is the maximum sequence length you want, e.g. 512)
tokenizer = transformers.PreTrainedTokenizerFast(
    tokenizer_object=tk_tokenizer,
    model_max_length=model_length,
    special_tokens=special_tokens,
)
tokenizer.bos_token = "<s>"
tokenizer.bos_token_id = tk_tokenizer.token_to_id("<s>")
tokenizer.pad_token = "<pad>"
tokenizer.pad_token_id = tk_tokenizer.token_to_id("<pad>")
tokenizer.eos_token = "</s>"
tokenizer.eos_token_id = tk_tokenizer.token_to_id("</s>")
tokenizer.unk_token = "<unk>"
tokenizer.unk_token_id = tk_tokenizer.token_to_id("<unk>")
tokenizer.cls_token = "<cls>"
tokenizer.cls_token_id = tk_tokenizer.token_to_id("<cls>")
tokenizer.sep_token = "<sep>"
tokenizer.sep_token_id = tk_tokenizer.token_to_id("<sep>")
tokenizer.mask_token = "<mask>"
tokenizer.mask_token_id = tk_tokenizer.token_to_id("<mask>")
# and save for later!
tokenizer.save_pretrained("./path/to/transformers/version/")

This seems to load properly using AutoTokenizer.from_pretrained(), which should make life a lot easier.
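
A quick check that the round trip worked (a sketch, using the same path as above):

from transformers import AutoTokenizer

loaded = AutoTokenizer.from_pretrained("./path/to/transformers/version/")
print(loaded.special_tokens_map)              # all seven special tokens should be set
print(loaded.pad_token, loaded.pad_token_id)  # e.g. <pad> and its vocabulary id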


Thanks for this code snippet! How did you format the dataset (text) here? One sentence per line? I was also wondering whether I could train it iteratively on multiple chunks of text input?

You’re welcome! I know it always helps to see code.

In my case it was just one line per sentence in a single flat text file.
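
If the corpus is large, you also don’t have to load it into memory: train_from_iterator accepts a generator. A minimal sketch, assuming a flat file named corpus.txt with one sentence per line:

def line_iterator(path):
    # stream one sentence per line instead of reading the whole file
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

tk_tokenizer.train_from_iterator(line_iterator("corpus.txt"), vocab_size=4000)

As far as I know, though, each call to train_from_iterator trains a fresh model in a single pass over the iterator; it does not incrementally update an existing vocabulary across calls, so streaming chunks through one generator is the way to go.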


Hi everyone,

I found this thread when I was looking for information on how to best use a custom SentencePiece model with the Hugging Face models / tokenizers. I found a different solution that works for me:

  1. Train a SentencePiece model with the sentencepiece library
  2. Load it once into the tokenizer that I want
  3. Save that tokenizer with .save_pretrained()

After that it can be loaded with .from_pretrained(). Here are the steps with a little more detail.

First, train the SentencePiece model:

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="<FILE_NAME>",  # placeholder: path to your training text file
    model_prefix='spModel',
    vocab_size=1000,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece='[PAD]',
    unk_piece='[UNK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols='[MASK]',
    model_type='unigram'
)
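
As an optional sanity check (a sketch), you can load the trained model back with the sentencepiece library and confirm that the ids line up with the arguments above:

sp = spm.SentencePieceProcessor(model_file="spModel.model")
print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())  # expect: 0 1 2 3
print(sp.id_to_piece(4))  # '[MASK]' should come next, as a user-defined symbol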

Now load it into a tokenizer, here a DebertaV2Tokenizer (why this needs to be the vocab file, I am not sure, but it works; I also specify the maximum length here):

from transformers import DebertaV2Tokenizer

tokenizer_deberta = DebertaV2Tokenizer(
    vocab_file="spModel.model",
    model_max_length=512,  # max_len is a deprecated alias for this in recent transformers versions
)

Now save as a pretrained tokenizer:

tokenizer_deberta.save_pretrained(PATH)

And from that point on you can load it as any pretrained tokenizer:

tokenizer_loaded = DebertaV2Tokenizer.from_pretrained(PATH)

When I print that guy, it looks to me like all special tokens and the sequence length are correct:

DebertaV2Tokenizer(name_or_path=PATH, vocab_size=1000, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Maybe this still helps. I think the nice thing about this approach is that you can get a custom unigram tokenizer.


How did you format the dataset (text) here?

I couldn’t figure out how to pass an iterator to this function. Do you know how to do that? Otherwise one always has to create a file first.
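
In case it helps: recent versions of the sentencepiece library can train straight from an iterator via the sentence_iterator argument, writing the model through a model_writer object, so no intermediate file is needed. A sketch based on the sentencepiece Python docs (toy values for illustration; vocab_size has to fit whatever corpus you actually pass):

import io
import sentencepiece as spm

sentences = iter(["first sentence", "second sentence"])  # any iterator of str
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=sentences,
    model_writer=model,
    vocab_size=16,  # toy value to match the toy corpus
)
# persist the trained model for later use
with open("spModel.model", "wb") as f:
    f.write(model.getvalue())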