Training unigram on long sequences

carted-ml · June 15, 2022, 11:52am

This is a follow-up to the thread “pyo3_runtime.PanicException: likelihood is NAN. Input sentence may be too long”.

Background

We’re training a unigram tokenizer from scratch.

This is how it’s instantiated:

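from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()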

This is how the trainer is initialized:

from tokenizers.trainers import UnigramTrainer

trainer = UnigramTrainer(
    unk_token="<unk>",
    special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"],
    vocab_size=10000,
)

We’re training on a batch-by-batch basis:

tokenizer.train_from_iterator(dataloader, trainer=trainer)
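
For readers without our dataloader, here is a minimal sketch of what "batch-by-batch" can look like. The helper name `batch_iterator` and the default batch size are assumptions for illustration, not our actual code; the only requirement is that the iterator yields strings or lists of strings.

```python
from itertools import islice


def batch_iterator(texts, batch_size=1000):
    """Yield lists of at most `batch_size` strings from any iterable of texts.

    Hypothetical helper: stands in for the dataloader used in this thread.
    """
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    corpus = ["first document", "second document", "third document"]
    for batch in batch_iterator(corpus, batch_size=2):
        print(len(batch), batch)
```

It would be passed as `tokenizer.train_from_iterator(batch_iterator(corpus), trainer=trainer, length=len(corpus))`. The optional `length` argument only informs the progress bar; leaving it out may be why the log further down reports a total of 0.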

We’re running into the following toward the end:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
[04:52:53] Pre-processing sequences                 ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 50158000 /        0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
[04:52:57] Pre-processing sequences                 ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 50164000 /        0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
[04:53:25] Pre-processing sequences                 ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 0        /        0
[00:01:37] Suffix array seeds                       ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 18201510 / 18201510
[00:00:00] EM training                              ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 18201510 /       32
thread '<unnamed>' panicked at 'likelihood is NAN. Input sentence may be too long.', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:413:17
Traceback (most recent call last):
  File "train_unigram_tokenizer.py", line 106, in <module>
    main(**args)
  File "train_unigram_tokenizer.py", line 92, in main
    tokenizer.train_from_iterator(datagen, trainer=trainer)
pyo3_runtime.PanicException: likelihood is NAN. Input sentence may be too long.

Here is the version info:

transformers: 4.17.0
tokenizers: 0.11.6

Here’s what we have tried in the meantime:

We used the enable_truncation() method that comes with the tokenizer. We set the maximum length to a value at which the tokenizer trained successfully on another version of the dataset. To our surprise, training still fails on the updated dataset with the same sequence length.
To better understand the issue, we first sampled all the very long sequences (longer than 300,000 characters) from the updated dataset, which is the one causing the issue. We can successfully train the tokenizer on those sampled entries.
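
A caveat on the first point: enable_truncation() appears to apply only when encoding text with a trained tokenizer, so it may have no effect on what the trainer sees. If that is right, the raw strings would have to be shortened before they reach train_from_iterator. The sketch below shows both experiments on plain Python strings; `split_by_length` and `clip_texts` are hypothetical helper names, and the threshold is the one quoted above.

```python
def split_by_length(texts, threshold=300_000):
    """Separate texts longer than `threshold` characters from the rest."""
    long_texts, short_texts = [], []
    for text in texts:
        if len(text) > threshold:
            long_texts.append(text)
        else:
            short_texts.append(text)
    return long_texts, short_texts


def clip_texts(texts, max_chars):
    """Yield each text cut to at most `max_chars` characters.

    This clips raw characters before training; it is not the same thing as
    token-level truncation at encoding time.
    """
    for text in texts:
        yield text[:max_chars]


if __name__ == "__main__":
    corpus = ["short", "x" * 20, "medium text"]
    long_texts, short_texts = split_by_length(corpus, threshold=15)
    print(len(long_texts), len(short_texts))
    print([len(t) for t in clip_texts(corpus, max_chars=10)])
```

Training on `long_texts` alone reproduces the second experiment, and feeding `clip_texts(corpus, max_chars)` to the trainer is one way to test whether raw length matters at all.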

We managed to train a BPE tokenizer on the same data on which the Unigram training is failing. Maybe there are some unhandled edge cases in the Viterbi algorithm implementation?

I’m also reiterating that we did manage to train the Unigram tokenizer on the top 20% longest sequences from our data. This suggests that individual sequence lengths alone do not cause the training failure. It fails only when we provide the full data. We have also ensured that the data doesn’t contain any non-ASCII characters.
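
For anyone wanting to repeat the ASCII check on their own data, a minimal sketch follows. The function name `find_non_ascii` and the `limit` parameter are assumptions for illustration; it relies only on `str.isascii()`, available from Python 3.7.

```python
def find_non_ascii(texts, limit=5):
    """Return up to `limit` (index, offending characters) pairs.

    Hypothetical helper: an empty result means every text is pure ASCII.
    """
    hits = []
    for index, text in enumerate(texts):
        if not text.isascii():
            offending = sorted({ch for ch in text if ord(ch) > 127})
            hits.append((index, offending))
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__":
    print(find_non_ascii(["plain ascii", "caf\u00e9", "na\u00efve text"]))
```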

ddeerreekk · June 22, 2022, 7:40am

Hi, I am facing the same error here. Did you find a way to work around it?

carted-ml · June 22, 2022, 1:22pm

Not yet

ddeerreekk · June 22, 2022, 2:11pm

So I changed the normalizer and the pre-tokenizer in their SentencePieceUnigramTokenizer implementation, and it works for me now. I didn’t try them separately, but my guess is that the Metaspace pre-tokenizer caused the length error.

tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
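
One possible reason for that guess, sketched with the standard library only: Metaspace splits on spaces, while Whitespace is documented as splitting with the regex `\w+|[^\w\s]+`, so text with few plain spaces yields much longer pieces under Metaspace. The functions below are rough approximations written for illustration (their names are made up), not the library's implementation, and whether piece length is what actually triggers the NaN is unverified.

```python
import re

# Regex documented for pre_tokenizers.Whitespace: runs of word characters
# or runs of punctuation.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def approx_whitespace_pieces(text):
    """Approximate Whitespace: split on whitespace and at punctuation."""
    return WHITESPACE_PATTERN.findall(text)


def approx_metaspace_pieces(text):
    """Approximate Metaspace: split on plain spaces only.

    Tabs, newlines and punctuation stay inside a piece in this stand-in.
    """
    return [piece for piece in text.split(" ") if piece]


def longest_piece(pieces):
    """Length of the longest piece, or 0 for an empty list."""
    return max((len(piece) for piece in pieces), default=0)


if __name__ == "__main__":
    sample = "id=12345,sku=ABC-999,tags=a|b|c\tcolor=red\nsize=xl next"
    print("metaspace-like longest piece:", longest_piece(approx_metaspace_pieces(sample)))
    print("whitespace-like longest piece:", longest_piece(approx_whitespace_pieces(sample)))
```

Running the same comparison over a sample of the failing dataset would show whether it contains unusually long space-free runs.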

carted-ml · June 23, 2022, 4:12am

Thanks for sharing. We can give it a try.
