This is a follow-up to the thread pyo3_runtime.PanicException: likelihood is NAN. Input sentence may be too long.
Background
We're training a Unigram tokenizer from scratch.
This is how it's instantiated:
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
This is how the trainer is initialized:
from tokenizers.trainers import UnigramTrainer

trainer = UnigramTrainer(
    unk_token="<unk>",
    special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"],
    vocab_size=10000,
)
We're training on a batch-by-batch basis:
tokenizer.train_from_iterator(dataloader, trainer=trainer)
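For context, here is a minimal sketch of the kind of batch iterator we feed in (the names texts and batch_iterator are illustrative; our actual dataloader yields batches of raw strings in the same way):

def batch_iterator(texts, batch_size=1000):
    # texts: an in-memory list of raw strings, standing in for our dataloader
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

# length is optional and only used for progress reporting
tokenizer.train_from_iterator(batch_iterator(texts), trainer=trainer, length=len(texts))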
We're running into the following toward the end:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:52:53] Pre-processing sequences ████████████████████████████████████████ 50158000 / 0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:52:57] Pre-processing sequences ████████████████████████████████████████ 50164000 / 0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:53:25] Pre-processing sequences ████████████████████████████████████████ 0 / 0
[00:01:37] Suffix array seeds       ████████████████████████████████████████ 18201510 / 18201510
[00:00:00] EM training              ████████████████████████████████████████ 18201510 / 32
thread '<unnamed>' panicked at 'likelihood is NAN. Input sentence may be too long.', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:413:17
Traceback (most recent call last):
File "train_unigram_tokenizer.py", line 106, in <module>
main(**args)
File "train_unigram_tokenizer.py", line 92, in main
tokenizer.train_from_iterator(datagen, trainer=trainer)
pyo3_runtime.PanicException: likelihood is NAN. Input sentence may be too long.
Here is the version info:
- transformers: 4.17.0
- tokenizers: 0.11.6
Here's what we've tried in the meantime:
- Used the enable_truncation() method that comes with the tokenizer, setting the length to a value for which we were able to successfully train the tokenizer on another version of the dataset. Surprisingly, with the same sequence length but the updated dataset version, the tokenizer training fails (see the sketch after this list).
- To better understand the issue, we sampled all of the very long sequences (character lengths > 300000) from the updated dataset (the one causing the issue). We can successfully train the tokenizer on just those sampled entries.
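For concreteness, those two steps looked roughly like this (8192 and texts are illustrative placeholders, not our exact settings):

# Cap sequence length; 8192 stands in for the value that worked on the other dataset version.
tokenizer.enable_truncation(max_length=8192)

# Keep only the very long entries (> 300000 characters) from the updated dataset
# and train on just those.
long_texts = [t for t in texts if len(t) > 300_000]
tokenizer.train_from_iterator(long_texts, trainer=trainer, length=len(long_texts))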
We managed to train a BPE tokenizer on the same data on which the Unigram training is failing. Maybe there are some unhandled edge cases in the Viterbi algorithm implementation?
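The BPE run was roughly along these lines (same vocabulary size and special tokens as the Unigram trainer above; treat this as an approximation of our script rather than the exact code):

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

bpe_tokenizer = Tokenizer(BPE(unk_token="<unk>"))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
bpe_trainer = BpeTrainer(
    vocab_size=10000,
    special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"],
)
bpe_tokenizer.train_from_iterator(dataloader, trainer=bpe_trainer)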
I'm also reiterating that we did manage to train the Unigram tokenizer on the top 20% longest sequences from the data we have. This means that individual sequence lengths do not lead to a training failure; it fails only when we provide the full data. We have also ensured that the data doesn't contain any non-ASCII characters.
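The non-ASCII check was a simple scan along these lines (a sketch, not our exact script; str.isascii() requires Python 3.7+):

# Count entries containing any non-ASCII character; we expect zero.
non_ascii = sum(1 for t in texts if not t.isascii())
print(f"entries with non-ASCII characters: {non_ascii}")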