We're training a unigram tokenizer from scratch.
This is how it's instantiated:
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
This is how the trainer is initialized:
from tokenizers.trainers import UnigramTrainer

trainer = UnigramTrainer(
    unk_token="<unk>",
    special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"],
    vocab_size=10000,
)
We're training on a batch-by-batch basis:
tokenizer.train_from_iterator(dataloader, trainer=trainer)
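For reference, train_from_iterator accepts any iterator over strings or over batches (lists) of strings. Since the corpus is private, here is only an illustrative sketch of an iterator of that shape; the file handling, names, and batch size below are placeholders, not our actual pipeline:

def batch_iterator(corpus_files, batch_size=1000):
    # Yield lists of raw text lines; train_from_iterator consumes
    # either individual strings or batches of strings like these.
    batch = []
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch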
We're running into the following error toward the end of training:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:52:53] Pre-processing sequences ██████████████████████████████ 50158000 / 0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:52:57] Pre-processing sequences ██████████████████████████████ 50164000 / 0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:53:25] Pre-processing sequences ██████████████████████████████ 0 / 0
[00:01:37] Suffix array seeds ██████████████████████████████ 18201510 / 18201510
[00:00:00] EM training ██████████████████████████████ 18201510 / 32
thread '<unnamed>' panicked at 'likelihood is NAN. Input sentence may be too long.', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:413:17
Traceback (most recent call last):
File "train_unigram_tokenizer.py", line 106, in <module>
main(**args)
File "train_unigram_tokenizer.py", line 92, in main
tokenizer.train_from_iterator(datagen, trainer=trainer)
pyo3_runtime.PanicException: likelihood is NAN. Input sentence may be too long.
Here is the version info:
- transformers: 4.17.0
- tokenizers: 0.11.6
Since we're using a private corpus, I'm afraid I cannot provide a notebook to reproduce the error.