We're training a unigram tokenizer from scratch.
This is how it's instantiated:
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
This is how the trainer is initialized:
from tokenizers.trainers import UnigramTrainer

trainer = UnigramTrainer(
    unk_token="<unk>",
    special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"],
    vocab_size=10000,
)
We're training on a batch-by-batch basis:
tokenizer.train_from_iterator(dataloader, trainer=trainer)
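For reference, train_from_iterator accepts any iterator over strings or over batches (lists) of strings. Since the corpus is private, here is only an illustrative sketch of an iterator of that shape; the file handling, names, and batch size below are placeholders, not our actual pipeline:

def batch_iterator(corpus_files, batch_size=1000):
    # Yield lists of raw text lines; train_from_iterator consumes
    # either individual strings or batches of strings like these.
    batch = []
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch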
We're running into the following error toward the end of training:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:52:53] Pre-processing sequences ██████████████████████████████ 50158000 / 0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:52:57] Pre-processing sequences ██████████████████████████████ 50164000 / 0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
[04:53:25] Pre-processing sequences ██████████████████████████████ 0 / 0
[00:01:37] Suffix array seeds ██████████████████████████████ 18201510 / 18201510
[00:00:00] EM training ██████████████████████████████ 18201510 / 32
thread '<unnamed>' panicked at 'likelihood is NAN. Input sentence may be too long.', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:413:17
Traceback (most recent call last):
File "train_unigram_tokenizer.py", line 106, in <module>
main(**args)
File "train_unigram_tokenizer.py", line 92, in main
tokenizer.train_from_iterator(datagen, trainer=trainer)
pyo3_runtime.PanicException: likelihood is NAN. Input sentence may be too long.
Here is the version info:
- transformers: 4.17.0
- tokenizers: 0.11.6
Since we're using a private corpus, I'm afraid I cannot provide a notebook to reproduce the error.