How long to expect training to take, and guidance on subset size?

I’ve trained a BPE tokenizer from scratch on bookcorpus+wikipedia, and it took 5.5 hours on the full dataset (of which ~1hr20min was spent just ingesting the text from the iterator). Vocab size is ~30,000.
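
For reference, the training setup is roughly along these lines (just a sketch, not my exact script; the dataset loading, pre-tokenizer, special tokens, and batch size are simplified placeholders):

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in for the bookcorpus+wikipedia mix I'm actually using.
dataset = load_dataset("bookcorpus", split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of raw text so the whole dataset never has to sit in memory at once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))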

  • Is this the kind of time frame we’d expect? (I searched the internet, but there doesn’t seem to be much info on tokenizer training times for large datasets).

  • Is there any guidance on subsetting the data to train the tokenizer faster, i.e. a recommended amount of data to use? For now I’m thinking of reducing the dataset to 1/4 or so to make training quicker (see the sketch below).
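
For the subsetting, I was picturing something along these lines (a sketch; the 1/4 fraction and seed are arbitrary):

# Train the tokenizer on a random quarter of the data only.
subset = dataset.shuffle(seed=42).select(range(len(dataset) // 4))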

Edit 1:

I removed the normalizer I was using, and training is down to 40 minutes.

This was the normalizer I removed:

from tokenizers import Regex, normalizers

# Normalizer based on:
# https://github.com/JonasGeiping/cramming/blob/50bd06a65a4cd4a3dd6ee9ecce1809e1a9085374/cramming/data/tokenizer_preparation.py#L52
normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFD(),
        normalizers.Lowercase(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
        normalizers.Replace(
            Regex(r"[^\x00-\x7F]+"), ""
        ),  # strip non-ASCII; the range starts at \x00 rather than \x1F so that tab is kept
    ]
)

Hmm. I’d still like to use this normalizer, just without it being so slow…
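
To narrow down which step is the bottleneck, I guess I could time each normalizer individually on a sample of the text. Something like this sketch (the sample size and step names are arbitrary, and it assumes the dataset object from above):

import time

from tokenizers import Regex, normalizers

steps = {
    "replace quotes": normalizers.Replace("``", '"'),
    "NFD": normalizers.NFD(),
    "lowercase": normalizers.Lowercase(),
    "strip accents": normalizers.StripAccents(),
    "collapse spaces": normalizers.Replace(Regex(" {2,}"), " "),
    "strip non-ASCII": normalizers.Replace(Regex(r"[^\x00-\x7F]+"), ""),
}

sample = dataset[:10_000]["text"]
for name, norm in steps.items():
    start = time.perf_counter()
    for text in sample:
        norm.normalize_str(text)
    print(f"{name}: {time.perf_counter() - start:.2f}s")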

Edit 2:

Hmm. I switched to applying the normalization as a preprocessing step over the dataset instead (using Python’s re rather than the tokenizers Regex replacements, which turned out to be faster).
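
Roughly along these lines (a sketch; num_proc and the column name are just examples):

import re
import unicodedata

MULTI_SPACE = re.compile(r" {2,}")
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def normalize_text(example):
    text = example["text"].replace("``", '"').replace("''", '"')
    text = unicodedata.normalize("NFD", text).lower()
    # Strip accents by dropping the combining marks left over from NFD.
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = MULTI_SPACE.sub(" ", text)
    text = NON_ASCII.sub("", text)
    return {"text": text}

dataset = dataset.map(normalize_text, num_proc=8)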

However, when training the tokenizer on this preprocessed dataset, it no longer seems to iterate through the data as quickly. And although it’s using multiple CPU cores, only one is at 100% at a time while the others sit at 1-2% (which core is busy changes over time).

By contrast, during the 40-minute training run mentioned above, all the cores were close to 100% the whole time.

I checked the iteration speed of the mapped dataset, and it seems just as fast as the original dataset. Not yet sure what the issue is.
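
(For reference, this is roughly how I checked the iteration speed; original_dataset / dataset are just illustrative names for the pre- and post-map versions, and the batch count/size are arbitrary.)

import time

def time_iteration(ds, n_batches=100, batch_size=1000):
    # Pull a fixed number of text batches and report the wall-clock time taken.
    start = time.perf_counter()
    for i in range(0, n_batches * batch_size, batch_size):
        _ = ds[i : i + batch_size]["text"]
    return time.perf_counter() - start

print("original:", time_iteration(original_dataset))
print("mapped:", time_iteration(dataset))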