How long to expect training to take, and guidance on subset size?

I’ve trained a BPE tokenizer from scratch on bookcorpus+wikipedia, and it took 5.5 hours on the full dataset (of which ~1hr20min was spent just ingesting the text from the iterator). Vocab size is ~30,000.
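
For reference, the training setup is roughly along these lines (just a sketch, not my exact script; the dataset loading, pre-tokenizer, special tokens, and batch size are simplified placeholders):

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in for the bookcorpus+wikipedia mix I'm actually using.
dataset = load_dataset("bookcorpus", split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of raw text so the whole dataset never has to sit in memory at once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))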

  • Is this the kind of time frame we’d expect? (I searched the internet, but there doesn’t seem to be much info on tokenizer training times for large datasets).

  • Is there any guidance on subsetting the data to train the tokenizer faster, i.e. a recommended amount of data to use? For now I’m thinking of reducing the dataset to 1/4 or so to make training quicker (see the sketch below).
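
For the subsetting, I was picturing something along these lines (a sketch; the 1/4 fraction and seed are arbitrary):

# Train the tokenizer on a random quarter of the data only.
subset = dataset.shuffle(seed=42).select(range(len(dataset) // 4))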

Edit 1:

I removed the normalizer I was using, and training is down to 40 minutes.

This was the normalizer I removed:

from tokenizers import Regex, normalizers

# Normalizer based on:
# https://github.com/JonasGeiping/cramming/blob/50bd06a65a4cd4a3dd6ee9ecce1809e1a9085374/cramming/data/tokenizer_preparation.py#L52
normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFD(),
        normalizers.Lowercase(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
        normalizers.Replace(
            Regex(r"[^\x00-\x7F]+"), ""
        ),  # strip non-ASCII; the range starts at \x00 rather than \x1F so that tab is kept
    ]
)

Hmm. I’d still like to use this normalizer, just without it being so slow…
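
To narrow down which step is the bottleneck, I guess I could time each normalizer individually on a sample of the text. Something like this sketch (the sample size and step names are arbitrary, and it assumes the dataset object from above):

import time

from tokenizers import Regex, normalizers

steps = {
    "replace quotes": normalizers.Replace("``", '"'),
    "NFD": normalizers.NFD(),
    "lowercase": normalizers.Lowercase(),
    "strip accents": normalizers.StripAccents(),
    "collapse spaces": normalizers.Replace(Regex(" {2,}"), " "),
    "strip non-ASCII": normalizers.Replace(Regex(r"[^\x00-\x7F]+"), ""),
}

sample = dataset[:10_000]["text"]
for name, norm in steps.items():
    start = time.perf_counter()
    for text in sample:
        norm.normalize_str(text)
    print(f"{name}: {time.perf_counter() - start:.2f}s")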

Edit 2:

Hmm. I switched to applying the normalization as a preprocessing step over the dataset instead (using Python’s re rather than the tokenizers Regex replacements, which turned out to be faster).
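
Roughly along these lines (a sketch; num_proc and the column name are just examples):

import re
import unicodedata

MULTI_SPACE = re.compile(r" {2,}")
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def normalize_text(example):
    text = example["text"].replace("``", '"').replace("''", '"')
    text = unicodedata.normalize("NFD", text).lower()
    # Strip accents by dropping the combining marks left over from NFD.
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = MULTI_SPACE.sub(" ", text)
    text = NON_ASCII.sub("", text)
    return {"text": text}

dataset = dataset.map(normalize_text, num_proc=8)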

However, when training the tokenizer on this preprocessed dataset, it no longer seems to iterate through the data as quickly. And although it’s using multiple CPU cores, only one is at 100% at a time while the others sit at 1-2% (which core is busy changes over time).

By contrast, during the 40-minute training run mentioned above, all the cores were close to 100% the whole time.

I checked the iteration speed of the mapped dataset, and it seems just as fast as the original dataset. Not yet sure what the issue is.
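
(For reference, this is roughly how I checked the iteration speed; original_dataset / dataset are just illustrative names for the pre- and post-map versions, and the batch count/size are arbitrary.)

import time

def time_iteration(ds, n_batches=100, batch_size=1000):
    # Pull a fixed number of text batches and report the wall-clock time taken.
    start = time.perf_counter()
    for i in range(0, n_batches * batch_size, batch_size):
        _ = ds[i : i + batch_size]["text"]
    return time.perf_counter() - start

print("original:", time_iteration(original_dataset))
print("mapped:", time_iteration(dataset))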