Hello, I wasn’t sure whether this belongs in the transformers, datasets, or tokenizers category, but I wanted to post some benchmark times for training a GPT-style tokenizer on a text dataset in the tens of GB, because they seem slower than I expected (my expectations could be totally off). The “Pre-processing sequences” step took ~3 hours on a modern 12-core AMD CPU.
Here is the script I used:
import datasets
from transformers import AutoTokenizer


def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


if __name__ == "__main__":
    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    clone_from_name = "gpt2"
    vocab_size = 32_768

    clone_from_tokenizer = AutoTokenizer.from_pretrained(clone_from_name)
    ds_train = datasets.load_dataset(ds_id, split="train")

    tokenizer = clone_from_tokenizer.train_new_from_iterator(
        batch_iterator(ds_train),
        vocab_size=vocab_size,
    )
    tokenizer.save_pretrained("pubtator-gpt2-v32k-tokenizer")
and here is the output:
python train_tokenizer.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Using custom data configuration gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808
Found cached dataset parquet (/home/galtay/.cache/huggingface/datasets/gabrielaltay___parquet/gabrielaltay--pubtator-central-bigbio-kb-2022-12-18-51c5a8a315ecf808/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[02:55:09] Pre-processing sequences █████████████████████████████ 0 / 0
[00:00:07] Tokenize words █████████████████████████████ 6828518 / 6828518
[00:00:13] Count pairs █████████████████████████████ 6828518 / 6828518
[00:00:48] Compute merges █████████████████████████████ 32511 / 32511
The train split of the dataset is ~100 GB, but the text is duplicated in another column with markup, so I estimate about 50 GB in the “text” column. I would have thought this should run at “training a tokenizer on English Wikipedia” speeds to within a factor of 10 or so (I was thinking minutes, not hours). Can anyone see where I’m making a mistake, or are my time estimates just totally off?
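For scale, a rough back-of-the-envelope from those numbers (treating the “text” column as ~50 GB): 50e9 bytes / (3 × 3600 s) ≈ 4.6 MB/s across the whole 12-core machine, which is the throughput I’m trying to sanity check.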
I’m using,
datasets 2.8.0
transformers 4.25.1
and this is the dataset on the Hub: gabrielaltay/pubtator-central-bigbio-kb-2022-12-18
thanks,
-G
UPDATE: attempting to isolate dataset iteration speed with
import datasets
from tqdm import tqdm


def batch_iterator(dataset, batch_size=1_000):
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


if __name__ == "__main__":
    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    ds_train = datasets.load_dataset(ds_id, split="train")
    for batch in tqdm(batch_iterator(ds_train)):
        x = 1
and getting,
700it [02:10, 5.18it/s]
leading me to believe the bottleneck is dataset iteration speed:
(33M samples) / (batch size 1,000) / (~6 it/s) ≈ 5,500 s ≈ 90 minutes, i.e. roughly half of the ~3 hour total.
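The next thing I plan to try (untested sketch below) is to rule out .iter() and the extra columns: slice the Dataset directly, which I believe is the pattern used in the tokenizer-training docs, and drop everything except “text” first so each batch doesn’t have to materialize the duplicated markup column. Only load_dataset, column_names, remove_columns, and plain slicing are used here, all standard datasets API; whether it actually speeds up iteration is just a guess until I measure it.

import datasets
from tqdm import tqdm


def batch_iterator(dataset, batch_size=1_000):
    # slice the Dataset directly instead of going through .iter()
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


if __name__ == "__main__":
    ds_id = "gabrielaltay/pubtator-central-bigbio-kb-2022-12-18"
    ds_train = datasets.load_dataset(ds_id, split="train")
    # keep only the "text" column so each sliced batch doesn't have to
    # decode the (large) duplicated markup column
    drop_cols = [c for c in ds_train.column_names if c != "text"]
    ds_train = ds_train.remove_columns(drop_cols)
    for batch in tqdm(batch_iterator(ds_train)):
        pass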