Lazy-loading binarized shards using HF Datasets with the HF Trainer

I have been in a dilemma about training language models. I want to train a model on a large amount (>5 TB) of data. Here is what I did:

  1. Write my own data loading script using this.
  2. Binarize the data
  3. Load the data using the load_dataset module.
  4. Use the Hugging Face Trainer to train the model. At first, I wrote my own training loop, but since the HF Trainer comes with DeepSpeed support, I decided to move to it.
  5. But what happens is that the binarized data becomes so large that it cannot be fully loaded into RAM (not even with 428 GB of RAM).

I was wondering, is there any way I can shard the data and perform lazy loading? What I mean by lazy loading is loading a binarized shard into memory only when the model is being trained on that specific shard.

The tradeoff is: either I learn DeepSpeed and integrate it into my own training loop, which seems like a very hectic job, or I try to solve this issue with HF Datasets.

@patrickvonplaten @valhalla

Hey there,

With datasets, all the pre-processed (binarized) data will be cached on the disk and will be loaded lazily by default. So you won’t run into any RAM issues at all.

So instead of binarizing the data first, you could load it with datasets and then pre-process it using the .map method. This caches the processed data on disk, which you can then load lazily.
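For illustration, a minimal sketch of that flow (the data files, model name, and tokenization function below are placeholders, not taken from the post above):

from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical paths and checkpoint; adjust to your own data and model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw_dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# .map runs the pre-processing once and caches the result as Arrow files on disk.
# The cached dataset is memory-mapped, so it is read lazily instead of being
# loaded into RAM in one go, and can be passed to Trainer as train_dataset.
tokenized = raw_dataset["train"].map(tokenize_fn, batched=True, remove_columns=["text"])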

You could refer to the training scripts in transformers examples.

Also, refer to the datasets docs.

Hope this helps!


@valhalla I will definitely re-check, but here's what I did, as far as I can remember.

I wrote a data loading script using this tutorial.

import nlp
import torch

# Load the raw splits via the custom data loading script.
train_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.VALIDATION)

processor = DataProcessor(
    tokenizer,
    model_type=data_args.model_type,
    max_source_length=data_args.max_source_length,
    max_target_length=data_args.max_target_length,
)
# DataProcessor implements all the necessary `map` calls (in a distributed manner)
# and a `convert_to_features` function using the `tokenizer` provided.
processor.process_all_maping_and_tokenization()

# Save the processed datasets as torch pickles.
torch.save(train_dataset, train_path)
torch.save(valid_dataset, valid_path)

I actually took it from your data preparation here. But DataProcessor is kind of different.

Later on, I tried it with the HF Trainer:

# Load the binarized datasets saved earlier (this pulls everything into RAM).
train_dataset = torch.load(data_args.train_file_path)
valid_dataset = torch.load(data_args.valid_file_path)

data_collator = MyDataCollator(
    tokenizer=tokenizer,
    model_type=model_args.model_type,
    mode="training",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    # ... remaining arguments omitted ...
)

Finally, I launch it with:

python -m torch.distributed.launch --nproc_per_node $NGPU train.py
... arguments ...
... arguments ...
... arguments ...

When I start the process, the job fails completely on an 8x V100 (16 GB) machine by overflowing the RAM.

Is there anything I’m doing wrong?

torch.load is the culprit here: it loads all of the data into memory. As I said above, datasets already takes care of caching the dataset on disk, so there is no need to use torch.save and torch.load.

If you want to save the dataset manually, have a look at this section.
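Roughly along these lines (the paths are hypothetical, and tokenized_dataset stands for a dataset already processed with .map):

from datasets import load_from_disk

# tokenized_dataset: a datasets.Dataset produced by .map (hypothetical name).
# Persist the processed Arrow files explicitly, instead of torch.save...
tokenized_dataset.save_to_disk("processed/train")

# ...and reload them later. load_from_disk memory-maps the Arrow files,
# so the data is read lazily instead of being pulled fully into RAM.
train_dataset = load_from_disk("processed/train")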


Ahh…!!! Got it. Thanks a lot for this help.