Lazy-loading binarized shards using HF Datasets with the HF Trainer

I have been in a dilemma about training language models. I want to train a model on a large amount (>5 TB) of data. Here is what I did:

  1. Write my own data loading script using this.
  2. Binarize the data
  3. Load the data using the load_dataset module.
  4. Use the Hugging Face Trainer to train the model. At first, I wrote my own training loop, but since the HF Trainer comes with DeepSpeed support, I decided to move to it.
  5. But what happens is that the binarized data becomes so large that it cannot be fully loaded into RAM (not even with 428 GB of RAM).

I was wondering, is there any way I can shard the data and perform lazy loading? What I mean by lazy loading is loading a binarized shard into memory only when the model is being trained on that specific shard.

The tradeoff is: either I learn DeepSpeed and integrate it into my own training loop, which seems like a very hectic job, or I try to solve this issue with HF Datasets.

@patrickvonplaten @valhalla

Hey there,

With datasets, all the pre-processed (binarized) data will be cached on the disk and will be loaded lazily by default. So you won’t run into any RAM issues at all.

So instead of binarizing the data first, you could load it with datasets and then pre-process it using the .map method. This caches the processed data on disk, which you can then load lazily.
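For illustration, a minimal sketch of that flow (the data files, model name, and tokenization function below are placeholders, not taken from the post above):

from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical paths and checkpoint; adjust to your own data and model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw_dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# .map runs the pre-processing once and caches the result as Arrow files on disk.
# The cached dataset is memory-mapped, so it is read lazily instead of being
# loaded into RAM in one go, and can be passed to Trainer as train_dataset.
tokenized = raw_dataset["train"].map(tokenize_fn, batched=True, remove_columns=["text"])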

You could refer to the training scripts in transformers examples.

Also, refer to the datasets docs.

Hope this helps!


@valhalla I will definitely re-check, but here's what I did, as far as I can remember.

I wrote a data loading script using this tutorial.

import nlp
import torch

# Load the raw splits via the custom data loading script.
train_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.VALIDATION)

processor = DataProcessor(
    tokenizer,
    model_type=data_args.model_type,
    max_source_length=data_args.max_source_length,
    max_target_length=data_args.max_target_length,
)
# DataProcessor implements all the necessary `map` calls (in a distributed manner)
# and a `convert_to_features` function using the `tokenizer` provided.
processor.process_all_maping_and_tokenization()

# Save the processed datasets as torch pickles.
torch.save(train_dataset, train_path)
torch.save(valid_dataset, valid_path)

I actually took it from your data preparation here. But DataProcessor is kind of different.

Later on, I tried it with the HF Trainer:

# Load the binarized datasets saved earlier (this pulls everything into RAM).
train_dataset = torch.load(data_args.train_file_path)
valid_dataset = torch.load(data_args.valid_file_path)

data_collator = MyDataCollator(
    tokenizer=tokenizer,
    model_type=model_args.model_type,
    mode="training",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    # ... remaining arguments omitted ...
)

Finally, I launch it with:

python -m torch.distributed.launch --nproc_per_node $NGPU train.py
... arguments ...
... arguments ...
... arguments ...

When I start the process, the job fails completely on an 8x V100 (16 GB) machine by overflowing the RAM.

Is there anything I’m doing wrong?

torch.load is the culprit here: it loads all of the data into memory. As I said above, datasets already takes care of caching the dataset on disk, so there is no need to use torch.save and torch.load.

If you want to save the dataset manually, have a look at this section.
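Roughly along these lines (the paths are hypothetical, and tokenized_dataset stands for a dataset already processed with .map):

from datasets import load_from_disk

# tokenized_dataset: a datasets.Dataset produced by .map (hypothetical name).
# Persist the processed Arrow files explicitly, instead of torch.save...
tokenized_dataset.save_to_disk("processed/train")

# ...and reload them later. load_from_disk memory-maps the Arrow files,
# so the data is read lazily instead of being pulled fully into RAM.
train_dataset = load_from_disk("processed/train")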


Ahh…!!! Got it. Thanks a lot for this help.