Lazy-loading binarized shards using HF Datasets with the HF Trainer

@valhalla I will definitely re-check, but here's what I did, as far as I can remember.

First, I wrote a data loading script following this tutorial.
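
A loading script built from that tutorial follows the usual builder skeleton, roughly like this (the class, config, column, and file names below are just illustrative placeholders, not my actual script):

import nlp

class DummyConfig(nlp.BuilderConfig):
    """Illustrative config; the real script defines its own fields."""
    pass

class DummyDataset(nlp.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        DummyConfig(name="dummy", version=nlp.Version("1.0.0"), description="illustrative dummy config"),
    ]

    def _info(self):
        # Declare the raw text columns before any tokenization
        return nlp.DatasetInfo(
            features=nlp.Features({
                "source_text": nlp.Value("string"),
                "target_text": nlp.Value("string"),
            })
        )

    def _split_generators(self, dl_manager):
        # Point each split at its raw file (paths are placeholders)
        return [
            nlp.SplitGenerator(name=nlp.Split.TRAIN, gen_kwargs={"filepath": "train.tsv"}),
            nlp.SplitGenerator(name=nlp.Split.VALIDATION, gen_kwargs={"filepath": "valid.tsv"}),
        ]

    def _generate_examples(self, filepath):
        # Yield one (key, example) pair per raw line
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                source, target = line.rstrip("\n").split("\t")
                yield idx, {"source_text": source, "target_text": target}

With that script in place, I load the two splits and preprocess them: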

import nlp
import torch

train_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.VALIDATION)

processor = DataProcessor(
    tokenizer,
    model_type=data_args.model_type,
    max_source_length=data_args.max_source_length,
    max_target_length=data_args.max_target_length,
)
# DataProcessor implements all the necessary `map` calls (in a distributed manner) and the
# `convert_to_features` function using the provided `tokenizer` (simplified sketch below).
processor.process_all_mapping_and_tokenization()

torch.save(train_dataset, train_path)
torch.save(valid_dataset, valid_path)
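
For context, before the datasets are saved, the maps the processor runs over each split look roughly like this (a simplified sketch, assuming the column names from the loading script above; the real class has more options):

# Simplified sketch of the processor's tokenization map (column names as above)
def convert_to_features(example_batch):
    source_encodings = tokenizer.batch_encode_plus(
        example_batch["source_text"],
        max_length=data_args.max_source_length,
        padding="max_length",
        truncation=True,
    )
    target_encodings = tokenizer.batch_encode_plus(
        example_batch["target_text"],
        max_length=data_args.max_target_length,
        padding="max_length",
        truncation=True,
    )
    return {
        "input_ids": source_encodings["input_ids"],
        "attention_mask": source_encodings["attention_mask"],
        "target_ids": target_encodings["input_ids"],
        "target_attention_mask": target_encodings["attention_mask"],
    }

train_dataset = train_dataset.map(convert_to_features, batched=True)
valid_dataset = valid_dataset.map(convert_to_features, batched=True)

# Keep only the tokenized columns and return them as torch tensors
columns = ["input_ids", "attention_mask", "target_ids", "target_attention_mask"]
train_dataset.set_format(type="torch", columns=columns)
valid_dataset.set_format(type="torch", columns=columns)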

I actually took this from your data preparation script here, but my DataProcessor is somewhat different.

Later on, I tried training with the HF Trainer:

train_dataset = torch.load(data_args.train_file_path)
valid_dataset = torch.load(data_args.valid_file_path)

data_collator = MyDataCollator(
        tokenizer=tokenizer,
        model_type=model_args.model_type,
        mode="training",
    )
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        data_collator=data_collator,
        # ... (other arguments) ...
    )
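
The MyDataCollator above basically just stacks the cached features into model inputs, along the lines of this trimmed sketch (the real one also branches on model_type and mode; the defaults here are placeholders):

import torch

class MyDataCollator:
    """Trimmed sketch: stack cached features into a batch of model inputs."""

    def __init__(self, tokenizer, model_type="t5", mode="training"):
        self.tokenizer = tokenizer
        self.model_type = model_type
        self.mode = mode

    # NB: older Trainer versions expect this logic under a `collate_batch` method instead of __call__
    def __call__(self, batch):
        input_ids = torch.stack([example["input_ids"] for example in batch])
        attention_mask = torch.stack([example["attention_mask"] for example in batch])
        labels = torch.stack([example["target_ids"] for example in batch])
        # Mask out padding so it is ignored by the loss
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }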

Finally, I launch it with:

python -m torch.distributed.launch --nproc_per_node $NGPU train.py
... arguments ...
... arguments ...
... arguments ...
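
In train.py, those arguments are parsed into the model_args / data_args / training_args objects used above, roughly like this (a simplified sketch only; the fields are limited to what the snippets above reference, and the defaults are placeholders):

from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    # Placeholder: only the field referenced above
    model_type: str = field(default="t5")

@dataclass
class DataTrainingArguments:
    # Placeholders: only the fields referenced above
    dataset_path: str = field(default="./data")
    train_file_path: str = field(default="./data/train_data.pt")
    valid_file_path: str = field(default="./data/valid_data.pt")
    model_type: str = field(default="t5")
    max_source_length: int = field(default=512)
    max_target_length: int = field(default=64)

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()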

When I start the process, the job fails completely on an 8x V100 (16 GB) machine because it overflows the RAM.

Is there anything I’m doing wrong?