@valhalla I will definitely re-check, but here’s what I did, as far as I can remember.
First, I wrote a data loading script following this tutorial:
train_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset(data_args.dataset_path, name='dummy', split=nlp.Split.VALIDATION)
processor = DataProcessor(
tokenizer,
model_type=data_args.model_type,
max_source_length=data_args.max_source_length,
max_target_length=data_args.max_target_length
)
# DataProcessor implements all the necessary `map` calls (in a distributed manner) and the `convert_to_features` function using the provided `tokenizer`.
processor.process_all_maping_and_tokenization()
torch.save(train_dataset, train_path)
torch.save(valid_dataset, valid_path)
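In case it helps, here is a rough sketch of what my `convert_to_features` step looks like conceptually. The field names (`source_text`, `target_text`) and the `StubTokenizer` are placeholders I made up so the sketch is self-contained; the real code uses the HF tokenizer and my actual column names.

```python
# Stub standing in for a real HF tokenizer, just so this sketch runs on its own.
class StubTokenizer:
    def __call__(self, texts, max_length=None, **kwargs):
        # Encode each character as its code point, truncated to max_length.
        ids = [[ord(c) for c in t][:max_length] for t in texts]
        return {"input_ids": ids, "attention_mask": [[1] * len(x) for x in ids]}

def convert_to_features(batch, tokenizer, max_source_length, max_target_length):
    # Tokenize source and target separately, with their own length limits.
    source = tokenizer(batch["source_text"], max_length=max_source_length)
    target = tokenizer(batch["target_text"], max_length=max_target_length)
    return {
        "input_ids": source["input_ids"],
        "attention_mask": source["attention_mask"],
        "labels": target["input_ids"],
    }
```

The real version is passed to `dataset.map(..., batched=True)` so the whole split is tokenized up front before `torch.save`.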
I actually adapted it from your data preparation script here, but my DataProcessor is somewhat different.
Later on, I tried training with the HF Trainer:
train_dataset = torch.load(data_args.train_file_path)
valid_dataset = torch.load(data_args.valid_file_path)
data_collator = MyDataCollator(
tokenizer=tokenizer,
model_type=model_args.model_type,
mode="training",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=valid_dataset,
data_collator=data_collator,
... ...
... ...
)
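For reference, my collator essentially just pads the pre-tokenized examples into batch tensors. This is a simplified, plain-Python sketch of the idea (field names match the sketch above and are my assumption, not the exact code):

```python
def collate(examples, pad_id=0):
    # Pad every sequence in the batch to the longest input length.
    max_len = max(len(e["input_ids"]) for e in examples)

    def pad(seq):
        return seq + [pad_id] * (max_len - len(seq))

    return {
        "input_ids": [pad(e["input_ids"]) for e in examples],
        "attention_mask": [pad(e["attention_mask"]) for e in examples],
        "labels": [pad(e["labels"]) for e in examples],
    }
```

The real `MyDataCollator` returns `torch.Tensor`s instead of lists, but the padding logic is the same.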
Finally, I launched it with:
python -m torch.distributed.launch --nproc_per_node $NGPU train.py
... arguments ...
... arguments ...
... arguments ...
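One thing I suspect (this is just my own back-of-envelope reasoning, which may be wrong): `torch.distributed.launch` spawns one `train.py` process per GPU, so each of the NGPU processes calls `torch.load` on the full dataset independently, multiplying host RAM usage:

```python
# Purely illustrative arithmetic: every launched process holds its own copy
# of the dataset in host RAM, so usage scales with the number of processes.
def host_ram_for_datasets(dataset_gb, n_procs):
    return dataset_gb * n_procs
```

So e.g. a 20 GB pickled dataset loaded by 8 processes would need ~160 GB of host RAM before training even starts.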
When I start the process, the job fails completely on an 8×V100 (16 GB) machine by exhausting host RAM.
Is there anything I’m doing wrong?