Training ALBERT from scratch with Distributed Training

Hello. I am currently trying to train an ALBERT model from scratch on domain-specific data. I have around 4.8 GB of text to use as a training dataset, and 2 nodes at my disposal, each with 4 V100 GPUs. Here is my code:

import sentencepiece as spm
import transformers
import torch
import tokenizers
from nlp import load_dataset

dataset_f = "dataset.txt" # file that contains the dataset

""" IF  YOU WANT TO TRAIN A NEW TOKENIZER """
# spm.SentencePieceTrainer.train(input=dataset_f, model_prefix="tokenizer", vocab_size=30000)
""" IF  YOU ALREADY HAVE ONE """
tokenizer = transformers.AlbertTokenizer.from_pretrained("tokenizer.model")

config = transformers.AlbertConfig.from_pretrained("albert_config.json")
model = transformers.AlbertForMaskedLM(config=config)

# load the raw text file as a single 'train' split
dataset = load_dataset('nlp/text.py', data_files={'train': [dataset_f]})

# tokenize the text (the text loader yields one example per line)
dataset = dataset['train'].map(lambda batch: tokenizer(batch['text']), batched=True)

# keep only the input_ids, as tensors, for the data collator below
dataset_mod = [torch.tensor(ids) for ids in dataset['input_ids']]

# dynamic masking for MLM: 15% of tokens are selected for masking in each batch
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = transformers.TrainingArguments(
    output_dir="./model",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=16,  # batch size on each GPU
    save_steps=10000,
    save_total_limit=2,
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_mod,
    data_collator=data_collator,
    prediction_loss_only=True,
)


trainer.train()
trainer.save_model("./model")
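
For reference, this is roughly how the collator output can be inspected on a couple of the tokenized examples (an illustration only, not part of the script above):

# Illustration: build one small batch by hand and see what the MLM collator produces.
batch = data_collator(dataset_mod[:2])
print(batch["input_ids"].shape)  # (2, padded_sequence_length), with ~15% of tokens selected for masking
print(batch["labels"].shape)     # same shape; positions that were not selected are set to -100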

I’m only using one epoch just to see whether the process works.
I then launch 8 processes (4 on each node, i.e. 1 per GPU) using these commands; a sketch of the --local_rank plumbing involved follows the list:
1. SINGULARITYENV_CUDA_VISIBLE_DEVICES=$RANK
2. python -m torch.distributed.launch --nproc_per_node=1 --nnodes=8 --node_rank=$RANK --master_addr=$MASTER --master_port=$MASTER_PORT pretrain.py
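
For context, my understanding is that torch.distributed.launch appends a --local_rank argument to every process it spawns, and the standard Trainer examples forward it to TrainingArguments roughly as sketched below; my script above does not do this, the sketch is only here to show the mechanism.

import argparse
import transformers

# Sketch only: pick up the --local_rank that torch.distributed.launch passes to
# each spawned process and hand it to the Trainer via TrainingArguments.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

training_args = transformers.TrainingArguments(
    output_dir="./model",          # same output directory as above
    local_rank=args.local_rank,    # -1 keeps the Trainer in single-process mode
)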

From what I can observe with the nvidia-smi command, this is not working as I intended: each process runs the entire script on its own, and the processes don't run concurrently, either. The first one starts and finishes, then the second one runs, and so on.
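
In case it's relevant, a check like the one below, placed right before trainer.train(), is what I would use to confirm whether a process is actually in distributed mode (again just a sketch, it is not in the script above):

import torch

# Hypothetical diagnostic: if local_rank is -1 and the process group is not
# initialised, the Trainer is running as a plain single-process job.
print("local_rank:", training_args.local_rank)
print("visible GPUs:", torch.cuda.device_count())
print("process group initialised:",
      torch.distributed.is_available() and torch.distributed.is_initialized())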

I am using:

Thanks in advance for any help, it’s much appreciated.