Training ALBERT from scratch with Distributed Training

Hello. I am currently trying to train an ALBERT model from scratch on domain-specific data. I have around 4.8 GB of text to use as a training dataset, and 2 nodes at my disposal, each with 4 V100 GPUs. Here is my code:

import sentencepiece as spm
import transformers
import torch
import tokenizers
from nlp import load_dataset

dataset_f = "dataset.txt" # file that contains the dataset

""" IF  YOU WANT TO TRAIN A NEW TOKENIZER """
# spm.SentencePieceTrainer.train(input=dataset_f, model_prefix="tokenizer", vocab_size=30000)
""" IF  YOU ALREADY HAVE ONE """
tokenizer = transformers.AlbertTokenizer.from_pretrained("tokenizer.model")

config = transformers.AlbertConfig.from_pretrained("albert_config.json")
model = transformers.AlbertForMaskedLM(config=config)

# load the raw text file as a single 'train' split
dataset = load_dataset('nlp/text.py', data_files={'train': [dataset_f]})

# tokenize the text (the text loader yields one example per line)
dataset = dataset['train'].map(lambda batch: tokenizer(batch['text']), batched=True)

# keep only the input_ids, as tensors, for the data collator below
dataset_mod = [torch.tensor(ids) for ids in dataset['input_ids']]

# dynamic masking for MLM: 15% of tokens are selected for masking in each batch
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = transformers.TrainingArguments(
    output_dir="./model",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=16,  # batch size on each GPU
    save_steps=10000,
    save_total_limit=2,
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_mod,
    data_collator=data_collator,
    prediction_loss_only=True,
)


trainer.train()
trainer.save_model("./model")
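
For reference, this is roughly how the collator output can be inspected on a couple of the tokenized examples (an illustration only, not part of the script above):

# Illustration: build one small batch by hand and see what the MLM collator produces.
batch = data_collator(dataset_mod[:2])
print(batch["input_ids"].shape)  # (2, padded_sequence_length), with ~15% of tokens selected for masking
print(batch["labels"].shape)     # same shape; positions that were not selected are set to -100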

I’m only using one epoch just to see whether the process works.
I then launch 8 processes (4 on each node, i.e. 1 per GPU) using these commands; a sketch of the --local_rank plumbing involved follows the list:
1. SINGULARITYENV_CUDA_VISIBLE_DEVICES=$RANK
2. python -m torch.distributed.launch --nproc_per_node=1 --nnodes=8 --node_rank=$RANK --master_addr=$MASTER --master_port=$MASTER_PORT pretrain.py
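
For context, my understanding is that torch.distributed.launch appends a --local_rank argument to every process it spawns, and the standard Trainer examples forward it to TrainingArguments roughly as sketched below; my script above does not do this, the sketch is only here to show the mechanism.

import argparse
import transformers

# Sketch only: pick up the --local_rank that torch.distributed.launch passes to
# each spawned process and hand it to the Trainer via TrainingArguments.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

training_args = transformers.TrainingArguments(
    output_dir="./model",          # same output directory as above
    local_rank=args.local_rank,    # -1 keeps the Trainer in single-process mode
)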

From what I can observe with the nvidia-smi command, this is not working as I intended: each process runs the entire script on its own, and the processes don't run concurrently, either. The first one starts and finishes, then the second one runs, and so on.
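
In case it's relevant, a check like the one below, placed right before trainer.train(), is what I would use to confirm whether a process is actually in distributed mode (again just a sketch, it is not in the script above):

import torch

# Hypothetical diagnostic: if local_rank is -1 and the process group is not
# initialised, the Trainer is running as a plain single-process job.
print("local_rank:", training_args.local_rank)
print("visible GPUs:", torch.cuda.device_count())
print("process group initialised:",
      torch.distributed.is_available() and torch.distributed.is_initialized())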

I am using:

Thanks in advance for any help, it’s much appreciated.