Hello,
I am running an example summarization training task taken from here (official HuggingFace example) on a multi-GPU machine, using the following versions: torch==1.11.0+cu113 and transformers==4.20.1. The only difference is that instead of using google/mt5-small as the model, I am using facebook/bart-base.
I am getting two warnings. I believe they are raised when the model is trying to gather results from multiple GPUs, but I'm struggling to understand whether they are normal or an indication that something isn't working in the code. The first warning is:
```
Parameter 'function'=<function preprocess_function at 0x7f797c2de0e0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```
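To illustrate my understanding of this first warning: `datasets` fingerprints the function passed to `.map()` so it can cache results, and the warning says it fell back to a random fingerprint because `preprocess_function` couldn't be hashed. This is a minimal sketch of the check I was planning to run (separate from the training script, and assuming `datasets.fingerprint.Hasher` is the relevant entry point; `preprocess_function` is the one defined in the reproducer below):

```python
from datasets.fingerprint import Hasher

try:
    # If this succeeds, the function can be fingerprinted and caching should work.
    print("fingerprint:", Hasher.hash(preprocess_function))
except Exception as err:
    # When this fails, datasets falls back to a random hash, which is what the
    # warning reports; the map() result is then recomputed instead of cached.
    print("could not hash preprocess_function:", err)

# If caching does not matter for a given run, the cache lookup can also be
# skipped explicitly via the load_from_cache_file argument of map():
# english_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
```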
The other warning is:

```
venv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
```
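If it helps, my reading of this one is that it comes from nn.DataParallel gathering the per-GPU losses: each replica returns a 0-dim (scalar) tensor, so the gather unsqueezes each one and returns a 1-D vector instead of failing. A rough CPU-only sketch of what I think happens (just my mental model, not the actual torch.nn.parallel code path):

```python
import torch

# Pretend these are the scalar losses returned by two GPU replicas.
per_replica_losses = [torch.tensor(0.93), torch.tensor(1.07)]

# Scalars cannot be concatenated along dim 0 directly, so each one is
# unsqueezed to shape (1,) and the gathered result is a vector.
gathered = torch.cat([loss.unsqueeze(0) for loss in per_replica_losses], dim=0)
print(gathered)         # tensor([0.9300, 1.0700])
print(gathered.mean())  # the Trainer then averages this into a single loss, as far as I can tell
```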
Could someone tell me whether these warnings are likely the result of something not working properly in the code and failing silently (the code runs without raising any exceptions)?
For reference, I copy below a small reproducible example of the code generating the warnings, adapted from the official HuggingFace notebook linked above:
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)
from datasets import load_dataset

model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

english_dataset = load_dataset("amazon_reviews_multi", "en")


def preprocess_function(examples):
    model_inputs = tokenizer(examples["review_body"], max_length=512, truncation=True)
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["review_title"], max_length=30, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_datasets = english_dataset.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(
    english_dataset["train"].column_names
)

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
data_collator(features)

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
trainer.evaluate()
```