HuggingFace summarization training example notebook raises two warnings when run on multiple GPUs

Hello,
I am running an example summarization training task taken from here (the official HuggingFace example) on a multi-GPU machine, using the following versions: torch==1.11.0+cu113 and transformers==4.20.1. The only difference is that instead of google/mt5-small I am using facebook/bart-base as the model.

I am getting two warnings. I believe they are raised when the model is trying to gather results from multiple GPUs, but I'm struggling to understand whether they are normal or an indication that something isn't working in the code. The first warning is

Parameter 'function'=<function preprocess_function at 0x7f797c2de0e0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
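
If it is useful, I believe the fingerprinting failure can be checked outside of map with something like the following sketch (Hasher is, as far as I understand, what datasets uses internally to fingerprint transforms):

# Sketch: check whether the mapped function can be fingerprinted for caching.
# If Hasher.hash raises here, .map() falls back to a random hash and emits the
# warning above; a common culprit is an object captured in the closure (for
# example the tokenizer) that cannot be pickled deterministically.
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

def preprocess_function(examples):
    return tokenizer(examples["review_body"], max_length=512, truncation=True)

try:
    print(Hasher.hash(preprocess_function))
except Exception as exc:
    print(f"could not hash the transform: {exc}")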

The other warning is

venv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
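
My best guess for this one is that it comes from torch.nn.DataParallel gathering the 0-dimensional per-GPU losses; the same warning can apparently be triggered directly with a sketch like this (needs at least two visible GPUs):

# Sketch: gather two scalar (0-dim) loss tensors from different GPUs, which is
# what DataParallel does with the per-replica losses; gather() unsqueezes them
# into a 1-D tensor and emits the same UserWarning.
import torch
from torch.nn.parallel import gather

losses = [torch.tensor(0.5, device="cuda:0"), torch.tensor(0.7, device="cuda:1")]
print(gather(losses, target_device=0))  # expected: tensor([0.5000, 0.7000], device='cuda:0')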

Could someone tell me whether these warnings are likely the result of something not working properly and failing silently (the code runs without raising any exceptions)?
For reference, I copy below a small reproducible example of the code generating the warnings, adapted from the official HuggingFace notebook linked above.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from datasets import load_dataset

model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

english_dataset = load_dataset("amazon_reviews_multi", "en")

def preprocess_function(examples):
    model_inputs = tokenizer(examples["review_body"], max_length=512, truncation=True)
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["review_title"], max_length=30, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = english_dataset.map(preprocess_function, batched=True)
# Drop the original text columns so only the tokenized fields remain
tokenized_datasets = tokenized_datasets.remove_columns(
    english_dataset["train"].column_names
)

# Quick sanity check: collate two examples into a padded batch
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
data_collator(features)

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()
trainer.evaluate()
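
One thing I would still like to confirm is whether the Trainer is falling back to torch.nn.DataParallel here, which I believe happens whenever several GPUs are visible but no distributed backend is initialized. Continuing from the snippet above, a quick check would be (a sketch):

# Sketch: inspect how the TrainingArguments see the hardware. If parallel_mode
# is NOT_DISTRIBUTED while more than one GPU is visible, the Trainer should
# (as far as I understand) wrap the model in torch.nn.DataParallel, whose
# gather of the per-GPU scalar losses would explain the second warning.
from transformers.training_args import ParallelMode

print(args.n_gpu)                                          # GPUs visible to this run
print(args.parallel_mode == ParallelMode.NOT_DISTRIBUTED)  # True -> DataParallel path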

I have the same issue, did you resolve it?

Not yet unfortunately :frowning:
The code seems to run fine, but the warnings are still there and I'm not sure what should be done about them.

Hello, I found that this problem may be caused by the Trainer. When I used the script run_glue.py, I got the UserWarning, but when I used run_glue_no_trainer.py, everything was fine. I think that is because the no-trainer script uses its own Accelerate-based training loop instead of the Trainer.

Thanks for pointing this out. Unfortunately I am not using their training script but a custom one. What I would like to understand is why this issue occurs and which part of the underlying code is causing it. Knowing that run_glue.py raises the warning while run_glue_no_trainer.py does not may be helpful if you plan to use those scripts, but it does not really address my question about why this is happening. For this reason I will keep the question open.
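
From what I could piece together so far, under DataParallel each replica returns a 0-dim loss, the gather step unsqueezes them into a vector (hence the warning), and the Trainer then reduces that vector with .mean() before the backward pass. A simplified sketch of that logic (not the actual Trainer source):

# Simplified sketch of what I believe happens per training step when n_gpu > 1:
# the gathered per-replica losses form a 1-D tensor, and the Trainer averages
# it into a single scalar before calling backward, so the warning looks benign.
import torch

gathered_losses = torch.tensor([0.52, 0.48])  # one scalar loss per GPU, after gather/unsqueeze
loss = gathered_losses.mean()                 # single scalar used for the backward pass
print(loss)  # tensor(0.5000)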

Did this help you? Using Transformers with DistributedDataParallel — any examples?