Hello,
I am running an example summarization training task taken from here (official HuggingFace example) on a multi-GPU machine, using the following versions: torch==1.11.0+cu113 and transformers==4.20.1. The only difference is that instead of using google/mt5-small as the model, I am using facebook/bart-base.
I am getting two warnings. I believe they are raised when the model is trying to gather results from multiple GPUs, but I'm struggling to understand whether they are normal or an indication that something isn't working in the code. The first warning is:
```
Parameter 'function'=<function preprocess_function at 0x7f797c2de0e0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```
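To illustrate my understanding of this first warning: `datasets` fingerprints the function passed to `.map()` so it can cache results, and the warning says it fell back to a random fingerprint because `preprocess_function` couldn't be hashed. This is a minimal sketch of the check I was planning to run (separate from the training script, and assuming `datasets.fingerprint.Hasher` is the relevant entry point; `preprocess_function` is the one defined in the reproducer below):

```python
from datasets.fingerprint import Hasher

try:
    # If this succeeds, the function can be fingerprinted and caching should work.
    print("fingerprint:", Hasher.hash(preprocess_function))
except Exception as err:
    # When this fails, datasets falls back to a random hash, which is what the
    # warning reports; the map() result is then recomputed instead of cached.
    print("could not hash preprocess_function:", err)

# If caching does not matter for a given run, the cache lookup can also be
# skipped explicitly via the load_from_cache_file argument of map():
# english_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
```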
The other warning is:

```
venv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
```
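If it helps, my reading of this one is that it comes from nn.DataParallel gathering the per-GPU losses: each replica returns a 0-dim (scalar) tensor, so the gather unsqueezes each one and returns a 1-D vector instead of failing. A rough CPU-only sketch of what I think happens (just my mental model, not the actual torch.nn.parallel code path):

```python
import torch

# Pretend these are the scalar losses returned by two GPU replicas.
per_replica_losses = [torch.tensor(0.93), torch.tensor(1.07)]

# Scalars cannot be concatenated along dim 0 directly, so each one is
# unsqueezed to shape (1,) and the gathered result is a vector.
gathered = torch.cat([loss.unsqueeze(0) for loss in per_replica_losses], dim=0)
print(gathered)         # tensor([0.9300, 1.0700])
print(gathered.mean())  # the Trainer then averages this into a single loss, as far as I can tell
```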
Could someone tell me whether these warnings are likely the result of something not working properly in the code and failing silently (the code runs without raising any exceptions)?
For reference, I copy below a small reproducible example of the code generating the warnings, adapted from the official HuggingFace notebook linked above:
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)
from datasets import load_dataset

model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

english_dataset = load_dataset("amazon_reviews_multi", "en")


def preprocess_function(examples):
    model_inputs = tokenizer(examples["review_body"], max_length=512, truncation=True)
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["review_title"], max_length=30, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_datasets = english_dataset.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(
    english_dataset["train"].column_names
)

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
data_collator(features)

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
trainer.evaluate()
```