TensorBoard does not load on the Hub but loads locally; tfevents files are uploaded to the Hub

I trained a model using trainer.train() and pushed it with trainer.push_to_hub().

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=fine_tuned_model,  # same name as model
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=50,
    max_steps=max_steps,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=200,
    eval_steps=200,
    logging_steps=200,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    save_total_limit=1,
    hub_strategy="all_checkpoints"
)

from transformers import Seq2SeqTrainer
from transformers.integrations import TensorBoardCallback

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=audio_dataset["train"],
    eval_dataset=audio_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[TensorBoardCallback()]
)

The model is pushed to the Hub.

The log messages show that the tfevents files are pushed to the Hub:

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]
events.out.tfevents.1708962517.57b658957df3.26.1:   0%|          | 0.00/9.64k [00:00<?, ?B/s]
events.out.tfevents.1708962517.57b658957df3.26.0:   0%|          | 0.00/9.64k [00:00<?, ?B/s]
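
For anyone debugging the same thing, one way to double-check the upload is to look at which directories of the repo actually contain event files. Below is a minimal sketch that groups event files by directory from a list of repo-relative paths. The helper name `group_event_files` and the `runs/example_run` directory are hypothetical, made up for illustration; only the two event file names come from the log above.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# TensorBoard event files written by the summary writer start with this prefix.
EVENT_PREFIX = "events.out.tfevents."


def group_event_files(repo_files):
    """Map each directory to the sorted event file names found in it."""
    runs = defaultdict(list)
    for path in repo_files:
        p = PurePosixPath(path)
        if p.name.startswith(EVENT_PREFIX):
            runs[str(p.parent)].append(p.name)
    return {directory: sorted(names) for directory, names in runs.items()}


# Hypothetical repo listing; the run directory name is an assumption.
example_listing = [
    "config.json",
    "runs/example_run/events.out.tfevents.1708962517.57b658957df3.26.0",
    "runs/example_run/events.out.tfevents.1708962517.57b658957df3.26.1",
]

for run_dir, names in group_event_files(example_listing).items():
    print(run_dir, names)
```

If `huggingface_hub` is installed, the list of paths could come from `huggingface_hub.list_repo_files("<user>/<model>")`. An empty result would suggest the files never reached the repo, while a non-empty one points at the viewer side instead.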

I can view TensorBoard on my local computer.

However, when I open TensorBoard on the Hub, it errors out:

### No dashboards are active for the current data set.

Probable causes:

* You haven’t written any data to your event files.
* TensorBoard can’t find your event files.
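
To rule out the first cause locally without installing anything extra, the records in an event file can be counted directly. This is a rough sketch that assumes the usual TFRecord framing (8-byte little-endian length, 4-byte CRC, payload, 4-byte CRC) and does not validate the checksums. The function names are hypothetical, and the demo writes a synthetic file with placeholder CRCs rather than real TensorBoard data.

```python
import struct
import tempfile
from pathlib import Path


def count_tfrecord_records(path):
    """Count length-prefixed records in a TFRecord-style file (CRCs are skipped, not checked)."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file or truncated length field
            (length,) = struct.unpack("<Q", header)
            f.seek(4, 1)  # skip the CRC of the length field
            data = f.read(length)
            if len(data) < length:
                break  # truncated payload, do not count it
            f.seek(4, 1)  # skip the CRC of the payload
            count += 1
    return count


def write_fake_records(path, payloads):
    """Write payloads using the same framing, with zeroed placeholder CRCs (demo only)."""
    with open(path, "wb") as f:
        for payload in payloads:
            f.write(struct.pack("<Q", len(payload)))
            f.write(b"\x00" * 4)
            f.write(payload)
            f.write(b"\x00" * 4)


with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "events.out.tfevents.demo"
    write_fake_records(demo_file, [b"version-record", b"scalar-record"])
    print(count_tfrecord_records(demo_file))  # prints 2
```

As far as I know, a freshly created event file already holds one version record, so a count of one or zero on a real file would suggest that no scalars were logged. A larger count means the data is there and the problem is more likely in how the files are found or served.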

Any pointers on how to get TensorBoard to load on the Hub?

Were you able to fix this? I’ve come across the same problem: it loads locally but doesn’t work online.

I suspect there is something wrong with the way TensorBoard is trying to extract the data, or more specifically with where it’s trying to fetch its data from. Any further help would be appreciated.

@LennyBijan Thank you for responding. It appears to be an intermittent Hugging Face issue. TensorBoard loads fine for that model now, but it does not load for other models.

Note: I uploaded the same tfevents files to wandb, and they load just fine for every model.

Happy to share if and when I find a definite answer.

Keeping the discussion open.
Any findings are welcome.

Thank you

Any update on this? I’m experiencing the same behavior, and it seems to be transient. Sometimes it loads correctly, other times it fails.

Same issue here. I can load TensorBoard on Colab but can’t load it under the Hugging Face “Training metrics” tab.

I’m having the exact same issue. The “Training metrics” tab loads for some models but not for others, even though the code producing them is the same.

It has been fixed! Thanks for reporting

Hello, I’m experiencing the same problem today. @severo, are there any solutions for this?