Trouble finetuning my model on Trn1 (no checkpoints saved, zero_1 not working, no evaluation possible)

Hello huggingface community!

First, the good news: thanks to Hugging Face, I was able to create my own model and test-train it successfully on my local machine with PyTorch and my own GPU, using the Trainer API from the transformers library. So far, so good. For the real training, I then switched to AWS trn1.2xlarge instances using the Hugging Face Neuron Deep Learning AMI and replaced Trainer and TrainingArguments in my code with NeuronTrainer and NeuronTrainingArguments. After some quirks, I got the model to train nicely, and it learns as well; I verified this on my evaluation dataset after downloading the model to my local machine. Great, I'm pretty happy so far.

However, I still have some problems left, and maybe someone here can help me solve these issues:

  1. I cannot evaluate the model while training (for example, after each epoch). If I set a custom metrics function and pass my evaluation dataset to NeuronTrainer, I get the error: “Evaluation is not supported in NeuronTrainer.” So, is it really true that I cannot evaluate my model on a Trn1 instance?

  2. When I tried to save checkpoints, I ran into trouble as well. With save_strategy="epoch", no checkpoints get saved to the output directory (output_dir="./training" in NeuronTrainingArguments), and I have no clue why. Locally, it all works fine.

  3. When I switched to save_strategy="steps", I first got an error (something about a device/rank parameter being empty). After some fiddling around, I noticed it came from a module called "zero something", and after setting zero_1=False in NeuronTrainingArguments, the checkpoints get saved as expected. However, right after the first checkpoint is saved, grad_norm becomes 0 and the model stops learning. Does anyone have a clue why?

So, basically, while I am able to train on Trn1 instances now, I have to do it virtually blind: I can neither evaluate on the instance while training nor save checkpoints along the way; only the final model at the end works as expected. Maybe someone has had similar trouble and can help me solve these issues.

Thanks a lot,

Stephan


The XLA environment is pretty hellish, so I’m not even sure if the following is correct…:sweat_smile:


What your symptoms strongly suggest (Trainium / XLA context)

On Trn1, training runs on XLA (via Neuron + torch_xla), and Optimum’s NeuronTrainer is not feature-equivalent to the GPU Trainer. Two consequences matter most in your case:

  1. Evaluation is intentionally disabled in NeuronTrainer (it throws). (GitHub)
  2. Checkpointing is a synchronization-sensitive operation on XLA, and certain save sequences can cause hangs or even tensor corruption unless steps are marked/flushed correctly. (awsdocs-neuron.readthedocs-hosted.com)

Your “training stops learning right after the first checkpoint” is exactly the kind of “save boundary” failure XLA users run into when execution diverges across ranks or when saving interacts badly with lazy execution.


1) Evaluation: why you’re blocked, and what “supported” looks like on Trn1

Why the error happens

NeuronTrainer raises: “Evaluation is not supported in NeuronTrainer.” (GitHub)
So yes: with NeuronTrainer, you cannot do the usual evaluation_strategy="epoch"/"steps", compute_metrics, eval_dataset, etc.

The confusing part is that NeuronTrainingArguments still exposes many eval-related knobs (e.g., do_eval, eval_strategy, eval_steps) in the docs, but the trainer itself blocks the eval path. (Hugging Face)

Practical workaround that matches how Optimum-Neuron is designed

Do evaluation out-of-band:

  • Save checkpoints periodically during training
  • Consolidate shards if needed (common with Neuron distributed)
  • Evaluate in a separate process (on a GPU box/local, or as a separate job on AWS)

This is the standard operational pattern when the training runtime can’t cheaply/robustly run eval inside the same compiled execution.

If you must evaluate on the Trn1 machine during training

You typically switch to a non-NeuronTrainer approach (plain HF Trainer on torch-neuronx/torch-xla). That route supports do_eval, but you inherit stricter “XLA discipline” (static shapes, compilation behavior, etc.). (This is a different training stack than Optimum’s NeuronTrainer.)


2) “save_strategy='epoch' saves nothing”: the most likely causes

Likely cause A: you’re expecting “GPU-style checkpoints,” but you’re getting sharded output / rank-scoped writes

On Neuron distributed setups, checkpoints can be sharded and not look like a single pytorch_model.bin. Optimum-Neuron’s distributed training guide describes shard-based layouts and how ZeRO-1 behaves. (Hugging Face)

Also, in NeuronX Distributed, not every rank writes the same thing (e.g., some states are written on DP rank 0 depending on configuration). (awsdocs-neuron.readthedocs-hosted.com)
If you’re inspecting the directory from a different rank/context, it can look like “nothing saved.”
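To rule out "looking in the wrong place", it can help to walk the output directory from the launching process and print everything that was actually written, sharded or not. A minimal stdlib sketch (nothing here is Neuron-specific; the `./training` path matches the arguments quoted above):

```python
# Walk the trainer's output directory and list every file any rank wrote,
# so sharded or rank-scoped checkpoints don't get mistaken for "nothing saved".
from pathlib import Path

def list_outputs(output_dir: str) -> list:
    root = Path(output_dir)
    if not root.exists():
        return []
    # Collect file paths relative to the root, sorted for stable inspection.
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())

for f in list_outputs("./training"):
    print(f)
```

If this prints shard files under per-rank subdirectories rather than a single `pytorch_model.bin`, the save did happen, just in the sharded layout.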

Likely cause B: epoch boundaries are not being reached in the way you think

If you set max_steps (or your dataloader is effectively step-driven), the trainer can behave as “step-based,” and epoch-end callbacks may not trigger as expected.

Likely cause C: relative output_dir="./training" + multi-process launch

Relative paths are a common footgun when launching multi-process jobs. Even when it “works,” you can end up with output written somewhere other than where you’re looking. (This is not Neuron-specific, but it shows up more often on Trn1 runs because you usually launch distributed.)

What I would do immediately: use an absolute output directory on a mounted volume you know persists (e.g., EBS), and verify the directory contents from the primary process.
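For example (plain Python, independent of Neuron):

```python
# Resolve the checkpoint directory to an absolute path once, before building
# the training arguments, so every launched rank writes to the same place.
import os

output_dir = os.path.abspath("./training")
print(output_dir)
# then pass it on: NeuronTrainingArguments(output_dir=output_dir, ...)
```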


3) save_strategy='steps' + rank/device error + zero_1 + “grad_norm becomes 0”

This is the core of your case. Here are the two most plausible, high-signal explanations.

Explanation 1: XLA save boundary causes replica divergence or state corruption

Two related XLA pitfalls are documented:

  • Saving without a proper “step mark” can corrupt tensors in certain xm.save() sequences due to parameter aliasing. AWS explicitly recommends calling xm.mark_step() before xm.save() to avoid this class of issue. (awsdocs-neuron.readthedocs-hosted.com)
  • If only the “master” replica runs checkpointing code while others proceed differently, execution can diverge and hang (or behave incorrectly) because XLA execution is lazy/graph-based. The PyTorch/XLA issue explicitly illustrates the save order problem and why checkpointing must be coordinated. (GitHub)

Why this matches your symptom: you see normal learning until the first checkpoint save, then grad_norm collapses to 0 and training stops improving. That is a classic “something went wrong at the checkpoint boundary” signal on XLA.

Explanation 2: you’re interpreting grad_norm from a rank/stage where it becomes meaningless after the first save

Optimum-Neuron’s trainer tracks grad_norm (and in some distributed/pipeline setups, what gets logged per process can be misleading). (GitHub)
This is less likely given your statement that the model “stops learning,” but it’s worth verifying by looking at loss and/or evaluating the saved checkpoint externally.


Where zero_1 fits in (and why toggling it changes the failure mode)

zero_1 is ZeRO-1 optimizer-state sharding. Optimum’s own docs describe ZeRO-1 and when it’s beneficial. (Hugging Face)

Checkpointing with ZeRO-1 is a known sharp edge in Neuron stacks historically; NeuronX Distributed release notes explicitly mention “Fixed an issue with Zero1 checkpoint saving/loading” and also note that checkpointing is sharded and needs combining. (awsdocs-neuron.readthedocs-hosted.com)

So, a very practical interpretation of your experience is:

  • With zero_1=True, you hit a bug / incompatibility in the checkpointing path (rank/device empty error).
  • With zero_1=False, you avoid that path, but you still hit an XLA save boundary problem that knocks training off course after the first save (because saving is still happening, just via a different route).

This is consistent with the fact that Optimum-Neuron has ongoing trainer refactors and training-related fixes across releases. (GitHub)


What I would do for your run (a concrete plan)

Step 0: Make the run observable

Even if you can’t evaluate in-trainer, you can still avoid “blind training”:

  • Save checkpoints periodically
  • Evaluate those checkpoints externally (GPU/local)
  • Track a simple scalar like training loss + learning rate over time

Step 1: Switch to the “least fragile” checkpointing mode first

Use step-based saving, and initially simplify what you save:

  • save_strategy="steps"
  • save_steps = something moderate (e.g., every few hundred optimizer steps)
  • save_only_model=True (Optimum-Neuron exposes this) (Hugging Face)

This removes optimizer/scheduler/RNG state from the checkpoint, which reduces the ZeRO-1 and XLA serialization surface area. It’s ideal if your goal is “evaluate during training,” not “resume training exactly from checkpoint.”

Step 2: Keep XLA serialization settings conservative

Optimum-Neuron documents defaults like use_xser=True and async_save=False in the trainer args. (Hugging Face)
Stick with:

  • use_xser=True
  • async_save=False

because async saving increases complexity/memory pressure and can make save boundaries harder to reason about.
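Steps 1 and 2 together can be sketched as a single configuration. This is a hedged sketch only: the parameter names are the ones documented for Optimum-Neuron and quoted above, but verify them against your installed version, and both `save_steps=200` and the path are arbitrary example values.

```python
# Sketch: conservative checkpointing settings to pass to NeuronTrainingArguments.
# Parameter names per the Optimum-Neuron docs; check your installed version.
checkpoint_args = dict(
    output_dir="/home/ubuntu/training",  # absolute path on a persistent volume
    save_strategy="steps",
    save_steps=200,                      # arbitrary example value
    save_only_model=True,                # skip optimizer/scheduler/RNG state
    zero_1=False,                        # avoid the ZeRO-1 checkpointing path for now
    use_xser=True,                       # documented default
    async_save=False,                    # documented default
)
# args = NeuronTrainingArguments(**checkpoint_args)
```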

Step 3: Address the XLA save boundary hazard

Because AWS has a documented “xm.save() sequence can corrupt tensors” note, I would treat this as a version-sensitive issue and do one of:

  • Upgrade/downgrade to a Neuron stack version where the xm.save issue is mitigated (or where Optimum has incorporated the safe sequence), guided by your AMI’s supported versions.
  • Ensure the code path doing the save includes the “mark step before save” discipline that AWS recommends. (awsdocs-neuron.readthedocs-hosted.com)

You may not control Optimum’s internal save calls directly, so in practice this often means: move to a newer Optimum-Neuron + torch-neuronx/torch-xla combo (or a known-good pinned set).
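The "mark step before save" discipline itself is just an ordering constraint. The real torch_xla calls are xm.mark_step() and xm.save(); the FakeXm class below is a stand-in that only records call order, so the pattern can be checked without Trainium hardware.

```python
# Sketch of the "mark step before save" discipline AWS recommends for XLA
# checkpointing, using a stand-in that records the call sequence.

class FakeXm:
    """Stand-in for torch_xla.core.xla_model; records which calls happen in which order."""
    def __init__(self):
        self.calls = []

    def mark_step(self):
        # In real torch_xla this flushes the lazy graph so live tensors are
        # materialized before serialization.
        self.calls.append("mark_step")

    def save(self, obj, path):
        self.calls.append("save")

def checkpoint(xm, state, path):
    # Flush pending lazy computation first, then serialize.
    xm.mark_step()
    xm.save(state, path)

xm = FakeXm()
checkpoint(xm, {"step": 100}, "/tmp/ckpt.pt")
print(xm.calls)  # ['mark_step', 'save']
```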

Step 4: Re-introduce zero_1=True once model-only saving is stable

Given that the NxD release notes mention a Zero1 checkpoint fix, I would first confirm that model-only, step-based saving is stable with zero_1=False, then re-enable zero_1=True on an up-to-date Neuron stack and check that checkpoints still save cleanly.

If that's stable, and only then, consider saving optimizer state (if you truly need resume).

Step 5: External evaluation loop (replaces in-training eval)

Since NeuronTrainer can’t evaluate, you do:

  1. Train with checkpoint saves
  2. After each checkpoint: run a separate evaluation script that loads the checkpoint and runs metrics
  3. Log results (even a CSV is enough)

This gives you “eval every epoch/steps” operationally, without relying on unsupported trainer paths. (GitHub)
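Steps 2 and 3 can be sketched with the stdlib alone. Here `evaluate_checkpoint` is a hypothetical placeholder for your own metric script, and the directory layout assumes the trainer's usual `checkpoint-<step>` folders:

```python
# Watch the output directory for new checkpoints and append one CSV row per
# evaluated checkpoint. evaluate_checkpoint() is a hypothetical stand-in for
# a real script that loads the model and computes metrics.
import csv
from pathlib import Path

def evaluate_checkpoint(ckpt_dir: Path) -> float:
    # Placeholder: load the model from ckpt_dir and return a metric.
    return 0.0

def eval_new_checkpoints(output_dir: str, log_csv: str, seen: set) -> list:
    evaluated = []
    for ckpt in sorted(Path(output_dir).glob("checkpoint-*")):
        if ckpt.name in seen:
            continue  # already evaluated in an earlier pass
        metric = evaluate_checkpoint(ckpt)
        with open(log_csv, "a", newline="") as f:
            csv.writer(f).writerow([ckpt.name, metric])
        seen.add(ckpt.name)
        evaluated.append(ckpt.name)
    return evaluated
```

Run this periodically (cron, a loop with a sleep, or once per training epoch) and you get an eval curve without touching the trainer.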


“Why does the final model save work, but intermediate saves break training?”

This is common on XLA systems: end-of-training saving typically happens after a controlled barrier/teardown, while mid-loop saving happens inside the compiled/lazy execution flow. If that mid-loop save causes:

  • divergence across replicas, or
  • a bad save sequence that corrupts live tensors,

then you can see exactly your pattern: the first checkpoint triggers the failure, and everything after is flat.

AWS’s own warning about xm.save + the XLA checkpointing discussion are directly relevant to this “mid-loop save boundary” problem. (awsdocs-neuron.readthedocs-hosted.com)


Minimal “target state” for your situation

If your priority is: stable training + periodic evaluation, the most robust approach on Trn1 today is usually:

  • Train with NeuronTrainer (no eval)
  • Save model-only checkpoints by steps
  • Evaluate externally
  • Only attempt full “resume-able” checkpoints (optimizer/scheduler) once the above is stable, and only with a version set where ZeRO-1 checkpointing is known-good (awsdocs-neuron.readthedocs-hosted.com)

If you want the fastest pinpoint fix

The two most diagnostic pieces of info (no back-and-forth needed) are:

  • versions: optimum-neuron, neuronx-distributed, torch-neuronx, torch-xla
  • the exact checkpoint-save log lines around the first save (plus whether you launched with torchrun and how many ranks)

With that, the likely resolution is usually either a version pin/upgrade of the Neuron stack or a targeted change to the checkpointing configuration.

Wow! Thanks John6666, this is an exhaustive answer, and so shortly after I posted my questions. This community is amazing, thank you so much!

I read through it in detail, and it all makes sense to me. I will need some time to digest it all and follow your very good suggestions. Some points I had in mind as well to test today, but most things were new to me. So thank you again.

Some comments:

  1. My current setup is indeed: train on Trn1 (for 1-2 epochs), save the final model, download and evaluate it on my local machine while restarting training for another 1-2 epochs. This is somewhat annoying, but it works.

  2. Yes, I used torchrun with --nproc-per-node=2, so I will first follow your suggestions about running non-distributed on Trn1 and also using save_only_model=True. I will definitely check the versions now against your posted links, bug reports, and discussions. Thank you again.

  3. Using an absolute output directory is also a very good point. At the moment, only the directory gets created; it contains a subfolder named “runs” with content, but no checkpoints get saved there.

Thanks again, and now I have to go try and test a lot of things. :sweat_smile:

I prepared a long answer about what I found out, but as a new forum user I am only allowed to post 2 links. So I have to hold off on my answer until I am not “new” anymore; first, I have to find out what that means.

Ok, here’s my update on what I tried and found out.

First, the version numbers:

Neuron Driver: 2.24.7.0
Optimum Neuron: 0.4.4
NeuronX Distributed: 0.15.22404+1f27bddf
Torch NeuronX: 2.8.0.2.10.16998+e9bf8a50
Torch XLA: 2.8.1

I have tried the following things without success:

  1. absolute path for output_dir

  2. running torchrun without --nproc-per-node, so using only one core.

  3. save_only_model=True with save_strategy="steps" or save_strategy="epoch"

I always got the same outcome: no checkpoints for epochs, but checkpoints for steps, with grad_norm going to 0 after the first checkpoint is saved.

After this, I went quite deep into the source code of NeuronTrainer and Trainer and looked for differences. I found some interesting things that partially explain my outcomes:

  1. I think NeuronTrainer is missing the source code to actually save the model after each epoch. Here is a comparison.

In Trainer the model is saved at the end of the epoch:

The function _maybe_log_save_evaluate performs a save operation:

via _save_checkpoint:

However, this is not the case in NeuronTrainer:

However, the code for saving in steps is present in both.

In Trainer, again via the _maybe_log_save_evaluate function:

In NeuronTrainer via the _maybe_save_checkpoint function:

I called the _maybe_save_checkpoint method myself in a callback handler for on_epoch_end, which worked, but with the same outcome: grad_norm is 0 afterwards.
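For reference, that epoch-end workaround can be sketched like this. Duck-typed stand-ins keep it runnable anywhere: real code would subclass transformers.TrainerCallback, and _maybe_save_checkpoint is a private NeuronTrainer method whose exact signature may differ across Optimum-Neuron versions.

```python
# Hypothetical sketch: a callback that reuses the trainer's step-save path at
# each epoch end, since the epoch-save path appears to be missing.

class EpochSaveCallback:
    """Would subclass transformers.TrainerCallback in real code."""
    def __init__(self, trainer):
        self.trainer = trainer

    def on_epoch_end(self, args, state, control, **kwargs):
        # Trigger the same save path that save_strategy="steps" uses.
        self.trainer._maybe_save_checkpoint()
        return control

class FakeTrainer:
    """Stand-in counting save calls; the real method lives on NeuronTrainer."""
    def __init__(self):
        self.saves = 0

    def _maybe_save_checkpoint(self):
        self.saves += 1

trainer = FakeTrainer()
cb = EpochSaveCallback(trainer)
cb.on_epoch_end(args=None, state=None, control=None)
print(trainer.saves)  # 1
```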

I also checked which code NeuronTrainer uses for saving the model, and found that it is done here in the save_model function:

However, there are two ways to save the model, depending on self.trn_config.model_parallelism_enabled. If it is enabled, the model is saved via the model's save_pretrained method:

which, I think, uses the standard save_pretrained method from the NeuronModelMixin here:

which is torch.save.

If model parallelism is not enabled, which is the case for my setup, it goes via this path:

using xm.save instead.

As far as I can see, all demos and tutorials have model parallelism activated in some way, which makes sense for big models on AWS Trn1 instances. Unfortunately, when I tried to activate it for my model, I got the following error:

[rank1]: NotImplementedError: Model parallelism is only supported for models with a custom modeling implementation.

This is where I am stuck now.

So, in summary: evaluation is not implemented in NeuronTrainer, and saving after each epoch is also missing from the source code. I could implement the latter myself via a callback; however, I still run into the same problem of grad_norm being zero, which might have something to do with the difference between torch.save (saving sharded) and xm.save, which NeuronTrainer currently uses in the background in my setup.

Any further ideas?


Your symptoms line up with known limitations of the current NeuronTrainer stack on Trn1. The short version is:

  1. Evaluation during training is not supported
     NeuronTrainer intentionally disables evaluation. This isn't a bug; it's a design constraint of the XLA/Neuron execution model.

Workaround:
Train → save checkpoints → evaluate in a separate process (GPU or CPU).
This is the standard workflow for Trainium.


  2. Checkpoints not saving with save_strategy="epoch"
     This usually happens because:
  • epoch boundaries don't trigger the same way under XLA
  • only certain ranks write files
  • relative paths behave unpredictably in multi-process launches

Fix:
Use an absolute output directory and switch to step-based saving.


  1. savestrategy=“steps” works only when zero1=False, but training breaks after the first checkpoint
    This is the classic XLA “save boundary” issue:
  • mid‑training saves can corrupt tensors unless the save is coordinated correctly
  • ZeRO‑1 adds extra complexity to checkpointing
  • after the first save, gradients can collapse to zero if the replicas diverge

This is why your model stops learning after the first checkpoint.

Fix:
Start with the least fragile configuration:

  • zero_1 = False
  • save_only_model = True
  • async_save = False
  • use_xser = True
  • save every N steps (not epochs)

This avoids optimizer‑state sharding and reduces the risk of XLA divergence.

Once this is stable, you can reintroduce ZeRO‑1 if needed.


  4. The practical workflow that actually works on Trn1 today
     This is the setup most people end up using:

  1. Train with NeuronTrainer (no eval).

  2. Save model-only checkpoints by steps.

  3. Evaluate each checkpoint externally (GPU/local).

  4. Only attempt full optimizer-state checkpoints once the above is stable.

This avoids all the known XLA hazards while still giving you visibility into training progress.


If you want to debug further
The two most useful pieces of info are:

  • your versions of optimum-neuron, torch-neuronx, torch-xla, neuronx-distributed
  • the exact log lines around the first checkpoint save

With that, it’s usually possible to pinpoint whether you’re hitting:

  • a ZeRO‑1 checkpoint bug
  • an XLA save‑boundary issue
  • a version mismatch
  • or a rank‑divergence problem

I hope this helps!

Kind regards, Antony.