Accelerate / TPU with bigger models: process 0 terminated with signal SIGKILL

Hello all,
I’ve written a chatbot that works fine in a Trainer / PyTorch-based environment on a single GPU and with different models.

I tested with distilbert-base-uncased, bert-large-uncased, roberta-base, roberta-large, microsoft/deberta-large.

After making the necessary modifications to run the program with Accelerate on 8 TPU cores, it works fine with distilbert-base-uncased. With roberta-base the program runs in slooow motion, and with all other (bigger?) models it terminates with the following error message:

Launching a training on 8 TPU cores.

loading configuration file from cache at /root/.cache/huggingface/transformers/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
Model config BertConfig {
  "architectures": […],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    …
  },
  "label2id": {
    …
    "LABEL_99": 99
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file from cache at /root/.cache/huggingface/transformers/1d959166dd7e047e57ea1b2d9b7b9669938a7e90c5e37a03961ad9f15eaea17f.fea64cd906e3766b04c92397f9ad3ff45271749cbe49829a079dd84e34c1697d


ProcessExitedException                    Traceback (most recent call last)

<ipython-input-54-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
----> 3 notebook_launcher(training_function)

3 frames

/usr/local/lib/python3.7/dist-packages/accelerate/ in notebook_launcher(function, args, num_processes, use_fp16, use_port)
     67             launcher = PrepareForLaunch(function, distributed_type="TPU")
     68             print(f"Launching a training on {num_processes} TPU cores.")
---> 69             xmp.spawn(launcher, args=args, nprocs=num_processes, start_method="fork")
     70         else:
     71             # No need for a distributed launch otherwise as it's either CPU or one GPU.

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/ in spawn(fn, args, nprocs, join, daemon, start_method)
    392         join=join,
    393         daemon=daemon,
--> 394         start_method=start_method)

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/ in start_processes(fn, args, nprocs, join, daemon, start_method)
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/ in join(self, timeout)
    134           ,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 0 terminated with signal SIGKILL

I tested with different batch sizes down to 1 and reduced max_length down to 32. No effect.
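A rough back-of-the-envelope estimate suggests why shrinking the batch has no effect here. The numbers below are assumptions (roughly 336M parameters for bert-large-uncased, fp32 weights, and a generous ×10 factor for per-layer activation buffers): the fixed weight memory dwarfs the activation memory at batch_size=1, max_length=32, so the weights dominate no matter how small the batch gets.

```python
# Order-of-magnitude sketch; all sizes are assumed/approximate, not measured.
PARAMS = 336_000_000      # assumed parameter count for bert-large-uncased
BYTES_PER_FP32 = 4

# Weight memory is a fixed cost, independent of batch size and sequence length.
weights_gb = PARAMS * BYTES_PER_FP32 / 1e9

# Activation memory scales roughly with batch_size * seq_len * hidden per layer;
# the factor of 10 is a deliberately generous allowance for intermediate buffers.
batch_size, seq_len, hidden, layers = 1, 32, 1024, 24
acts_gb = batch_size * seq_len * hidden * layers * 10 * BYTES_PER_FP32 / 1e9

print(f"weights: ~{weights_gb:.2f} GB, activations: ~{acts_gb:.3f} GB")
```

So reducing batch_size and max_length shaves off megabytes while the gigabyte-scale weight copies stay untouched.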

This case seems to be similar to TPU memory issues.
Is there anything I can modify or configure, or is Accelerate on TPU currently not compatible with bigger models?

You won’t be able to use large models on Colab, as those instances don’t provide enough RAM to properly load the model; you will need to go through a GCP instance for that.
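A quick calculation illustrates the problem. The numbers are assumptions (roughly 336M parameters for bert-large-uncased, fp32 weights, 8 processes forked by xmp.spawn, and on the order of 12 GB of usable RAM on a free Colab instance), but the conclusion is robust: each forked process materializes its own copy of the weights in host RAM before they ever reach the TPU.

```python
# Back-of-the-envelope: why 8 forked processes exhaust Colab host RAM.
# All figures below are assumed/approximate for illustration.
params = 336_000_000         # assumed size of bert-large-uncased
bytes_per_param = 4          # fp32
n_processes = 8              # one process per TPU core (fork start method)

per_copy_gb = params * bytes_per_param / 1e9
total_gb = per_copy_gb * n_processes

print(f"~{per_copy_gb:.2f} GB per process, ~{total_gb:.1f} GB across 8 cores")
```

That lands around 10+ GB for the model weights alone, before activations, optimizer state, or anything else — so the kernel OOM-kills one of the workers, which surfaces as the SIGKILL in the traceback above.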