Accelerate / TPU with bigger models: process 0 terminated with signal SIGKILL

Hello all,
I’ve written a chatbot that works fine in a Trainer / PyTorch-based setup on one GPU and with different models.

I tested with distilbert-base-uncased, bert-large-uncased, roberta-base, roberta-large, microsoft/deberta-large.

After making the necessary modifications to run the program with Accelerate on 8 TPU cores, it works fine for distilbert-base-uncased. With roberta-base the program runs painfully slowly, and with all other (bigger?) models it terminates. A rough sketch of how I launch the training is shown below, followed by the full error output.
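This is only a minimal sketch of the launch, not my actual code; training_function, the model, and the DataLoader are simplified stand-ins for the chatbot setup:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, notebook_launcher

def training_function():
    # Each of the 8 TPU processes runs this function after the fork.
    accelerator = Accelerator()

    # Stand-ins for the real chatbot model (e.g. bert-large-uncased) and data.
    model = torch.nn.Linear(128, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

    # Accelerate moves everything to the XLA device and shards the batches.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)   # instead of loss.backward()
        optimizer.step()

# Forks 8 processes when run on a TPU runtime.
notebook_launcher(training_function)
```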

Launching a training on 8 TPU cores.

loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
...
   "LABEL_99": 99
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-large-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/1d959166dd7e047e57ea1b2d9b7b9669938a7e90c5e37a03961ad9f15eaea17f.fea64cd906e3766b04c92397f9ad3ff45271749cbe49829a079dd84e34c1697d

---------------------------------------------------------------------------

ProcessExitedException                    Traceback (most recent call last)

<ipython-input-54-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

3 frames

/usr/local/lib/python3.7/dist-packages/accelerate/notebook_launcher.py in notebook_launcher(function, args, num_processes, use_fp16, use_port)
     67             launcher = PrepareForLaunch(function, distributed_type="TPU")
     68             print(f"Launching a training on {num_processes} TPU cores.")
---> 69             xmp.spawn(launcher, args=args, nprocs=num_processes, start_method="fork")
     70         else:
     71             # No need for a distributed launch otherwise as it's either CPU or one GPU.

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    392         join=join,
    393         daemon=daemon,
--> 394         start_method=start_method)
    395 
    396 

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    134                     error_pid=failed_process.pid,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 0 terminated with signal SIGKILL

I tested batch_size values down to 1 and reduced max_length to 32. No effect.

This case seems similar to other TPU memory issues.
Is there anything I can change in my setup or settings, or is Accelerate on TPU currently not compatible with bigger models?

You won’t be able to use large models on Colab, as those instances don’t provide enough RAM to properly load the model; you will need to go through a GCP instance for that.
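To illustrate the RAM issue, here is a rough back-of-the-envelope calculation (the numbers are assumptions, not measurements: fp32 weights, roughly 336M parameters for bert-large-uncased, about 12 GB of host RAM on a standard Colab runtime, and each of the 8 forked processes loading its own copy of the checkpoint):

```python
# Back-of-the-envelope only; the figures are assumptions, not measurements.
params = 336e6          # bert-large-uncased has roughly 336M parameters
bytes_per_param = 4     # fp32 weights in pytorch_model.bin
copy_gb = params * bytes_per_param / 1024**3
print(f"one copy of the weights: ~{copy_gb:.1f} GB")                        # ~1.3 GB

processes = 8           # notebook_launcher forks 8 TPU worker processes
print(f"weights alone across all workers: ~{processes * copy_gb:.0f} GB")   # ~10 GB
# Add the temporary state dict held during torch.load, the XLA runtime and the
# training data, and a ~12 GB Colab instance runs out of host memory, so the
# kernel kills the worker with SIGKILL; a GCP VM with more RAM avoids this.
```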