Accelerate / TPU with bigger models: process 0 terminated with signal SIGKILL

Hello all,
I’ve written a chatbot that works fine in a Trainer / PyTorch based setup on a single GPU with different models.

I tested with distilbert-base-uncased, bert-large-uncased, roberta-base, roberta-large, microsoft/deberta-large.

After making the necessary modifications to run the program with Accelerate on 8 TPU cores, it works fine for distilbert-base-uncased. With roberta-base the program runs extremely slowly, and for all other (bigger?) models it terminates with the following error message:

Launching a training on 8 TPU cores.

loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
...
   "LABEL_99": 99
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-large-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/1d959166dd7e047e57ea1b2d9b7b9669938a7e90c5e37a03961ad9f15eaea17f.fea64cd906e3766b04c92397f9ad3ff45271749cbe49829a079dd84e34c1697d

---------------------------------------------------------------------------

ProcessExitedException                    Traceback (most recent call last)

<ipython-input-54-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

3 frames

/usr/local/lib/python3.7/dist-packages/accelerate/notebook_launcher.py in notebook_launcher(function, args, num_processes, use_fp16, use_port)
     67             launcher = PrepareForLaunch(function, distributed_type="TPU")
     68             print(f"Launching a training on {num_processes} TPU cores.")
---> 69             xmp.spawn(launcher, args=args, nprocs=num_processes, start_method="fork")
     70         else:
     71             # No need for a distributed launch otherwise as it's either CPU or one GPU.

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    392         join=join,
    393         daemon=daemon,
--> 394         start_method=start_method)
    395 
    396 

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    134                     error_pid=failed_process.pid,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 0 terminated with signal SIGKILL

I tested different batch sizes all the way down to 1 and reduced max_length down to 32. No effect.

This case seems similar to known TPU memory issues.
Is there anything I can modify or configure to make this work, or is Accelerate on TPU currently just not compatible with bigger models?
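
For reference, here is roughly how I launch the training. This is only a simplified sketch: the dataset is replaced by dummy data, the hyperparameters are placeholders, and building the model once outside the training function (so the 8 forked processes can reuse it) is a memory-saving pattern I have seen suggested for TPU notebooks, not necessarily my exact code.

# Simplified sketch of the launch setup (dummy data, placeholder hyperparameters).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from accelerate import Accelerator, notebook_launcher

checkpoint = "bert-large-uncased"  # distilbert-base-uncased works; this one gets SIGKILLed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Dummy data standing in for the real chatbot dataset.
texts = ["hello there", "how can I help you?"] * 8
encodings = tokenizer(texts, truncation=True, max_length=32, padding="max_length", return_tensors="pt")
labels = torch.zeros(len(texts), dtype=torch.long)
dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"], labels)

# Built once in the notebook process; the 8 TPU processes are forked from it.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=100)

def training_function():
    accelerator = Accelerator()
    train_dataloader = DataLoader(dataset, shuffle=True, batch_size=1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    prepared_model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    prepared_model.train()
    for input_ids, attention_mask, batch_labels in train_dataloader:
        outputs = prepared_model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()

notebook_launcher(training_function)  # spawns 8 TPU processes via fork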

You won’t be able to use large models on Colab as they don’t give enough RAM on those instances to properly load the model; you will need to go through a GCP instance for that.
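
As a rough back-of-the-envelope check (just a sketch: it assumes float32 weights and that each of the 8 forked processes ends up holding its own copy of the weights while loading):

# Rough estimate of host RAM needed just for the bert-large-uncased weights on a Colab TPU runtime.
num_parameters = 336_000_000   # bert-large-uncased has roughly 336M parameters
bytes_per_param = 4            # float32
num_processes = 8              # one process per TPU core with notebook_launcher

per_copy_gb = num_parameters * bytes_per_param / 1024**3
print(f"one copy of the weights: ~{per_copy_gb:.2f} GB")                  # ~1.25 GB
print(f"{num_processes} copies: ~{num_processes * per_copy_gb:.1f} GB")   # ~10 GB, before counting
                                                                          # data, activations, optimizer state

That is already close to the total RAM of a standard Colab instance, which is why the process gets killed with SIGKILL.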

Do you know if it is possible at all to use Accelerate with GPT-Neo models in Google Colab? I even tried GPT-Neo-125M and it keeps failing in Google Colab using your notebook example: notebooks/simple_nlp_example.ipynb at main · huggingface/notebooks · GitHub
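
The only thing I change is the checkpoint; a sketch of how that swap would look (variable names and num_labels are placeholders, and the pad-token lines are an extra step one would need anyway, since GPT-Neo’s tokenizer ships without a pad token):

# Sketch of swapping the notebook's checkpoint for GPT-Neo (names and num_labels are placeholders).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token   # GPT-Neo's tokenizer has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id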

The error message is as below:

Launching a training on 8 TPU cores.
loading configuration file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/29380fef22a43cbfb3d3a6c8e2f4fd951459584d87c34e4621b30580a54aca84.f0f7ebddfc6e15a23ac33e7fa95cd8cca05edf87cc74f9e3be7905f538a59762
Model config GPTNeoConfig {
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      6
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 12,
  "num_layers": 12,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.19.0",
  "use_cache": true,
  "vocab_size": 50257,
  "window_size": 256
}

loading weights file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/b0ace3b93ace62067a246888f1e54e2d3ec20807d4d3e27ac602eef3b7091c0b.6525df88f1d5a2d33d95ce2458ef6af9658fe7d1393d6707e0e318779ccc68ff
Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:331 : Failed to retrieve mesh configuration: Connection reset by peer (14)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 322, in _start_fn
    _setup_replication()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 314, in _setup_replication
    device = xm.xla_device()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 231, in xla_device
    devkind=devkind if devkind is not None else None)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:331 : Failed to retrieve mesh configuration: Connection reset by peer (14)
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-41-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

3 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    134                     error_pid=failed_process.pid,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 0 terminated with signal SIGSEGV