Accelerate / TPU with bigger models: process 0 terminated with signal SIGKILL

Hello all,
I’ve written a chatbot that works fine in a Trainer / PyTorch based setup on a single GPU with different models.

I tested with distilbert-base-uncased, bert-large-uncased, roberta-base, roberta-large, microsoft/deberta-large.

After making the necessary modifications to run the program with Accelerate on 8 TPU cores, it works fine for distilbert-base-uncased. With roberta-base the program runs extremely slowly, and for all other (bigger?) models it terminates with the following error message:

Launching a training on 8 TPU cores.

loading configuration file https://huggingface.co/bert-large-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/1cf090f220f9674b67b3434decfe4d40a6532d7849653eac435ff94d31a4904c.1d03e5e4fa2db2532c517b2cd98290d8444b237619bd3d2039850a6d5e86473d
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
...
   "LABEL_99": 99
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-large-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/1d959166dd7e047e57ea1b2d9b7b9669938a7e90c5e37a03961ad9f15eaea17f.fea64cd906e3766b04c92397f9ad3ff45271749cbe49829a079dd84e34c1697d

---------------------------------------------------------------------------

ProcessExitedException                    Traceback (most recent call last)

<ipython-input-54-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

3 frames

/usr/local/lib/python3.7/dist-packages/accelerate/notebook_launcher.py in notebook_launcher(function, args, num_processes, use_fp16, use_port)
     67             launcher = PrepareForLaunch(function, distributed_type="TPU")
     68             print(f"Launching a training on {num_processes} TPU cores.")
---> 69             xmp.spawn(launcher, args=args, nprocs=num_processes, start_method="fork")
     70         else:
     71             # No need for a distributed launch otherwise as it's either CPU or one GPU.

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    392         join=join,
    393         daemon=daemon,
--> 394         start_method=start_method)
    395 
    396 

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    134                     error_pid=failed_process.pid,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 0 terminated with signal SIGKILL

I tested different batch sizes all the way down to 1 and reduced max_length down to 32. No effect.

This case seems similar to known TPU memory issues.
Is there anything I can modify or configure to make this work, or is Accelerate on TPU currently just not compatible with bigger models?
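
For reference, here is roughly how I launch the training. This is only a simplified sketch: the dataset is replaced by dummy data, the hyperparameters are placeholders, and building the model once outside the training function (so the 8 forked processes can reuse it) is a memory-saving pattern I have seen suggested for TPU notebooks, not necessarily my exact code.

# Simplified sketch of the launch setup (dummy data, placeholder hyperparameters).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from accelerate import Accelerator, notebook_launcher

checkpoint = "bert-large-uncased"  # distilbert-base-uncased works; this one gets SIGKILLed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Dummy data standing in for the real chatbot dataset.
texts = ["hello there", "how can I help you?"] * 8
encodings = tokenizer(texts, truncation=True, max_length=32, padding="max_length", return_tensors="pt")
labels = torch.zeros(len(texts), dtype=torch.long)
dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"], labels)

# Built once in the notebook process; the 8 TPU processes are forked from it.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=100)

def training_function():
    accelerator = Accelerator()
    train_dataloader = DataLoader(dataset, shuffle=True, batch_size=1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    prepared_model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    prepared_model.train()
    for input_ids, attention_mask, batch_labels in train_dataloader:
        outputs = prepared_model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()

notebook_launcher(training_function)  # spawns 8 TPU processes via fork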

You won’t be able to use large models on Colab as they don’t give enough RAM on those instances to properly load the model; you will need to go through a GCP instance for that.
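
As a rough back-of-the-envelope check (just a sketch: it assumes float32 weights and that each of the 8 forked processes ends up holding its own copy of the weights while loading):

# Rough estimate of host RAM needed just for the bert-large-uncased weights on a Colab TPU runtime.
num_parameters = 336_000_000   # bert-large-uncased has roughly 336M parameters
bytes_per_param = 4            # float32
num_processes = 8              # one process per TPU core with notebook_launcher

per_copy_gb = num_parameters * bytes_per_param / 1024**3
print(f"one copy of the weights: ~{per_copy_gb:.2f} GB")                  # ~1.25 GB
print(f"{num_processes} copies: ~{num_processes * per_copy_gb:.1f} GB")   # ~10 GB, before counting
                                                                          # data, activations, optimizer state

That is already close to the total RAM of a standard Colab instance, which is why the process gets killed with SIGKILL.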

Do you know if it is possible at all to use Accelerate with GPT-Neo models in Google Colab? I even tried GPT-Neo-125M and it keeps failing in Google Colab using your notebook example: notebooks/simple_nlp_example.ipynb at main · huggingface/notebooks · GitHub
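
The only thing I change is the checkpoint; a sketch of how that swap would look (variable names and num_labels are placeholders, and the pad-token lines are an extra step one would need anyway, since GPT-Neo’s tokenizer ships without a pad token):

# Sketch of swapping the notebook's checkpoint for GPT-Neo (names and num_labels are placeholders).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token   # GPT-Neo's tokenizer has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id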

The error message is as below:

Launching a training on 8 TPU cores.
loading configuration file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/29380fef22a43cbfb3d3a6c8e2f4fd951459584d87c34e4621b30580a54aca84.f0f7ebddfc6e15a23ac33e7fa95cd8cca05edf87cc74f9e3be7905f538a59762
Model config GPTNeoConfig {
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      6
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 12,
  "num_layers": 12,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.19.0",
  "use_cache": true,
  "vocab_size": 50257,
  "window_size": 256
}

loading weights file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/b0ace3b93ace62067a246888f1e54e2d3ec20807d4d3e27ac602eef3b7091c0b.6525df88f1d5a2d33d95ce2458ef6af9658fe7d1393d6707e0e318779ccc68ff
Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:331 : Failed to retrieve mesh configuration: Connection reset by peer (14)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 322, in _start_fn
    _setup_replication()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 314, in _setup_replication
    device = xm.xla_device()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 231, in xla_device
    devkind=devkind if devkind is not None else None)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:331 : Failed to retrieve mesh configuration: Connection reset by peer (14)
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-41-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

3 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    134                     error_pid=failed_process.pid,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 0 terminated with signal SIGSEGV