Simple NLP Example not working

Hi,
I’m looking for an example to learn how to use TPUs on Colab running PyTorch.
I was glad to find the Simple NLP Example, which unfortunately is not working.
Running it without modifications leads to the following error message in the last cell:

from accelerate import notebook_launcher

notebook_launcher(training_function)

---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-50-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

1 frames

/usr/local/lib/python3.7/dist-packages/torch_xla/__init__.py in <module>()
     99 from ._patched_functions import _apply_patches
    100 from .version import __version__
--> 101 import _XLAC
    102 
    103 

ImportError: /usr/local/lib/python3.7/dist-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl20is_contiguous_customENS_12MemoryFormatE


---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

I found a workaround description here which says:

... downgrading PyTorch to torch-1.8.2+cpu,
but that leads to another error message:

ProcessExitedException: process 0 terminated with signal SIGSEGV

What is necessary to run that example?
Do you know any other example that meets my requirements (Colab, TPUs, PyTorch) and runs?

Thanks for any comments


I’m guessing this is due to a version mismatch between PyTorch XLA and PyTorch (PyTorch XLA is installed with a version built for PyTorch 1.9, and Colab now uses PyTorch 1.10). I’ve asked for an updated link to install the proper version of PyTorch XLA, but in the meantime, you can solve the issue by downgrading PyTorch to 1.9.1 in the Colab you are running.
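As a sketch, the suggested downgrade is a single Colab cell (the version pin comes from this thread; check that it matches the torch_xla build installed in your runtime, and restart the runtime afterwards):

```shell
# Pin PyTorch to 1.9.1 so it matches the installed torch_xla build
# (version suggested in this thread; verify against your runtime).
pip install torch==1.9.1
```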

When I query the PyTorch version I get “1.9.0+cu102”.
What exactly am I supposed to do?

Ah, you’re right, I was confused. It’s working now with no change. Perhaps there was some maintenance issue?

Great, thanks

Weirdly enough, the notebook example is working for me with the normal-RAM TPU from Colab (if I change the PyTorch version to !pip3 install torch==1.9 from the new Colab default torch==1.10).

But I get the same SIGSEGV error when changing to the high-RAM TPU from Colab. The exact same code works with the low-RAM TPU but does not work with the high-RAM TPU. Probably an issue with Google Colab’s setup in the background…

Colab with traceback here: simple_nlp_example.ipynb - Google Drive

I too am seeing the SIGSEGV error when I run the Simple NLP example on an A100 pod:

---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
Input In [25], in <cell line: 3>()
      1 from accelerate import notebook_launcher
----> 3 notebook_launcher(training_function, num_processes=4)

File ~/envs/wav2vec/lib/python3.9/site-packages/accelerate/launchers.py:129, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
    126         launcher = PrepareForLaunch(function, distributed_type="MULTI_GPU")
    128         print(f"Launching training on {num_processes} GPUs.")
--> 129         start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    131 else:
    132     # No need for a distributed launch otherwise as it's either CPU or one GPU.
    133     if torch.cuda.is_available():

File ~/envs/wav2vec/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
    195     return context
    197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
    199     pass

File ~/envs/wav2vec/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
    138 if exitcode < 0:
    139     name = signal.Signals(-exitcode).name
--> 140     raise ProcessExitedException(
    141         "process %d terminated with signal %s" %
    142         (error_index, name),
    143         error_index=error_index,
    144         error_pid=failed_process.pid,
    145         exit_code=exitcode,
    146         signal_name=name
    147     )
    148 else:
    149     raise ProcessExitedException(
    150         "process %d terminated with exit code %d" %
    151         (error_index, exitcode),
   (...)
    154         exit_code=exitcode
    155     )

ProcessExitedException: process 0 terminated with signal SIGSEGV

That error was also preceded by a bunch of warnings BTW:

Launching training on 4 GPUs.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
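Following the warning’s own suggestion, a minimal sketch for silencing it is to set the variable in Python before the tokenizer is first used (i.e. before any fork happens):

```python
import os

# Disable tokenizers' Rust-level parallelism before any process fork,
# so the forked workers don't warn about inherited thread state.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Putting this in the first cell of the notebook, before any `transformers`/`tokenizers` import, is the safest place.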

The warnings are normal (though something we should try to fix in the script if possible); however, I’ll look at the notebook launcher error. Given we just made it work for the fastai integration (with no changes), I can say that it should work.

To make sure I can reproduce this: it’s an A100 system? Could you also run accelerate env and paste the output here as well, please? :slight_smile:

Hey Zach!! :slight_smile: Yes, it’s an A100.

$ accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.9.0
- Platform: Linux-5.4.0-1060-aws-x86_64-with-glibc2.27
- Python version: 3.9.9
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0+cu115 (True)
- `Accelerate` default config:
	Not found

Do I need to setup a default config?

I believe so; run accelerate config and answer the questions before trying again :smile:

Ok, I ran that and answered the questions, restarted, and re-ran, but still got the SIGSEGV error. Here’s the new accelerate env output:

- `Accelerate` version: 0.9.0
- Platform: Linux-5.4.0-1060-aws-x86_64-with-glibc2.27
- Python version: 3.9.9
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0+cu115 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- num_processes: 4
	- machine_rank: 0
	- num_machines: 1
	- main_process_ip: None
	- main_process_port: None
	- main_training_function: main
	- deepspeed_config: {}
	- fsdp_config: {}

Update: tried reconfiguring with FSDP on (and then chose the defaults presented thereafter), restarting, and re-running. Same SIGSEGV error.

Thanks for your patience with this! For the time being I’d recommend using !accelerate launch myscript instead of the notebook launcher. We’re working on a fix for that, but I can confirm that using it in script form works fine.
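As a sketch, the script-form workaround looks like this (the script name is a placeholder for the training code moved out of the notebook):

```shell
# Run the training as a standalone script through accelerate's CLI
# launcher, which spawns its own processes instead of forking the
# notebook kernel.
accelerate launch myscript.py
```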

Or don’t use the High VRAM instance.

Hello, please refer to notebook_launcher throws SIGSEGV error when using pretrained transformer models for NLP tasks · Issue #440 · huggingface/accelerate (github.com) for more context from my deep dive.

The good news is: move the from_pretrained call outside of the training loop and use the already-downloaded model, and it will run fine. The bug stems from that function specifically.

For keeping an eye on progress:


Thanks Zach! I’d put this issue on the backburner for a while but am glad to see this progress.
Seems that some colleagues of mine are able to use accelerate on High VRAM instances, but they aren’t using the notebook environment. (This is important because I do need to be able to run on High VRAM, so just “don’t do it” wasn’t going to help me. LOL).

EDIT: BTW, I’d been seeing this error on an AWS A100 too, not just Colab.

Should be fixed now; I just need to update the notebook, but the Accelerate-specific bugs should be solved outside of that specific cause.

Declare the model outside the training function, and pass it in as an argument instead. A PR with this fix will go live tomorrow as well.

The model should only be declared once on a TPU and then passed back and forth, rather than duplicated in a notebook/forked process.
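A minimal sketch of that pattern (the checkpoint name and the training body are placeholders, not the notebook’s exact code):

```python
from accelerate import notebook_launcher
from transformers import AutoModelForSequenceClassification

# Instantiate the model ONCE in the notebook process.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # hypothetical checkpoint
)

def training_function(model):
    # Build the dataloaders, Accelerator, optimizer, etc. here as in
    # the notebook -- but do NOT call from_pretrained() inside.
    ...

# The already-built model is passed into each launched process.
notebook_launcher(training_function, args=(model,))
```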

The notebook has been updated to show how to properly do this and it should be fixed. Please let me know if you still see this issue!
