SegFormer Semantic Segmentation CUDA error

Hello,

I am trying to use SegFormer, specifically the TF version. The PyTorch model works with the sample code given here, but in the TF code the following line:

model = TFSegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b4-finetuned-ade-512-512")

gives the following errors:

2022-08-02 22:06:38.581314: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-02 22:06:38.946033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2022-08-02 22:06:40.230678: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
2022-08-02 22:06:40.870359: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "--version"'
2022-08-02 22:06:40.870862: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "--version"'
2022-08-02 22:06:40.871431: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas.exe --version
2022-08-02 22:06:40.875190: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d1063d44" "-o" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d10649e0" "-arch=sm_86" "--warn-on-spills"'
2022-08-02 22:06:40.875526: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2022-08-02 22:06:41.366578: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2022-08-02 22:06:41.425569: I tensorflow/compiler/xla/service/service.cc:170] XLA service 0x27b18d09a10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-08-02 22:06:41.425819: I tensorflow/compiler/xla/service/service.cc:178]   StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti Laptop GPU, Compute Capability 8.6
2022-08-02 22:06:41.449017: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "--version"'
2022-08-02 22:06:41.449438: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "--version"'
2022-08-02 22:06:41.450899: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas.exe --version
2022-08-02 22:06:41.454757: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d10f140c" "-o" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d10f21d1" "-arch=sm_86" "--warn-on-spills"'
2022-08-02 22:06:41.475889: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "--version"'
2022-08-02 22:06:41.476139: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas.exe --version
2022-08-02 22:06:41.478855: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d10f772f" "-o" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d10f7fd2" "-arch=sm_86" "--warn-on-spills"'
2022-08-02 22:06:41.479058: W tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:640] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation.
Setting XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda  or modifying $PATH can be used to set the location of ptxas
This message will only be logged once.
2022-08-02 22:06:41.681546: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2022-08-02 22:06:41.704345: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "--version"'
2022-08-02 22:06:41.704644: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas.exe --version
2022-08-02 22:06:41.708784: E tensorflow/core/platform/windows/subprocess.cc:287] Call to CreateProcess failed. Error code: 2, command: '"ptxas.exe" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d112f3e6" "-o" "C:\Users\xxx\AppData\Local\Temp\/tempfile-LAPTOP-H8L6H592-37bc-3780-5e546d1130200" "-arch=sm_86" "--warn-on-spills"'
2022-08-02 22:06:41.709103: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:456] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

after which the script fails.

I have verified that TensorFlow sees the GPU, and CUDA works without problems with other TF-based code samples. I have plenty of experience with TensorFlow but have not seen this behaviour before. I am using transformers=4.21.0, cudatoolkit=11.2.2, cudnn=8.1.0.77 and tensorflow=2.9.1.

Any tips or suggestions? Cheers!

cc @amyeroberts

Hi @eppane - thanks for posting!

From the errors, it looks like ptxas isn’t on your PATH, so TensorFlow can’t find it.

Have you inspected $PATH to see if it’s included?
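
One quick way to check, as a minimal sketch: Python's standard shutil.which searches the PATH directories much as the shell does, so it returns None when ptxas cannot be found. The helper name find_on_path is made up for illustration and is not part of TensorFlow or transformers.

```python
import os
import shutil


def find_on_path(executable):
    """Return the full path of `executable` if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_on_path("ptxas")
    if location is None:
        # List the directories that were searched, one per line.
        print("ptxas was not found in any PATH directory:")
        for entry in os.environ.get("PATH", "").split(os.pathsep):
            if entry:
                print("  ", entry)
    else:
        print("ptxas found at", location)
```

If this prints a location, ptxas is reachable from the Python process. If not, the listed directories show where it was looked for.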

Hello @amyeroberts,

Thanks for the reply! My PATH looks as follows (excluding some Windows-related paths):

C:\Users\xxx\miniconda3\envs\transformers;C:\Users\xxx\miniconda3\envs\transformers\Library\mingw-w64\bin;C:\Users\xxx\miniconda3\envs\transformers\Library\usr\bin;C:\Users\xxx\miniconda3\envs\transformers\Library\bin;C:\Users\xxx\miniconda3\envs\transformers\Scripts;C:\Users\xxx\miniconda3\envs\transformers\bin;C:\Users\xxx\miniconda3\condabin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;

ptxas does not seem to be explicitly listed there. However, when I run the following code:

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = ["I've been waiting for a HuggingFace course my whole life.",
              "I hate this so much!"]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
print(outputs.logits)

My output is:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2022-08-12 01:12:50.298051: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-12 01:12:50.648730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2022-08-12 01:12:52.311273: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
(2, 2)
tf.Tensor(
[[-1.5614419  1.6128962]
 [ 4.16919   -3.3466015]], shape=(2, 2), dtype=float32)

So everything seems to be ok in that case. Also, if I run the following example code, which uses cuDNN explicitly:

# univariate stacked lstm example
from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
 
# split a univariate sequence
def split_sequence(sequence, n_steps):
	X, y = list(), list()
	for i in range(len(sequence)):
		# find the end of this pattern
		end_ix = i + n_steps
		# check if we are beyond the sequence
		if end_ix > len(sequence)-1:
			break
		# gather input and output parts of the pattern
		seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
		X.append(seq_x)
		y.append(seq_y)
	return array(X), array(y)
 
# define input sequence
raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 3
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# reshape from [samples, timesteps] into [samples, timesteps, features]
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))
# define model
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(50))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X, y, epochs=200, verbose=0)
# demonstrate prediction
x_input = array([70, 80, 90])
x_input = x_input.reshape((1, n_steps, n_features))
yhat = model.predict(x_input, verbose=0)
print(yhat)

I get:

2022-08-12 01:35:44.483534: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-12 01:35:44.805349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3080 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2022-08-12 01:35:47.137256: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
2022-08-12 01:35:47.886823: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
[[19.614902]]

I am not sure that ptxas is the core issue here; it may be a symptom of something else related to SegFormer. What do you think?
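
For reference, the only fatal (F) line in the first log comes from XLA's nvptx_compiler.cc, while the earlier ptxas failures are warnings that fall back to the driver. The log itself suggests modifying PATH or setting XLA_FLAGS=--xla_gpu_cuda_data_dir. A rough sketch of how that hint could be tried from Python is below. The function names are hypothetical, the candidate directories are assumptions (the first is the one the log reports searching, and CUDA_PATH may not be set at all), and whether any of them actually contains ptxas is unknown.

```python
import os
from pathlib import Path


def find_cuda_dir_with_ptxas(candidates):
    """Return the first candidate directory containing bin/ptxas(.exe), else None."""
    for candidate in candidates:
        bin_dir = Path(candidate) / "bin"
        if any((bin_dir / name).is_file() for name in ("ptxas.exe", "ptxas")):
            return Path(candidate)
    return None


def point_tensorflow_at(cuda_dir):
    """Expose cuda_dir to TensorFlow/XLA through PATH and XLA_FLAGS."""
    cuda_dir = Path(cuda_dir)
    os.environ["PATH"] = str(cuda_dir / "bin") + os.pathsep + os.environ.get("PATH", "")
    # Append rather than overwrite, in case XLA_FLAGS already holds other flags.
    existing = os.environ.get("XLA_FLAGS", "")
    os.environ["XLA_FLAGS"] = f"{existing} --xla_gpu_cuda_data_dir={cuda_dir}".strip()


if __name__ == "__main__":
    # Assumed candidate locations; adjust to wherever a full CUDA toolkit lives.
    candidates = [
        r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2",
        os.environ.get("CUDA_PATH", ""),
    ]
    cuda_dir = find_cuda_dir_with_ptxas([c for c in candidates if c])
    if cuda_dir is None:
        print("No ptxas found in the candidate directories.")
    else:
        point_tensorflow_at(cuda_dir)
        print("Using CUDA directory:", cuda_dir)
        # Import TensorFlow only after the environment is set up:
        # import tensorflow as tf
```

The environment variables have to be set before tensorflow is imported. A directory path containing spaces may not survive XLA_FLAGS parsing; that part is untested here.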