Is GPU acceleration supported on Mac devices, or will it be?

Dear all Developers,

Apple has just announced the TensorFlow-Metal package for GPU/NPU acceleration on Mac devices. I am therefore wondering whether it is feasible to solve NLP tasks with HuggingFace transformers through TensorFlow-macOS and TensorFlow-Metal.

To find out, I installed TensorFlow-macOS, TensorFlow-Metal, and HuggingFace on my local device. Then I ran some test code to check that everything was installed correctly, and here is what I got.

Everything seems to work fine. However, I get the following error when I attempt to fine-tune a BERT model.

InvalidArgumentError: Cannot assign a device for operation tf_bert_for_sequence_classification/bert/embeddings/Gather: Could not satisfy explicit device specification '' because the node {{colocation_node tf_bert_for_sequence_classification/bert/embeddings/Gather}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. 
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU 
Sqrt: GPU CPU 
UnsortedSegmentSum: CPU 
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
StridedSlice: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
_Arg: GPU CPU 
ResourceScatterAdd: GPU CPU 
Unique: CPU 
AddV2: GPU CPU 
ResourceGather: GPU CPU 
Const: GPU CPU 

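A note on reading the colocation debug info above: each line names an op type followed by the device types that have a kernel for it, so the ops listed with only `CPU` are the ones that cannot be placed on the GPU. A minimal sketch in plain Python that picks them out, assuming the log keeps this `OpName: DEVICE ...` layout (the helper name `cpu_only_ops` is hypothetical, and the sample text is a few lines copied from the error above):

```python
def cpu_only_ops(debug_info: str) -> list:
    """Return op types whose listed device types do not include GPU."""
    ops = []
    for line in debug_info.splitlines():
        name, sep, devices = line.partition(":")
        name = name.strip()
        # Skip anything that is not of the form "OpName: DEVICE [DEVICE ...]".
        if not sep or not name or " " in name:
            continue
        supported = devices.split()
        only_device_types = all(d in ("CPU", "GPU") for d in supported)
        if supported and only_device_types and "GPU" not in supported:
            ops.append(name)
    return ops


# A few lines copied from the colocation debug info in the error message.
sample = """\
RealDiv: GPU CPU
Sqrt: GPU CPU
UnsortedSegmentSum: CPU
ResourceScatterAdd: GPU CPU
Unique: CPU
ResourceGather: GPU CPU
"""

print(cpu_only_ops(sample))  # ['UnsortedSegmentSum', 'Unique']
```

For this trace the CPU-only ops are `UnsortedSegmentSum` and `Unique`, which fits the message that the colocation group only supports `[CPU]` while the GPU was requested.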
So I checked whether TensorFlow detected the GPU correctly, and here is what I got.

tf.test.is_gpu_available()
WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-06-29 01:56:25.862829: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-06-29 01:56:25.862893: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Out[2]: True
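As the deprecation warning above suggests, the same check can be done with `tf.config.list_physical_devices('GPU')`. A small sketch of that, where the helper name `visible_gpus` is hypothetical and the guarded import is only there so the snippet still runs where TensorFlow is not installed:

```python
try:
    import tensorflow as tf
except ImportError:  # assumption: fall back gracefully if TensorFlow is absent
    tf = None


def visible_gpus() -> list:
    """Return the names of the GPU devices TensorFlow can see (empty if none)."""
    if tf is None:
        return []
    # Non-deprecated replacement for tf.test.is_gpu_available().
    return [device.name for device in tf.config.list_physical_devices("GPU")]


print(visible_gpus())
```

A non-empty list only shows that a GPU device is registered; it does not show that every op in a given model has a GPU kernel.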

It looks like HuggingFace is unable to detect the proper device. Is there any way to solve this issue, or will it be solved in the near future?

I appreciate your help and look forward to your kind assistance.

Sincerely,

hawkiyc

It doesn’t look like whatever device TF recognizes is actually usable. Perhaps that is why HF can’t leverage it? If you can’t allocate tensors to the GPU, there is little scope for executing ops there.

Dear Neel,

I have tested neural network models with TensorFlow only, and everything works fine. So, I think TF indeed detects and uses my GPU.

Sincerely,

hawkiyc

cc @Rocketknight1 who might have some insights on running TF + HF on M1 chips

Unfortunately, it’s not really a conclusive test. I admit I haven’t used an M1 device, but as long as the framework can use a device, I don’t see why Huggingface wouldn’t be able to.

Could you try doing a speed comparison for training models to ensure it’s not running on CPU?

Another thing - in your image

Metal Device set to: AMD Radeon Pro 5500M

Am I correct in assuming that you want to use an AMD GPU rather than the M1 processor?

Dear Neel,

I am sorry, I forgot to mention that I am using a 16" MacBook Pro. This machine has an Intel CPU and an AMD GPU, not an M1.

I built a CNN model to test the device with the cifar10 dataset.

Here is the time taken for each epoch with the AMD GPU:

112] Plugin optimizer for device_type GPU is enabled.
1667/1667 [==============================] - 49s 27ms/step - loss: 1.3462 - sparse_categorical_accuracy: 0.5124
Epoch 2/5
1667/1667 [==============================] - 47s 28ms/step - loss: 0.8867 - sparse_categorical_accuracy: 0.6872
Epoch 3/5
1667/1667 [==============================] - 50s 30ms/step - loss: 0.7183 - sparse_categorical_accuracy: 0.7503
Epoch 4/5
1667/1667 [==============================] - 56s 33ms/step - loss: 0.6007 - sparse_categorical_accuracy: 0.7901
Epoch 5/5
1667/1667 [==============================] - 53s 32ms/step - loss: 0.5020 - sparse_categorical_accuracy: 0.8227
Out[3]: <tensorflow.python.keras.callbacks.History at 0x169216af0>

And here is the same for the CPU:

Epoch 1/5
1667/1667 [==============================] - 285s 170ms/step - loss: 1.3755 - sparse_categorical_accuracy: 0.5024
Epoch 2/5
1667/1667 [==============================] - 283s 170ms/step - loss: 0.9136 - sparse_categorical_accuracy: 0.6782
Epoch 3/5
  23/1667 [..............................] - ETA: 4:38 - loss: 0.7261 - sparse_categorical_accuracy: 0.7333
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
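To put a number on the difference, the per-epoch times in the two logs above can be averaged. A quick sketch of the arithmetic, using only the completed epochs (five on the GPU, two on the CPU before the run was interrupted); the variable names are just for illustration:

```python
from statistics import mean

gpu_epoch_seconds = [49, 47, 50, 56, 53]  # five completed epochs on the AMD GPU
cpu_epoch_seconds = [285, 283]            # two completed epochs on the CPU

gpu_mean = mean(gpu_epoch_seconds)
cpu_mean = mean(cpu_epoch_seconds)
speedup = cpu_mean / gpu_mean

print(f"GPU: {gpu_mean:.0f}s/epoch, CPU: {cpu_mean:.0f}s/epoch, speedup: {speedup:.1f}x")
# GPU: 51s/epoch, CPU: 284s/epoch, speedup: 5.6x
```

That is roughly a 5.6x speedup for this CNN, which suggests the GPU run is not silently falling back to the CPU.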

Sincerely,

ahh, so you are using tensorflow-metal to accelerate the training on AMD GPUs.

Well, I am sure it’s not very different from the standard TensorFlow we use; unfortunately, it doesn’t seem to be open source.

I am not from Huggingface, so I can’t say for certain, but a fork that is not open source is quite difficult to work with; hence the lack of HF + TF-Metal integration. It doesn’t make sense for them to invest in a framework used by so few. You would have to do the customizations yourself and figure out how to do that, which I think is a pretty daunting task.

Someone else might be able to explain better I suppose; I don’t know much about the Mac ecosystem unfortunately :hugs:

Dear All,

I have built another BERT model using only TensorFlow-Hub, and I got essentially the same error as before.

InvalidArgumentError: Cannot assign a device for operation AdamWeightDecay/AdamWeightDecay/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU 
ResourceGather: GPU CPU 
AddV2: GPU CPU 
Sqrt: GPU CPU 
Unique: CPU 
ResourceScatterAdd: GPU CPU 
UnsortedSegmentSum: CPU 
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
Identity: GPU CPU 
StridedSlice: GPU CPU 
_Arg: GPU CPU 
Const: GPU CPU 

So, this issue seems to come from the TensorFlow side rather than HuggingFace, since it also occurs with TensorFlow-Hub alone. I will report it on the Apple Developer Forum. Anyway, thank you all, good fellows.

Sincerely,

hawkiyc
