M2 Max GPU utilization steadily dropping while running inference with huggingface distilbert-base-cased

MacBook Pro M2 Max 96 GB, macOS 13.3, tensorflow-macos 2.9.0, tensorflow-metal 0.5.0

Repro Code:

from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset
from tqdm import tqdm
import numpy as np

imdb = load_dataset('imdb')
sentences = imdb['train']['text'][:500]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

for i, sentence in tqdm(enumerate(sentences)):
  inputs = tokenizer(sentence, truncation=True, return_tensors='tf')
  output = model(inputs).logits
  pred = np.argmax(output.numpy(), axis=1)

  if i % 100 == 0:
    print(f"len(input_ids): {inputs['input_ids'].shape[-1]}")

Inference becomes excruciatingly slow after roughly the 300-400th record. GPU utilization even dropped below 2% (less than the WindowServer process). Here is the output:

Metal device set to: Apple M2 Max

systemMemory: 96.00 GB
maxCacheSize: 36.00 GB

3it [00:00, 10.87it/s]
len(input_ids): 391
101it [00:13,  6.38it/s]
len(input_ids): 215
201it [00:34,  4.78it/s]
len(input_ids): 237
301it [00:55,  4.26it/s]
len(input_ids): 256
401it [01:54,  1.12it/s]
len(input_ids): 55
500it [03:40,  2.27it/s]

I am aware this loop looks wrong:

  1. Use batches for the GPU.
  2. Use the CPU if you want to process one sentence at a time.
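
For reference, point 1 can be sketched as follows. The batching itself is plain Python; the commented-out usage assumes the same `tokenizer`/`model` objects as in the repro above (padding each batch to a common length so the tensors have a single shape):

```python
def chunked(seq, batch_size):
    """Yield successive fixed-size batches from a sequence."""
    for i in range(0, len(seq), batch_size):
        yield seq[i:i + batch_size]

# Hedged usage with the tokenizer/model from the repro above:
# for batch in chunked(sentences, 32):
#     inputs = tokenizer(batch, truncation=True, padding=True, return_tensors='tf')
#     logits = model(inputs).logits
#     preds = np.argmax(logits.numpy(), axis=1)

print(list(chunked(list(range(5)), 2)))  # → [[0, 1], [2, 3], [4]]
```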

But it is still unsettling to watch the GPU utilization decay, because I don't think this happens on Colab (or on plain Linux with CUDA). So it seems specific to Apple Silicon / Metal.

I just wonder what the root cause could be. If a bug is indeed lurking, it may rear its head during longer, larger real training runs.

I logged an issue for TF: M2 GPU utilization decays from 50% to 10% in non batched inference for huggingface distilbert-base-cased · Issue #60271 · tensorflow/tensorflow · GitHub

I also posted a “workaround” there.

The sad news: I did a real training trial, fine-tuning DistilBERT (with TF). I used batch_size=128 (which could be the culprit); the first 200 steps went OK… but then it started hitting this error:

Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Internal Error (0000000e:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0xf259be4d0>
    label = <none> 
    device = <AGXG14CDevice: 0x1196d8200>
        name = Apple M2 Max 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x2d9e5e800>
        label = <none> 
        device = <AGXG14CDevice: 0x1196d8200>
            name = Apple M2 Max 
    retainedReferences = 1

These errors seemed ominous. I waited until it completed 1 epoch, which was >3x slower than a T4 (on Colab), with quite bad accuracy (possibly due to the large batch_size, which means fewer steps per epoch).

I plan to reduce the batch_size and see when this goes away. Even if a small batch_size is desirable for my dataset size and fine-tuning, it is still worrying to see it fail at a batch_size of "only" 128. I got 96 GB precisely to explore larger batch sizes… if I have to reduce it, it may be cheaper to just go with Nvidia (at the cost of mobility and power consumption).

Update: I tracked this down as well — it likewise appears to be caused by unequal input lengths during training. I switched to max-length padding so every batch contains exactly 512 tokens, and now I am getting performance on par with the T4 and >90% GPU utilization. This really points to a TF-Metal-specific bug. I will try a larger batch size next and get my money's worth — the M2 Max isn't cheap.
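
The fixed-length workaround boils down to the following. The tokenizer call (commented out) uses standard Hugging Face arguments; the `pad_to_fixed` helper is a hypothetical illustration of what `padding='max_length'` does to each token-id list:

```python
def pad_to_fixed(ids, max_len=512, pad_id=0):
    """Truncate or pad a token-id list to a fixed length, so every
    batch has an identical tensor shape (no shape churn for Metal)."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Equivalent tokenizer call used in the workaround (standard HF args):
# inputs = tokenizer(batch, truncation=True, padding='max_length',
#                    max_length=512, return_tensors='tf')
```

With every batch shaped (batch_size, 512), TF-Metal no longer recompiles/reallocates for varying sequence lengths, which matches the recovered >90% utilization.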