MacBook Pro M2 Max, 96 GB
macOS 13.3
tensorflow-macos 2.9.0
tensorflow-metal 0.5.0
Repro Code:
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset
from tqdm import tqdm
import numpy as np

imdb = load_dataset('imdb')
sentences = imdb['train']['text'][:500]
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

for i, sentence in tqdm(enumerate(sentences)):
    inputs = tokenizer(sentence, truncation=True, return_tensors='tf')
    output = model(inputs).logits
    pred = np.argmax(output.numpy(), axis=1)
    if i % 100 == 0:
        print(f"len(input_ids): {inputs['input_ids'].shape[-1]}")
It becomes excruciatingly slow after roughly the 300-400th record. GPU utilization even dropped below 2% (less than the WindowServer process). Here are the prints:
Metal device set to: Apple M2 Max
systemMemory: 96.00 GB
maxCacheSize: 36.00 GB
3it [00:00, 10.87it/s]
len(input_ids): 391
101it [00:13, 6.38it/s]
len(input_ids): 215
201it [00:34, 4.78it/s]
len(input_ids): 237
301it [00:55, 4.26it/s]
len(input_ids): 256
401it [01:54, 1.12it/s]
len(input_ids): 55
500it [03:40, 2.27it/s]
I am aware this loop looks wrong:
- Use batches for the GPU.
- Use the CPU if you want to process one record at a time.
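For reference, a batched version would look roughly like the sketch below. The `chunked` helper is mine; the inference part (shown as comments) reuses the same `tokenizer`/`model` from the repro above with `padding=True` added so each batch has a uniform shape, and `batch_size=32` is an arbitrary choice, not a recommendation.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each of length <= batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# The inference loop would then become something like (untested on Metal):
#
# for batch in chunked(sentences, 32):
#     inputs = tokenizer(batch, truncation=True, padding=True, return_tensors='tf')
#     logits = model(inputs).logits
#     preds = np.argmax(logits.numpy(), axis=1)
```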
But it is still unsettling to watch the GPU utilization decay, because I don't think this happens on Colab (or plain Linux with CUDA). So it seems to have something to do with Apple Silicon / Metal.
I just wonder what the root cause could be. If a bug is indeed lurking around, it may rear its head when I do longer, bigger real training runs.