Performance of mpt-7b on Mac M1

This is cross-posted here: mosaicml/mpt-7b · mpt-7b taking several minutes on mac m1?. I wasn’t sure if this was a model problem or not when I posted it, but I figure it will help others more if it’s in the beginners forum, in case there’s a config change I can make.

Running the pipeline below with max_new_tokens=2 takes about 2 minutes (each additional token adds roughly 1 minute). Is this expected on an M1 Mac (CPU, not Metal)? Other models run in a few seconds with similar code (gpt2, distilbert-base-cased-distilled-squad); a comparison sketch follows.
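
For reference, the faster comparison uses essentially the same pipeline call; here’s a minimal sketch of the gpt2 version (prompt and arguments mirror the MPT code below):

import transformers

# Same pipeline pattern as the MPT code below, but with gpt2; this
# finishes in a few seconds on the same machine.
pipe = transformers.pipeline("text-generation", model="gpt2")
print(pipe("Here is a recipe for vegan banana bread",
           max_new_tokens=2,
           do_sample=False))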

System: macOS 12.7.1, M1 Pro chip, 16 GB RAM
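
For context, a rough back-of-envelope on the model’s memory footprint relative to that RAM (my own estimate; ~6.7B parameters is an assumption based on the model name, not a measured figure):

# Approximate weight sizes for a ~6.7B-parameter model.
params = 6.7e9
print(f"float32 weights: ~{params * 4 / 1e9:.0f} GB")  # ~27 GB, well over 16 GB RAM
print(f"float16 weights: ~{params * 2 / 1e9:.0f} GB")  # ~13 GB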

Included below:

  1. Source code. It’s just the example from the model card (mosaicml/mpt-7b · Hugging Face), but without CUDA. I’ve started reading about Apple Metal, which might be useful, but I’m not sure if it’s required. Example: CUDA for M1 MacBook Pro - MATLAB Answers - MATLAB Central
  2. Warnings
  3. Profile (cProfile)
  4. Some of the dependencies (maybe the most relevant is torch @ https://download.pytorch.org/whl/cpu/torch-2.1.0-cp311-none-macosx_11_0_arm64.whl). The list is truncated to the more interesting entries to save space.

I’ve also tried the following (a condensed code sketch of these variations follows the list):

  1. Adding with torch.autocast('cpu', dtype=torch.float32) around the pipeline run call.
  2. torch_dtype=torch.float32 in the model getter.
  3. Playing around with toggling do_sample and use_cache (I’m a bit new, so I’m still learning what all the options are here and in ML pipelines in general).
  4. Trying device="mps" on the pipe creation (it fails with an out-of-memory error: RuntimeError: MPS backend out of memory (MPS allocated: 18.02 GB, other allocations: 7.98 MB, max allowed: 18.13 GB). Tried to allocate 192.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).)
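
A condensed sketch of those variations (my paraphrase for illustration, not a single verified run):

import torch
import transformers

# Attempt 2: explicit float32 in the model getter.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,
    torch_dtype=torch.float32)

tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Attempt 4: adding device="mps" here is what raised the OOM above.
pipe = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

# Attempt 1: autocast around the run (CPU autocast may warn that only
# bfloat16 is supported and disable itself for float32).
# Attempt 3: toggling do_sample and use_cache.
with torch.autocast("cpu", dtype=torch.float32):
    res = pipe("Here is a recipe for vegan banana bread",
               max_new_tokens=2,
               do_sample=True,
               use_cache=False)
print(res)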

Code:

import cProfile
from datetime import datetime
import time
import transformers
from unittest import IsolatedAsyncioTestCase

class Unittest(IsolatedAsyncioTestCase):
    async def print_duration(self, fn, *args, **kwargs):
        # Timing helper (assumed implementation, matching the output format
        # in the profile below): reports wall-clock and process time.
        start, start_cpu = time.perf_counter(), time.process_time()
        res = fn(*args, **kwargs)
        print(f"Took {time.perf_counter() - start:.2f}s, "
              f"with {time.process_time() - start_cpu:.2f} s of process time"
              f"__at__{datetime.now().time()}")
        return res

    async def test_demo_mpt_7b_performance(self):
        # MPT ships custom modeling code, hence trust_remote_code=True.
        model = transformers.AutoModelForCausalLM.from_pretrained(
            "mosaicml/mpt-7b",
            trust_remote_code=True)

        # MPT-7B uses the EleutherAI/gpt-neox-20b tokenizer.
        tokenizer = transformers.AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

        pipe = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer)

        print(f"starting pipe__at__{datetime.now().time()}")
        with cProfile.Profile() as pr:
            res = await self.print_duration(pipe,
                                            "Here is a recipe for vegan banana bread",
                                            max_new_tokens=2,
                                            do_sample=False,
                                            use_cache=True)

        pr.print_stats("cumulative")

        print(res)
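
For completeness, the test can also be run as a plain script by appending the standard unittest entry point (hugging_face_forum_performance.py is the filename from the profile below):

if __name__ == "__main__":
    import unittest
    unittest.main()  # e.g. python hugging_face_forum_performance.py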

Warnings:

.../mosaicml/mpt-7b/ada218f9a93b5f1c6dce48a4cc9ff01fcba431e7/configuration_mpt.py:90: DeprecationWarning: verbose argument for MPTConfig is now ignored and will be removed. Use python_log_level instead.
.../mosaicml/mpt-7b/ada218f9a93b5f1c6dce48a4cc9ff01fcba431e7/configuration_mpt.py:97: UserWarning: alibi is turned on, setting `learned_pos_emb` to `False.`
...einops/_torch_specific.py:108: ImportWarning: allow_ops_in_compiled_graph failed to import torch: ensure pytorch >=2.0
  warnings.warn("allow_ops_in_compiled_graph failed to import torch: ensure pytorch >=2.0", ImportWarning)
...Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
...utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )

Profile:

Took 172.56s, with 43.58 s of process time__at__15:45:26.780838
         28817 function calls (26003 primitive calls) in 172.566 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002  172.567  172.567 hugging_face_forum_performance.py:33(print_duration)
        1    0.000    0.000  172.560  172.560 text_generation.py:167(__call__)
        1    0.001    0.001  172.559  172.559 base.py:1077(__call__)
        1    0.001    0.001  172.559  172.559 base.py:1145(run_single)
        1    0.002    0.002  172.510  172.510 base.py:1037(forward)
        1    0.001    0.001  172.504  172.504 text_generation.py:240(_forward)
      3/1    0.003    0.001  172.502  172.502 _contextlib.py:112(decorate_context)
        1    0.008    0.008  172.498  172.498 utils.py:1395(generate)
        1    0.006    0.006  172.487  172.487 utils.py:2411(greedy_search)
    780/2    0.003    0.000  172.432   86.216 module.py:1514(_wrapped_call_impl)
    780/2    0.016    0.000  172.432   86.216 module.py:1520(_call_impl)
        2    0.001    0.000  172.432   86.216 modeling_mpt.py:269(forward)
      258  172.160    0.667  172.160    0.667 {built-in method torch._C._nn.linear}
        2    0.010    0.005  167.512   83.756 modeling_mpt.py:146(forward)
       64    0.012    0.000  167.471    2.617 blocks.py:32(forward)
      256    0.002    0.000  167.244    0.653 linear.py:113(forward)
       64    0.002    0.000  112.702    1.761 ffn.py:23(forward)
       64    0.004    0.000   54.681    0.854 attention.py:263(forward)
        4    0.000    0.000    4.923    1.231 custom_embedding.py:7(forward)
       64    0.009    0.000    0.099    0.002 attention.py:48(scaled_multihead_dot_product_attention)
      130    0.003    0.000    0.058    0.000 norm.py:20(forward)
       68    0.052    0.001    0.052    0.001 {built-in method torch.cat}

Dependencies (subset):

accelerate==0.25.0
einops==0.7.0
numpy==1.26.2
pandas==2.1.4
pydantic==2.5.2
pydantic_core==2.14.5
python-dateutil==2.8.2
pytz==2023.3.post1
safetensors==0.4.1
tokenizers==0.15.0
torch @ https://download.pytorch.org/whl/cpu/torch-2.1.0-cp311-none-macosx_11_0_arm64.whl
tqdm==4.66.1
transformers==4.36.2