I'm trying to load the mlx-community/Mamba-Codestral-7B-v0.1-8bit model from disk.
from mlx_lm import load, generate

# load the locally cloned MLX checkpoint from disk
self.model, self.tokenizer = load("../models/Mamba-Codestral-7B-v0.1-8bit_model")

This fails with:
ERROR:root:Model type mamba2 not supported.
codestral-mistral-models | Error loading model mlx-community/Mamba-Codestral-7B-v0.1-8bit: Model type mamba2 not supported.
codestral-mistral-models | ERROR:root:Error checking if model is saved: Model type mamba2 not supported.
codestral-mistral-models | ERROR:root:Error loading model: Model type mamba2 not supported.
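For the mlx-lm path, the loader appears to resolve config.json's model_type to a module under mlx_lm.models, so "Model type mamba2 not supported" presumably means the mlx-lm build installed in the image ships no mamba2 implementation. A minimal check, assuming that module layout:

import importlib.util

import mlx_lm

# Which mlx-lm release is installed, and does it ship a mamba2 model module?
print("mlx-lm version:", getattr(mlx_lm, "__version__", "unknown"))
print("mamba2 module present:", importlib.util.find_spec("mlx_lm.models.mamba2") is not None)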
Alternatively, loading it with transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer, Mamba2ForCausalLM

self.tokenizer = AutoTokenizer.from_pretrained(self.save_dir)
self.model = AutoModelForCausalLM.from_pretrained(self.save_dir)
# self.model = Mamba2ForCausalLM.from_pretrained(self.save_dir)

This fails with:
Error loading model mlx-community/Mamba-Codestral-7B-v0.1-8bit: The model’s quantization config from the arguments has no quant_method
attribute. Make sure that the model has been correctly quantized…
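As far as I understand, transformers picks a quantizer based on the quant_method field inside quantization_config (e.g. "bitsandbytes", "gptq", "awq"), while the MLX export only records group_size and bits (see config.json below), hence the complaint. A quick check against the cloned config, reusing the path from the load() call above:

import json

# Path assumption: the same local clone used in the load() call above.
with open("../models/Mamba-Codestral-7B-v0.1-8bit_model/config.json") as f:
    cfg = json.load(f)

quant = cfg.get("quantization_config") or {}
print(quant)                    # -> {'group_size': 64, 'bits': 8}
print("quant_method" in quant)  # -> False, which matches the error above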
The model was cloned from Hugging Face with "git clone https://huggingface.co/mlx-community/Mamba-Codestral-7B-v0.1-8bit".
Contents of the model directory:
.git
.gitattributes
README.md
config.json
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json
config.json:
{
  "architectures": [
    "Mamba2ForCausalLM"
  ],
  "bos_token_id": 0,
  "chunk_size": 256,
  "conv_kernel": 4,
  "eos_token_id": 0,
  "expand": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.1,
  "intermediate_size": 8192,
  "layer_norm_epsilon": 1e-05,
  "model_type": "mamba2",
  "n_groups": 8,
  "norm_before_gate": true,
  "num_heads": 128,
  "num_hidden_layers": 64,
  "pad_token_id": 0,
  "quantization": {
    "group_size": 64,
    "bits": 8
  },
  "quantization_config": {
    "group_size": 64,
    "bits": 8
  },
  "rescale_prenorm_residual": false,
  "residual_in_fp32": true,
  "rms_norm": true,
  "state_size": 128,
  "tie_word_embeddings": false,
  "time_step_floor": 0.0001,
  "time_step_init_scheme": "random",
  "time_step_limit": [
    0.0,
    Infinity
  ],
  "time_step_max": 0.1,
  "time_step_min": 0.001,
  "time_step_rank": 256,
  "time_step_scale": 1.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.0.dev0",
  "use_bias": false,
  "use_cache": true,
  "use_conv_bias": true,
  "vocab_size": 32768
}
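Both quantization blocks are in MLX's own format (group_size/bits only, presumably written by mlx_lm.convert), with no quant_method for transformers to dispatch on, and the safetensors presumably hold MLX-packed 8-bit weights rather than bitsandbytes/GPTQ tensors. If all that's needed is an 8-bit model under transformers, one possible workaround sketch is to quantize the upstream bf16 checkpoint on the fly with bitsandbytes; the repo id below and whether it loads directly through AutoModelForCausalLM are assumptions on my part:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: mistralai/Mamba-Codestral-7B-v0.1 is the upstream bf16 checkpoint;
# bitsandbytes then supplies a quant_method transformers understands.
repo_id = "mistralai/Mamba-Codestral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)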
Dockerfile:
# Use a base image with PyTorch and GPU support
FROM huggingface/transformers-pytorch-gpu:latest

# Set working directory inside the container
WORKDIR /workspace

# Install necessary packages (quote the version specifier so the shell
# does not treat ">=" as a redirection)
RUN apt-get update && \
    pip3 install "mistral_inference>=1" mamba-ssm causal-conv1d && \
    pip3 install langchain torch accelerate flask pydantic uvicorn mlx-lm && \
    pip3 install python-dotenv langchain-community langchain-huggingface llama-index llama-index-embeddings-huggingface && \
    pip3 install peft auto-gptq optimum bitsandbytes sentence-transformers numpy fastapi ...