Run Pixtral-12B-2409 locally

I have 64GB of RAM and an RTX 2060 with 6GB VRAM.
I am running the vLLM image in Docker using the following command:

docker run --runtime nvidia --gpus all \
    -v ./:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai \
    --model mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --load_format mistral \
    --config_format mistral

However, when I send a request like this:

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [
      {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail please."},
            {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}}
        ]
      }
    ]
  }'

The request gets rejected, and after one or two attempts the container crashes with an out-of-memory error.

Is there any way to run this model with my hardware?


Your environment does not have enough VRAM. Pixtral 12B in bfloat16 needs roughly 12 billion parameters × 2 bytes ≈ 24 GB for the weights alone, so ideally you would want at least 25 GB. On top of loading the model, additional VRAM (and RAM) is needed during inference for the KV cache and activations.

Also, 20x0-generation GeForce (Turing) cards have no native bfloat16 support (bfloat16 has been the standard dtype for almost all LLMs since last year), so I believe vLLM will refuse to run the model in bf16 on that GPU and ask you to switch to float16 with --dtype half.
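
For reference, this is just the command from your question with --dtype half appended; it only works around the bfloat16 issue and does not fix the out-of-memory problem by itself:

docker run --runtime nvidia --gpus all \
    -v ./:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai \
    --model mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --load_format mistral \
    --config_format mistral \
    --dtype half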

Anyway, I think quantization at load time is essential if you want to use this in practice, but there were some problems with quantizing Pixtral. From what I can tell from reading the GitHub issues, those have probably been resolved by now.
Also, you should be able to find pre-quantized files on Hugging Face.
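
A rough sketch of what that could look like, assuming an AWQ/GPTQ export of Pixtral-12B-2409 exists on Hugging Face (the repo name below is a placeholder, and such exports are usually in HF format, so the mistral-specific flags are dropped). Even 4-bit weights are around 7 GB, so CPU offload and a shorter context are still needed with 6 GB of VRAM:

# <quantized-pixtral-repo> is a placeholder: substitute an AWQ/GPTQ export
# of Pixtral-12B-2409 found on Hugging Face. The flags below trade context
# length and speed for memory; the exact values will need tuning, and CPU
# offload may not be supported for every quantization method.
docker run --runtime nvidia --gpus all \
    -v ./:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai \
    --model <quantized-pixtral-repo> \
    --dtype half \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --cpu-offload-gb 8

Even then, expect generation to be slow, since part of the weights would live in system RAM and get streamed to the GPU.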
