I have 64GB of RAM and an RTX 2060 with 6GB of VRAM.
I am running the vLLM image in Docker using the following command:
```
docker run --runtime nvidia --gpus all \
-v ./:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai \
--model mistralai/Pixtral-12B-2409 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral
```
However, when I send a request like this:
```
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
  "model": "mistralai/Pixtral-12B-2409",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail please."},
        {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}}
      ]
    }
  ]
}'
```
The request gets rejected, and after one or two attempts, the container crashes with an out of memory error.
Is there any way to run this model with my hardware?
Your environment does not have enough VRAM. Ideally, you would need at least 25GB for Pixtral 12B: on top of the memory needed to load the weights, VRAM (and RAM) is also needed for the KV cache and activations during inference.
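As a rough back-of-envelope check (assuming Pixtral 12B has about 12.4B parameters and is loaded in 16-bit precision), the weights alone are already far beyond a 6GB card:

```
# ~12.4B parameters x 2 bytes (bf16/fp16), ignoring the KV cache and activations
python3 -c "print(f'{12.4e9 * 2 / 1e9:.1f} GB just for the weights')"
# -> 24.8 GB just for the weights
```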
Also, I think the 20x0-generation (Turing) GeForce cards have no native bfloat16 support, so they are very slow with bfloat16 (which has become the standard dtype for almost all LLMs over the past year); on that card you would typically force --dtype half.
Anyway, I think quantization at load time is essential if you want to use it in practice, but there have been some problems with quantizing Pixtral. From what I can tell from reading GitHub, those problems have probably been resolved by now (see the issues quoted below).
Also, I think you can find pre-quantized checkpoints of the model on Hugging Face.
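If you still want to try on the RTX 2060, something along these lines is the usual starting point. This is only a sketch: the model id is a placeholder for whichever pre-quantized Pixtral checkpoint (AWQ/GPTQ, etc.) you pick on Hugging Face, the loader/tokenizer flags depend on how that checkpoint was exported, and even then a 6GB card forces a small context plus heavy CPU offload, so expect it to be slow at best:

```
# <quantized-pixtral-repo> is a placeholder -- substitute the checkpoint you choose.
# --dtype half              : Turing (RTX 20x0) has no native bfloat16
# --max-model-len 4096      : small context window to shrink the KV cache
# --gpu-memory-utilization  : fraction of the 6GB VRAM vLLM may claim
# --cpu-offload-gb 16       : spill part of the weights into system RAM (you have 64GB)
docker run --runtime nvidia --gpus all \
  -v ./:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai \
  --model <quantized-pixtral-repo> \
  --dtype half \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --cpu-offload-gb 16
```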
GitHub issue (opened 14 Dec 2024, labels: misc, stale):
### Anything you want to discuss about vllm.
What am I doing wrong here? When I load, quantize (4-bit), and shard the Llama 3.1 8B Instruct model across 2 A100s, it sucks up all the VRAM:
```
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=hf_............................" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2 \
--quantization bitsandbytes \
--load_format bitsandbytes
```
GPU monitoring:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:4F:00.0 Off |                    0 |
| N/A   32C    P0             64W /  300W |    72129MiB / 81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:50:00.0 Off |                    0 |
| N/A   35C    P0             66W /  300W |    72111MiB / 81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
```
This simply shouldn't be happening. I am 100% certain there is no other GPU process running, as I am consistently monitoring the GPU both when the model is loaded and when it isn't.
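For context on the issue quoted above: vLLM deliberately pre-allocates GPU memory for the KV cache up to --gpu-memory-utilization (0.9 by default), so roughly 72GiB in use on each 80GB A100 is expected even with 4-bit weights. A minimal sketch of capping that pre-allocation, reusing the quoted command (the values are illustrative, not tuned):

```
# Lower --gpu-memory-utilization so vLLM stops reserving ~90% of each GPU,
# and bound the context length so the smaller KV cache is still sufficient.
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --quantization bitsandbytes \
  --load_format bitsandbytes \
  --gpu-memory-utilization 0.3 \
  --max-model-len 8192
```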
GitHub issue (opened 18 Sep 2024, closed 18 Oct 2024, label: feature request):
### 🚀 The feature, motivation and pitch
On Linux, the NVIDIA driver doesn't provide "shared memory" (spilling into system RAM) the way Windows does, which makes it impossible to load Pixtral 12B onto a 3090 or 4090.
And since there doesn't seem to be a Transformers implementation of Pixtral, we can only use the vLLM codebase to load the model.
Could vLLM provide an option/API to create an offline FP8 quantization through the vLLM model loader?
### Alternatives
Although I am suggesting a new feature ("offline quantization through the vLLM library"),
it would also work for me if the vLLM/Mistral team provided an offline FP8 checkpoint directly.
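As a closing note on that feature request: newer vLLM versions can quantize weights to FP8 on the fly via --quantization fp8, with no pre-quantized checkpoint required, which is roughly what the issue asked for and should let a 24GB 3090/4090 hold Pixtral 12B's weights. This is a hedged sketch only: full FP8 compute needs Ada Lovelace/Hopper GPUs (older cards may fall back to weight-only kernels, if supported at all), and whether it composes with the mistral load-format flags depends on your vLLM version:

```
# On-the-fly FP8 weight quantization; verify support for your GPU and vLLM version.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai \
  --model mistralai/Pixtral-12B-2409 \
  --tokenizer_mode mistral \
  --load_format mistral \
  --config_format mistral \
  --quantization fp8 \
  --max-model-len 8192
```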