I have 64GB of RAM and an RTX 2060 with 6GB of VRAM.
I am running the vLLM image in Docker using the following command:
```
docker run --runtime nvidia --gpus all \
-v ./:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai \
--model mistralai/Pixtral-12B-2409 \
--tokenizer_mode mistral \
--load_format mistral \
--config_format mistral
```
However, when I send a request like this:
```
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
  "model": "mistralai/Pixtral-12B-2409",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail please."},
        {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}}
      ]
    }
  ]
}'
```
The request gets rejected, and after one or two attempts, the container crashes with an out of memory error.
Is there any way to run this model with my hardware?
Your environment does not have enough VRAM. Ideally, you would need at least 25GB for Pixtral 12B: on top of the memory needed to load the weights, VRAM (and RAM) is also needed for the KV cache and activations during inference.
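As a rough back-of-envelope check (assuming Pixtral 12B has about 12.4B parameters and is loaded in 16-bit precision), the weights alone are already far beyond a 6GB card:

```
# ~12.4B parameters x 2 bytes (bf16/fp16), ignoring the KV cache and activations
python3 -c "print(f'{12.4e9 * 2 / 1e9:.1f} GB just for the weights')"
# -> 24.8 GB just for the weights
```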
Also, I think the 20x0-generation (Turing) GeForce cards have no native bfloat16 support, so they are very slow with bfloat16 (which has become the standard dtype for almost all LLMs over the past year); on that card you would typically force --dtype half.
Anyway, I think quantization at load time is essential if you want to use it in practice, but there have been some problems with quantizing Pixtral. From what I can tell from reading GitHub, those problems have probably been resolved by now (see the issues quoted below).
Also, I think you can find pre-quantized checkpoints of the model on Hugging Face.
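If you still want to try on the RTX 2060, something along these lines is the usual starting point. This is only a sketch: the model id is a placeholder for whichever pre-quantized Pixtral checkpoint (AWQ/GPTQ, etc.) you pick on Hugging Face, the loader/tokenizer flags depend on how that checkpoint was exported, and even then a 6GB card forces a small context plus heavy CPU offload, so expect it to be slow at best:

```
# <quantized-pixtral-repo> is a placeholder -- substitute the checkpoint you choose.
# --dtype half              : Turing (RTX 20x0) has no native bfloat16
# --max-model-len 4096      : small context window to shrink the KV cache
# --gpu-memory-utilization  : fraction of the 6GB VRAM vLLM may claim
# --cpu-offload-gb 16       : spill part of the weights into system RAM (you have 64GB)
docker run --runtime nvidia --gpus all \
  -v ./:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai \
  --model <quantized-pixtral-repo> \
  --dtype half \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --cpu-offload-gb 16
```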
GitHub issue (opened 14 Dec 2024, labels: misc, stale):
### Anything you want to discuss about vllm.
What am I doing wrong here? When I load, quantize (4-bit), and shard the Llama 3.1 8B Instruct model across 2 A100s, it sucks up all the VRAM:
```
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=hf_............................" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2 \
--quantization bitsandbytes \
--load_format bitsandbytes
```
GPU monitoring:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:4F:00.0 Off |                    0 |
| N/A   32C    P0             64W /  300W |    72129MiB / 81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:50:00.0 Off |                    0 |
| N/A   35C    P0             66W /  300W |    72111MiB / 81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
```
This simply shouldn't be happening. I am 100% certain there is no other GPU process running, as I am consistently monitoring the GPU both when the model is loaded and when it isn't.
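For context on the issue quoted above: vLLM deliberately pre-allocates GPU memory for the KV cache up to --gpu-memory-utilization (0.9 by default), so roughly 72GiB in use on each 80GB A100 is expected even with 4-bit weights. A minimal sketch of capping that pre-allocation, reusing the quoted command (the values are illustrative, not tuned):

```
# Lower --gpu-memory-utilization so vLLM stops reserving ~90% of each GPU,
# and bound the context length so the smaller KV cache is still sufficient.
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --quantization bitsandbytes \
  --load_format bitsandbytes \
  --gpu-memory-utilization 0.3 \
  --max-model-len 8192
```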
GitHub issue (opened 18 Sep 2024, closed 18 Oct 2024, label: feature request):
### 🚀 The feature, motivation and pitch
On Linux, the NVIDIA driver doesn't provide "shared memory" (spilling into system RAM) the way Windows does, which makes it impossible to load Pixtral 12B onto a 3090 or 4090.
And since there doesn't seem to be a Transformers implementation of Pixtral, we can only use the vLLM codebase to load the model.
Could vLLM provide an option/API to create an offline FP8 quantization through the vLLM model loader?
### Alternatives
Although I am suggesting a new feature ("offline quantization through the vLLM library"),
it would also work for me if the vLLM/Mistral team provided an offline FP8 checkpoint directly.
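As a closing note on that feature request: newer vLLM versions can quantize weights to FP8 on the fly via --quantization fp8, with no pre-quantized checkpoint required, which is roughly what the issue asked for and should let a 24GB 3090/4090 hold Pixtral 12B's weights. This is a hedged sketch only: full FP8 compute needs Ada Lovelace/Hopper GPUs (older cards may fall back to weight-only kernels, if supported at all), and whether it composes with the mistral load-format flags depends on your vLLM version:

```
# On-the-fly FP8 weight quantization; verify support for your GPU and vLLM version.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai \
  --model mistralai/Pixtral-12B-2409 \
  --tokenizer_mode mistral \
  --load_format mistral \
  --config_format mistral \
  --quantization fp8 \
  --max-model-len 8192
```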