Hosting Mistral 7B quantized to 4-bit

Hi there, I am trying to host a quantized fine-tuned version of Mistral 7B on a HF Inference Endpoint, and I am running into the following error. Any ideas?

Server message: Endpoint failed to start. Standard error output:

You shouldn't move a model when it is dispatched on multiple devices.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 387, in get_model
    return CausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 518, in __init__
    model = model.cuda()
  File "/opt/conda/lib/python3.10/site-packages/accelerate/big_modeling.py", line 426, in wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2564, in cuda
    raise ValueError(
ValueError: Calling `cuda()` is not supported for 4-bit or 8-bit quantized models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

{"timestamp":"2024-02-12T23:52:27.899804Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-02-12T23:52:27.899839Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart

Some people are saying the endpoint can’t run GGUF models, but I can’t find any documentation confirming that.

I ended up using a custom handler and loading the model with bitsandbytes instead; the text-generation pipelines are too hard to debug. I was also able to get it up and running with AWQ and vLLM via a custom handler, with great inference speed. Sketches of both approaches are below.
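
For anyone who lands here: a minimal sketch of what the bitsandbytes handler can look like, assuming the standard Inference Endpoints layout of a `handler.py` with an `EndpointHandler` class at the repo root. The generation defaults are illustrative, not my exact settings:

```python
# handler.py — minimal sketch of a custom Inference Endpoints handler
# that loads a 4-bit quantized model with bitsandbytes.
# Assumes the fine-tuned Mistral 7B weights live in the same repo.
from typing import Any, Dict, List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


class EndpointHandler:
    def __init__(self, path: str = ""):
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path,
            quantization_config=quant_config,
            device_map="auto",  # accelerate places the weights; no .cuda() call
        )

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        prompt = data["inputs"]
        params = data.get("parameters", {})
        params.setdefault("max_new_tokens", 256)  # illustrative default
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, **params)
        text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return [{"generated_text": text}]
```

The point is that `device_map="auto"` lets accelerate put the quantized weights on the GPU at load time, so nothing ever calls `.cuda()` on the model afterwards, which is exactly the call in TGI's `CausalLM` path that raises the `ValueError` above.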
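And a similar sketch for the AWQ + vLLM route, assuming the repo holds AWQ-quantized weights; the parameters are again placeholders rather than my exact setup:

```python
# handler.py — sketch of a custom handler serving AWQ weights with vLLM.
from typing import Any, Dict, List

from vllm import LLM, SamplingParams


class EndpointHandler:
    def __init__(self, path: str = ""):
        # vLLM loads and places the AWQ weights itself, so there is
        # again no manual .cuda() call to trip over.
        self.llm = LLM(model=path, quantization="awq")

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        prompt = data["inputs"]
        sampling = SamplingParams(**data.get("parameters", {}))
        outputs = self.llm.generate([prompt], sampling)
        return [{"generated_text": outputs[0].outputs[0].text}]
```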