I’m currently using a GGML-format model (13b-chimera.ggmlv3.q4_1.bin) in an app built with LangChain. I’ve found that the program still only uses the CPU, even though it’s running on a VM with a GPU.
I’ve tried forcing the device with `torch.cuda.set_device(torch.device("cuda:0"))`, which raises `AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'`. However, `torch.cuda.is_available()` returns `True`, so PyTorch can at least detect the GPU.
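For reference, the full check I’m running is just this (the output and error are exactly as noted in the comments):

```python
import torch

# Prints True on this VM, so PyTorch sees the CUDA device.
print(torch.cuda.is_available())

# Raises: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
torch.cuda.set_device(torch.device("cuda:0"))
```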
Can GGML models use a GPU at all? Do I need to switch to another format like GPTQ? If so, how would I wire that into the program, given that a GGML model is a single .bin file while GPTQ models seem to be a collection of files?
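For context, here’s a minimal sketch of the kind of loading code I mean, using LangChain’s `LlamaCpp` wrapper as an example (this is illustrative, not my exact app code):

```python
from langchain.llms import LlamaCpp

# Illustrative sketch: a GGML model is loaded from a single .bin file.
# No GPU-related options are set here, which may be part of why
# inference stays on the CPU.
llm = LlamaCpp(model_path="13b-chimera.ggmlv3.q4_1.bin")
print(llm("Hello, world"))
```

If GPTQ is the way to go, what would the equivalent load look like, given that the model is a directory of files rather than one .bin?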