Using the GPU for a ggml model with LangChain

I’m currently using a ggml-format model (13b-chimera.ggmlv3.q4_1.bin) in an app built with LangChain. Despite running it on a VM with a GPU, the program still only uses the CPU.

I’ve tried adding the line `torch.cuda.set_device(torch.device("cuda:0"))`, which raises `AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'`. However, `torch.cuda.is_available()` returns `True`, indicating that CUDA can find the GPU.
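For reference, here is the quick diagnostic I ran to check what the PyTorch build reports about CUDA (this runs safely even on a CPU-only build):

```python
import torch

# Confirm whether this PyTorch build actually includes CUDA support.
print("CUDA available:", torch.cuda.is_available())
print("Compiled CUDA version:", torch.version.cuda)  # None on CPU-only builds
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```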

Can ggml models work with GPUs in the first place? Do I need to use another format like GPTQ? If so, how would I implement it in the program, since ggml models are a single .bin file while GPTQ models seem to be a collection of files?

You can check out the GPU instructions here: https://python.langchain.com/docs/integrations/llms/llamacpp#gpu
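Following those docs, a minimal sketch of GPU offloading with LangChain's `LlamaCpp` wrapper might look like the block below. This assumes `llama-cpp-python` was compiled with CUDA support (e.g. `CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir`); the `n_gpu_layers` value is an illustrative guess to tune for your VRAM:

```python
from langchain.llms import LlamaCpp

# ggml models are loaded by llama.cpp, not PyTorch, so torch.cuda settings
# have no effect here. GPU offload is controlled by n_gpu_layers instead.
llm = LlamaCpp(
    model_path="13b-chimera.ggmlv3.q4_1.bin",
    n_gpu_layers=40,  # how many layers to offload to the GPU; tune for your VRAM
    n_batch=512,
    verbose=True,     # the startup log should show layers being offloaded
)

print(llm("Q: Name the planets in the solar system. A:"))
```

If the startup log shows `BLAS = 1` and layers being assigned to the GPU, the offload is working; if not, `llama-cpp-python` was likely built without CUDA and needs to be reinstalled with the flags above.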