HuggingFacePipeline Llama2 load_in_4bit from_model_id the model has been loaded with `accelerate` and therefore cannot be moved to a specific device

I am trying to load a Llama 2 model with HuggingFacePipeline on an AWS g5.4xlarge instance (1 GPU, 16 vCPUs, 64 GB RAM, 24 GB GPU memory).
With the code below I get the error shown further down.

I also tried:
- other instance types
- specifying 1 GPU
- removing device_map="auto"
and I get the same error.

I also tried loading the model directly with
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")
and this loads the model fine.
So I think the issue is in the pipeline, or in the link between langchain and accelerate?

Any idea how to make this work with the pipeline?

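from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
import tiktoken
from langchain import HuggingFacePipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model_id, task and model_kwargs lines were lost from the post. The model ID is
# taken from the from_pretrained call above; task="text-generation" and the
# model_kwargs wrapper are assumptions.
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    task="text-generation",
    model_kwargs={
        "temperature": 0,
        "max_length": 2048,
        "torch_dtype": torch.bfloat16,
        "device_map": "auto",
        "load_in_4bit": True,
    },
)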


/home/ml-app/DATA_DESIGN/code-envs/python/py_310_sample_llm/lib/python3.9/site-packages/transformers/generation/ UserWarning: do_sample is set to False. However, temperature is set to 0 – this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.31s/it]
WARNING:langchain.llms.huggingface_pipeline:Device has 1 GPUs available. Provide device={deviceId} to from_model_id to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.

ValueError: The model has been loaded with accelerate and therefore cannot be moved to a specific device. Please discard the device argument when creating your pipeline object.
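
To make the error above easier to reason about, here is a minimal pure-Python sketch of what seems to be happening. It is not the actual transformers or langchain code. FakeModel, make_pipeline and from_model_id_like are hypothetical stand-ins. The assumption is that the pipeline refuses any non-None device once accelerate has placed the model (which sets hf_device_map), and that the wrapper always forwards a device, defaulting to -1 as the warning above suggests.

```python
class FakeModel:
    """Stand-in for a model; hf_device_map is set when accelerate placed it."""

    def __init__(self, hf_device_map=None):
        self.hf_device_map = hf_device_map


def make_pipeline(model, device=None):
    """Hypothetical, simplified version of the check behind the ValueError."""
    placed_by_accelerate = getattr(model, "hf_device_map", None) is not None
    if placed_by_accelerate and device is not None:
        raise ValueError(
            "The model has been loaded with accelerate and therefore "
            "cannot be moved to a specific device."
        )
    return {"model": model, "device": device}


def from_model_id_like(model, device=-1):
    """Hypothetical wrapper that always forwards device (default -1)."""
    return make_pipeline(model, device=device)


if __name__ == "__main__":
    placed = FakeModel(hf_device_map={"": 0})  # as if loaded with device_map="auto"
    try:
        from_model_id_like(placed)
    except ValueError as err:
        print("wrapper fails:", err)
    # Building the pipeline without any device argument passes the check.
    print("direct call ok:", make_pipeline(placed)["device"])
```

If this reading is right, it would explain why changing the device number does not help: any explicit device, including the default -1, conflicts with a model that accelerate has already placed.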

Just in case, here is the GPU info for the instance after executing the line. Before that, the GPU memory is free.

Tue Aug 29 09:31:16 2023
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 30C P0 60W / 300W | 10134MiB / 23028MiB | 0% Default |
| | | N/A |

| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
| 0 N/A N/A 16298 C …python/py_310_sample_llm/bin/python 7492MiB |
| 0 N/A N/A 19772 C …python/py_310_sample_llm/bin/python 2624MiB |


It's probably the pipeline on the langchain side. Just like you said, loading with model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto") works. There is nothing we can do on our side.
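
A possible workaround, offered as an untested sketch and not as something confirmed in this thread: load the model yourself, build a transformers pipeline without any device argument, and hand that pipeline to HuggingFacePipeline. build_llm and its max_new_tokens default are hypothetical names chosen for illustration. The langchain import path depends on the installed version, and the 4-bit settings use BitsAndBytesConfig in place of the load_in_4bit keyword from the question. do_sample=False is used for greedy decoding instead of temperature=0, which is what the UserWarning in the log complains about.

```python
def build_llm(model_id="meta-llama/Llama-2-7b-chat-hf", max_new_tokens=256):
    """Wrap a 4-bit model in a LangChain LLM without passing a device."""
    # Imports are local so that defining this function does not require
    # the heavy packages (torch, transformers, langchain) to be installed.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        pipeline,
    )
    from langchain.llms import HuggingFacePipeline  # path varies by version

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # let accelerate place the weights
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
    )

    # No device argument here: the model is already placed by accelerate.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
    )
    return HuggingFacePipeline(pipeline=pipe)
```

Calling llm = build_llm() needs a GPU, the accelerate and bitsandbytes packages, and access to the gated Llama 2 weights, so treat it as a starting point to adapt.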