Hello,
I am trying to load a Llama 2 model with HuggingFacePipeline
on an AWS g5.4xlarge instance (1 GPU, 16 vCPUs, 64 GB RAM, 24 GB GPU memory).
With the code below I get the error below.
I also tried:
- other instance types,
- specifying 1 GPU explicitly,
- removing device_map="auto",
and I get the same errors.
I also tried just loading the model with
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", device_map="auto")
and this loads the model fine.
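Expanded a bit, this is the direct load that works for me (a minimal sketch; the hf_device_map print is just how I check where accelerate placed the weights):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets accelerate place the weights on the available GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# hf_device_map shows which device each module ended up on
print(model.hf_device_map)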
I think the issue is in the pipeline, or in the link between langchain and accelerate?
Any idea how to use the pipeline? (I have also put a sketch of the workaround I am considering at the end of this post.)
thanks
JD
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
import tiktoken
from langchain import HuggingFacePipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    task="text-generation",
    model_kwargs={
        "temperature": 0,
        "max_length": 2048,
        "torch_dtype": torch.bfloat16,
        "device_map": "auto",
        "load_in_4bit": True,
    },
)
/home/ml-app/DATA_DESIGN/code-envs/python/py_310_sample_llm/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.31s/it]
WARNING:langchain.llms.huggingface_pipeline:Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use available GPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.
ValueError: The model has been loaded with `accelerate` and therefore cannot be moved to a specific device. Please discard the `device` argument when creating your pipeline object.
Just in case, here is the instance GPU info after executing the line above. Before running it, the GPU memory is free.
Tue Aug 29 09:31:16 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                Persistence-M  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf         Pwr:Usage/Cap  |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                   Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0             60W / 300W  | 10134MiB /  23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=======================================================================================|
|    0   N/A  N/A     16298      C   …python/py_310_sample_llm/bin/python       7492MiB  |
|    0   N/A  N/A     19772      C   …python/py_310_sample_llm/bin/python       2624MiB  |
+---------------------------------------------------------------------------------------+
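Edit: for clarity, here is the kind of workaround I am considering based on the error message (an untested sketch on my side): build the transformers pipeline myself with device_map="auto" and hand the ready-made pipeline to HuggingFacePipeline, so that langchain never tries to move the model to a device.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain import HuggingFacePipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# accelerate handles placement via device_map, so no device argument anywhere below
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=2048,
    do_sample=False,
)
# wrap the pre-built pipeline instead of calling from_model_id
llm = HuggingFacePipeline(pipeline=pipe)

Does this look like the right way to combine accelerate's device_map with langchain, or is there a supported way to pass device_map through from_model_id?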