Hi all, I am enjoying the transformers library, but it is very slow for me: a small request with Llama-2-7b takes 15-30 minutes, whereas the same prompt takes a few seconds with ollama.ai. I am just using the pipeline method; has anyone else seen this? My setup is:
conda create --name pytorch39 python=3.9
conda activate pytorch39
conda install -c huggingface transformers
conda install pytorch-nightly::pytorch torchvision torchaudio -c pytorch-nightly
conda install -c anaconda pillow libtiff
conda install -c conda-forge accelerate einops
huggingface-cli login
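Before anything else, it is worth verifying that this environment actually got a CUDA build of PyTorch. The nightly install command above does not pin a GPU variant (e.g. pytorch-cuda=...), so depending on the platform the solver can resolve to a CPU-only package, and 15-30 minutes per request is typical of 7B inference on CPU. A quick sanity check:

import torch
print(torch.cuda.is_available())  # False means a CPU-only PyTorch build
print(torch.version.cuda)         # CUDA version the build targets, or None on CPU builds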
import torch
import transformers
from transformers import AutoTokenizer, pipeline

transformers.logging.set_verbosity_debug()
model = "meta-llama/Llama-2-7b-chat-hf" # meta-llama/Llama-2-7b-hf
llama_pipeline = pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
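One thing worth checking with device_map="auto": if the GPU does not have enough free memory for the roughly 13 GB of float16 weights, accelerate silently offloads layers to CPU RAM (or disk), and generation slows down by orders of magnitude. The placement the model actually ended up with is recorded on the model object:

print(llama_pipeline.model.hf_device_map)  # any "cpu" or "disk" entries mean layers were offloaded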
tokenizer = AutoTokenizer.from_pretrained(model, use_auth_token=True)  # use_auth_token is deprecated on newer transformers; use token=True there
prompt = 'Can you explain why grass is green?'
sequences = llama_pipeline(
    prompt,
    do_sample=True,  # without this, decoding is greedy and top_k is ignored
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=256,
)
print("Chatbot:", sequences[0]['generated_text'])