Short, truncated answers

Hello,

I am having trouble getting a meaningful dialogue going with LLMs. Is there something I am doing wrong below? Thanks so much for your generosity in helping out. I am trying the simplest code:

from langchain.llms import HuggingFaceHub  # needs HUGGINGFACEHUB_API_TOKEN set in the environment

my_model = "meta-llama/Llama-2-7b-chat-hf"
# my_model = "google/flan-t5-xl"
llm = HuggingFaceHub(repo_id=my_model, model_kwargs={"temperature": 0.05, "max_length": 1024})
text = "Tell me about the seven Harry Potter novels in detail."
print(llm(text))

It attempts an answer (I did get a Pro subscription on Hugging Face, so I can now use Llama models): “The seven Harry Potter novels are a series of seven fantasy novels written by J”. That is it; it stops mid-sentence at “J”. I tried running on my laptop without a GPU, and on Google Colab with the T4 GPU runtime.
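
One thing I am wondering about, in case it matters: from what I have read, max_length counts the prompt tokens as well as the reply, and the hosted endpoint may apply a small default generation cap unless max_new_tokens is passed explicitly. This variant is only a guess on my part, untested:

llm = HuggingFaceHub(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    model_kwargs={"temperature": 0.05, "max_new_tokens": 512},  # caps the reply only, not prompt + reply
)
print(llm("Tell me about the seven Harry Potter novels in detail."))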

flan-t5-xl times out. flan-t5-base completes one sentence, “Harry Potter and the Philosopher’s Stone is a series of seven books written by Harry Potter and the Philosopher’s Stone.”

I am playing with all this code involving custom PDFs and vector stores, but I am stymied by these curt answers. Thank you for any help and pointers. I am open to reasonable paid solutions; please share which models and infrastructure choices you have found to be cost-effective for user-friendly conversations.

Thanks again.

To be honest, I don’t know why it isn’t working for you, but anyway, here is the code that did work for me (I ran it on a SageMaker notebook instance with 16 GB of GPU RAM):

import torch
import transformers
from transformers import AutoTokenizer

model_id = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit loading needs the bitsandbytes package and a CUDA GPU
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map='auto',
    torch_dtype=torch.float16)

model.eval()

hg_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,   # include the prompt in the output
    task='text-generation',
    do_sample=True,          # temperature only takes effect when sampling is on
    temperature=0.1,
    max_new_tokens=512,      # caps the reply only, not prompt + reply
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

prompt = "What's the difference between AWS and GCP?"
output = hg_pipeline(prompt)
print(output[0]["generated_text"])
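
Since you mentioned vector stores: the same pipeline can also be wrapped as a LangChain LLM so it drops into your PDF/vectorstore chains. A sketch, assuming a 2023-era langchain install (I have not tested this exact snippet):

from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=hg_pipeline)
print(llm("What's the difference between AWS and GCP?"))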

Thank you, Khalil. What exactly did you use for instance_type?

With your code, I am running into a new issue:

HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json

This is strange, as I could access the same model before with my own code. It looks like others have encountered this error too, but I haven’t found a way to resolve it yet. I will keep trying to get your code working to see what I get. Thanks.

Hello Sonali, I was using ml.g4dn.xlarge as an instance type.

Regarding the 401 Unauthorized error, I am not sure, to be honest, but did you log in to your Hugging Face account? To get the Llama 2 weights you have to submit a form here: Llama 2 - Meta AI. Then log in to Hugging Face using the same email you used in Meta’s form; they will send you an email saying they approved your request. After that, log in to your Hugging Face account from your terminal using this command: huggingface-cli login

It will ask you for an access token, which you can generate from the settings page of your Hugging Face account.
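
If you would rather log in from Python instead of the terminal (on Colab, for example), the huggingface_hub login helper should also work; a minimal sketch, with a placeholder token:

from huggingface_hub import login

login(token="hf_...")  # placeholder; paste the real access token from your settings page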

Hope that helps!