How to use a gated model for inference

Hi, I have obtained access to the Meta Llama 3 models, and I am trying to use one of them for inference with the sample code from the model card. When I run my inference script, it gives me the error 'Cannot access gated repo for url huggingface..../meta-llama.....config.json'.

So my question is: how can I access this model from my inference script? Do I need to pass an authentication/API token so that my script can access the model?

Also, how do I get the Hugging Face version of the model locally?

You could do it by making the code look like this.

Sample code

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Actual Code

import transformers
import torch

hf_token = "hf_*********"  # Never hard-code your token in code you share or upload!

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
    token=hf_token,  # pass your access token so the gated repo can be downloaded
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

You can create as many tokens as you want on your Hugging Face account settings page (https://huggingface.co/settings/tokens). Name your tokens in such a way that you can easily identify which is which.
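
If you would rather not hard-code the token at all, here is a minimal sketch that reads it from an environment variable instead (this assumes you export the token as HF_TOKEN before running the script):

import os
from huggingface_hub import login

# Read the access token from an environment variable instead of hard-coding it.
hf_token = os.environ.get("HF_TOKEN")

# Explicitly log in; after this, libraries such as transformers can access gated repos.
login(token=hf_token)

Once you are logged in, you can usually drop the token=hf_token argument from the pipeline call, or keep passing it explicitly as in the code above.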


Make sure to authenticate with huggingface-cli login in the terminal before running the script.
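
As an alternative to the CLI, and to answer the last part of the question about getting a local copy of the model, here is a minimal sketch using huggingface_hub (the local_dir path is just an example):

from huggingface_hub import login, snapshot_download

# Programmatic equivalent of `huggingface-cli login`.
login(token="hf_*********")

# Download the gated repo into a local folder; the returned path can be
# passed to transformers.pipeline(model=...) instead of the repo id.
local_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    local_dir="./Meta-Llama-3.1-8B-Instruct",  # example location
)
print(local_path)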
