Trying to understand system prompts with Llama 2 and transformers interface

I can’t get sensible results from Llama 2 with system prompt instructions using the transformers interface. Can somebody help me out here? I don’t understand what I’m doing wrong.

For the prompt I am following this format, as I saw in the documentation: “[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]”. As an exercise (yes, I realize using an LLM for this is complete overkill, it’s just an exercise) I have been attempting to instruct it to do sentiment analysis on the user prompt. So my system prompt is “Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"”, and the user prompt is just the text I want to analyze.

However the output just repeats the prompt back to me. That’s it.

I tried this in the chat interface of the “Llama 2 7B Chat” Hugging Face Space by huggingface-projects, setting the system prompt under Additional Inputs. And there it does exactly what I would expect it to. So there must be something wrong with how I’m actually using this through code, but I’m not sure what it is.

Here is my code:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto"
)

base_prompt = "<s>[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]"

def get_sentiment_llama(text):
    input = base_prompt.format(system_prompt = "Analyze the text in the content and evaluate the overall sentiment. Answer with just \"Positive\", \"Negative\", or \"Neutral\"",
                               user_prompt = text)
    print(input)
    
    sequences = pipeline(
        input,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
        return_full_text=False,
        temperature=0.5
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

And this is the output:

<s>[INST]
<<SYS>>
Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"
<</SYS>>

Ok, first assesment of the kindle2 ...it fucking rocks[/INST]
Result: 

[INST]
<<SYS>>
Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"
<</SYS>>

Ok, first assesment of the kindle2...it fucking rocks[/INST]

[INST]
<<SYS>>
Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"
<</SYS>>

Ok, first assesment

@dkettler this is how I got mine working:

<<SYS>>
You are a helpful Assistant, and you only respond as the "Assistant".
Remember, maintain a natural tone. Be precise, concise, and casual. Keep it short.\n
<</SYS>>
{conversation_history}\n\n
[INST]
User:{user_message}
[/INST]\n
Assistant:

And in your model call, pass the stop tokens:

model(prompt=prompt, max_tokens=120, stop=["[INST]", "None", "User:"])
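
If you are going through the transformers pipeline instead of calling the model directly, a rough equivalent is a custom stopping criterion. This is only a sketch (the StopOnStrings class is mine, and it assumes your transformers version forwards extra keyword arguments from the pipeline call to generate()):

from transformers import StoppingCriteria, StoppingCriteriaList

# Stop generation as soon as the decoded output ends with one of the stop strings
class StopOnStrings(StoppingCriteria):
    def __init__(self, stops, tokenizer):
        self.stops = stops
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores, **kwargs):
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return any(text.endswith(stop) for stop in self.stops)

stopping = StoppingCriteriaList([StopOnStrings(["[INST]", "User:"], tokenizer)])

sequences = pipeline(
    prompt,
    max_new_tokens=120,
    stopping_criteria=stopping,  # passed through to model.generate()
)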

You can also follow this blog post here


@ckandemir Thank you for your response, but I’m following the pattern from the “Llama 2 is here - get it on Hugging Face” blog post with the transformers.pipeline interface, and I’m not sure where I would add the stop option because I’m not instantiating the model directly.

And that blog post is exactly what I’ve been trying to follow. My prompt matches that format; it just doesn’t work…

I also found some examples of prompt-generating code in the TheBloke/Llama-2-13B-chat-GPTQ “Prompt format” discussion, but they don’t work for me either… so I don’t think it’s the prompt. But if there’s something else to change, the documentation doesn’t seem to say what it is…

@dkettler In that case you can do so by playing with the eos token, as mentioned here:

print(tokenizer.eos_token)

Run that to see what it is, then tweak your prompt and parameters accordingly.

If it still doesn’t work, another workaround is using regex on the output:

import re

# (This assumes get_sentiment_llama returns the generated text instead of just printing it)
input_text = get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

eos_token = tokenizer.eos_token

# Create a regex pattern to match the eos_token and everything that follows it
pattern = re.escape(eos_token) + '.*'

# Use re.sub() to replace the matched pattern with an empty string
# (DOTALL so '.*' also spans newlines)
cleaned_text = re.sub(pattern, '', input_text, flags=re.DOTALL)

@ckandemir My eos token is “</s>”. It’s not in my prompt, though, because according to the instructions at https://huggingface.co/blog/llama2#how-to-prompt-llama-2 it’s only used in a multi-turn conversation, which this isn’t. Should I still be including it?

Edit: Tried throwing it in there and now it seems to just return completely random text. But at least it returns something…

You can try to include it in your prompt to instruct the model to use that token at the end, or you can do post-processing with regex to manipulate the output.

I’m not sure I understand. You say post-processing, but there’s no sensible output to manipulate. If I don’t include the eos token I don’t get anything, and if I do include it the text appears to be completely random, as if the model is being asked to just generate something with no prompt at all.

@dkettler
You do not get any sensible output with the prompt format you have here:



base_prompt = "<s>[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]"

def get_sentiment_llama(text):
    input = base_prompt.format(system_prompt = "Analyze the text in the content and evaluate the overall sentiment. Answer with just \"Positive\", \"Negative\", or \"Neutral\"",
                               user_prompt = text)
    print(input)
    
    sequences = pipeline(
        input,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
        return_full_text=False,
        temperature=0.5
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

However, when I implement the following prompt format in the code and change some parameters in the pipeline:


def get_sentiment_llama(text):
    input = f"""
        <<SYS>>
        Analyze the text in the content and evaluate the overall sentiment for 'User'.\n
        And return your sentiment analysis only as : Positive, Negative, or Neutral \n
        <</SYS>>
        [INST]
        User:{text}
        [/INST]\n

        Assistant:
    """
    print(input)
    
    sequences = pipeline(
        input,
        do_sample=True,
        top_k=50,
        num_return_sequences=2,
        max_new_tokens=50,
        return_full_text=False,
        temperature=.9,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id

        )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

I get the following results:

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

        <<SYS>>
        Analyze the text in the content and evaluate the overall sentiment for 'User'.

        And return your sentiment analysis only as : Positive, Negative, or Neutral 

        <</SYS>>
        [INST]
        User:Ok, first assesment of the kindle2 ...it fucking rocks
        [/INST]


        Assistant:
    
Result: 
        Sentiment: Positive

    """
    def __init__(self):
        super().__init__()
        self.sentiment_parser = SentimentParser(model_path=SENTIMENT_PATH
Result: 
        <</SYS>>
        [INST]
        User:It fucking rocks!!
        [/INST]


        Analyzed Sentiment:

        <</SYS>>
        Positive

From here on you can perform post-processing on your output to cut it however you want.

So basically, by playing with your prompt and parameters you can get the model to respond in the desired format, and then you can post-process the output.
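
For example, a minimal post-processing sketch (just an illustration, assuming the output contains one of the three labels from your system prompt):

import re

def extract_sentiment(generated_text):
    # Pull out the first occurrence of one of the three expected labels
    match = re.search(r"\b(Positive|Negative|Neutral)\b", generated_text)
    return match.group(1) if match else None

print(extract_sentiment("Sentiment: Positive"))  # -> Positive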

You can refer to this documentation for parameters you can use

I can see that you are using “meta-llama/Llama-2-7b-hf” here. I think you need to go with “meta-llama/Llama-2-7b-chat-hf” instead, as that one is fine-tuned for chat/dialogue. That should give you sensible outputs.
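A minimal sketch of that change, keeping everything else from your original snippet:

from transformers import AutoTokenizer
import transformers
import torch

# Same setup as before, but with the chat-tuned checkpoint, which is the
# variant trained on the [INST]/<<SYS>> dialogue format
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto"
)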

I’m using the Llama-2-7b-hf model and I’m not really sure about the prompting. The idea is to only get an answer to the last question with the help of the examples given to the model (in this case one example). Do you think I can prompt the Llama model like this, or should I use tags like [INST] and <<SYS>>?

Here is my code:

prompt = """

Please answer the question according to the context and candidate answers. Each candidate answer is associated with a confidence score within a bracket. The true answer may not be included in the candidate answers.

===

Context: Horses all running towards something while men stand around.

===

Question: Which game is played using these animals?

===

Candidates: race(0.98), polo(0.72), run(0.09), horse race(0.07), golf(0.04), ride(0.02), rugby(0.01), hunt(0.01), rodeo(0.01), farm(0.00)

===

Answer: race

===

Context: a black motorcycle parked in a parking lot.

===

Question: What sport can you use this for?

===

Candidates: race(0.53), motorcycle(0.41), motocross(0.19), bike(0.17), motorcross(0.15), cycling(0.11), dirt bike(0.10), ride(0.08), bicycling(0.01), bicycle(0.01)

===

Answer:

"""

def llama_infer(prompt_text):
    # Tokenize the input prompt text
    input_ids = tokenizer.encode(prompt_text, return_tensors='pt')

    # Move input and model to the appropriate device (GPU or CPU)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    input_ids = input_ids.to(device)
    model.to(device)
    model.eval()

    # Forward pass for the logits (note: these are not used below)
    with torch.no_grad():
        outputs = model(input_ids)
        logits = outputs.logits

    # Generate the output normally with the full input sequence
    output = model.generate(input_ids,
                            max_new_tokens=8,
                            num_return_sequences=1,
                            do_sample=True,  # needed for temperature/top_p to take effect
                            temperature=0.6,
                            top_p=0.9,
                            no_repeat_ngram_size=2)

    # Decode the generated ids to a string and strip the echoed prompt
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    response_text = generated_text[len(prompt_text):]

    return response_text
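
(For completeness, a call on the prompt above would look something like this, assuming model and tokenizer are already loaded for the Llama-2-7b-hf checkpoint:)

answer = llama_infer(prompt)
print(answer)  # should be just the continuation after the final "Answer:"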