Trying to understand system prompts with Llama 2 and transformers interface

I can’t get sensible results from Llama 2 with system prompt instructions using the transformers interface. Can somebody help me out here? I don’t understand what I’m doing wrong.

For the prompt I am following this format, as I saw in the documentation: “[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]”. As an exercise (yes, I realize using an LLM for this is complete overkill; it’s just an exercise) I have been attempting to instruct it to do sentiment analysis on the user prompt. So my system prompt is “Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"” and the user prompt is just the text I want to analyze.

However, the output just repeats the prompt back to me. That’s it.

I tried this in the chat interface at Llama 2 7B Chat - a Hugging Face Space by huggingface-projects, setting the system prompt under additional inputs. And there it does exactly what I would expect it to. So there must be something wrong with how I’m actually using this through code, but I’m not sure what it is.

Here is my code:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto"
)

base_prompt = "<s>[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]"

def get_sentiment_llama(text):
    input = base_prompt.format(system_prompt = "Analyze the text in the content and evaluate the overall sentiment. Answer with just \"Positive\", \"Negative\", or \"Neutral\"",
                               user_prompt = text)
    print(input)
    
    sequences = pipeline(
        input,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
        return_full_text=False,
        temperature=0.5
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

And this is the output:

<s>[INST]
<<SYS>>
Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"
<</SYS>>

Ok, first assesment of the kindle2 ...it fucking rocks[/INST]
Result: 

[INST]
<<SYS>>
Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"
<</SYS>>

Ok, first assesment of the kindle2...it fucking rocks[/INST]

[INST]
<<SYS>>
Analyze the text in the content and evaluate the overall sentiment. Answer with just "Positive", "Negative", or "Neutral"
<</SYS>>

Ok, first assesment

@dkettler this is how I got mine working:

<<SYS>>
You are a helpful Assistant, and you only respond as the "Assistant"
Remember, maintain a natural tone. Be precise, concise, and casual. Keep it short\n
<</SYS>>
{conversation_history}\n\n
[INST]
User:{user_message}
[/INST]\n
Assistant:

And in your model pass the stop tokens

model(prompt=prompt, max_tokens=120, stop=["[INST]", "None", "User:"])

You can also follow this blog post here

@ckandemir Thank you for your response, but I’m following the pattern at Llama 2 is here - get it on Hugging Face with the transformers.pipeline interface, and I’m not sure where I would add the stop option because I’m not instantiating the model directly.

And that blog post is exactly what I’ve been trying to follow. My prompt matches that format, it just doesn’t work…

I also found some examples of prompt generating code at TheBloke/Llama-2-13B-chat-GPTQ · Prompt format but they don’t work for me either…so I don’t think it’s the prompt. But if there’s something else to change the documentation doesn’t seem to say what it is…

@dkettler In that case you can do so by playing with the eos token, as mentioned here. Run

print(tokenizer.eos_token)

to see what it is, then tweak your prompt and parameters accordingly.
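
If you specifically want stop strings such as "[INST]" or "User:" while going through the pipeline (since you are not calling the model directly), one possible workaround, untested against your exact setup, is a custom StoppingCriteria passed through the generate kwargs. A rough sketch, reusing the tokenizer, pipeline, and input variables from your code (StopOnStrings and prompt_len are just illustrative names):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    def __init__(self, stop_strings, tokenizer, prompt_len):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip when checking

    def __call__(self, input_ids, scores, **kwargs):
        # Only check the text generated after the prompt, otherwise the "[INST]"
        # already present in the prompt would stop generation immediately.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:], skip_special_tokens=True)
        return any(s in new_text for s in self.stop_strings)

prompt_len = len(tokenizer(input)["input_ids"])
stopping = StoppingCriteriaList([StopOnStrings(["[INST]", "User:"], tokenizer, prompt_len)])

sequences = pipeline(
    input,
    max_new_tokens=50,
    return_full_text=False,
    stopping_criteria=stopping,  # extra kwargs are forwarded to model.generate()
)

The stop string itself will still appear at the end of the generated text, so you may still want to trim it off afterwards.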

If it still doesn’t work, another workaround is using regex on the output:

import re

# Note: this assumes get_sentiment_llama returns the generated text instead of just printing it
input_text = get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

eos_token = tokenizer.eos_token

# Create a regex pattern to match the eos_token and everything that follows it
pattern = re.escape(eos_token) + '.*'

# Use re.sub() to replace the matched pattern with an empty string
cleaned_text = re.sub(pattern, '', input_text)

@ckandemir My eos token is “</s>”. It’s not in my prompt though because at least according to the instructions at https://huggingface.co/blog/llama2#how-to-prompt-llama-2 that’s only used in a multi-turn conversation, which this isn’t. Should I still be including it?
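
For reference, my reading of the multi-turn format in that blog post is that </s> only closes a completed model answer, so a multi-turn prompt would look roughly like this (the names in braces are just placeholders):

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_msg_1} [/INST] {model_answer_1} </s><s>[INST] {user_msg_2} [/INST]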

Edit: Tried throwing it in there and now it seems to just return completely random text. But at least it returns something…

You can try to include it in your prompt to instruct the model to use that token at the end, or you can do post-processing with regex to manipulate the output.

I’m not sure I understand. You say post-processing, but there’s no sensible output to manipulate. If I don’t include the eos token I don’t get anything, and if I do include it the text appears to be completely random, as if the model is being asked to just generate something with no prompt at all.

@dkettler
You do not get any sensible output if you use the prompt format you have here:

base_prompt = "<s>[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]"

def get_sentiment_llama(text):
    input = base_prompt.format(system_prompt = "Analyze the text in the content and evaluate the overall sentiment. Answer with just \"Positive\", \"Negative\", or \"Neutral\"",
                               user_prompt = text)
    print(input)
    
    sequences = pipeline(
        input,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
        return_full_text=False,
        temperature=0.5
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

However, when I use the following prompt format in the code and change some parameters in the pipeline:

def get_sentiment_llama(text):
    input = f"""
        <<SYS>>
        Analyze the text in the content and evaluate the overall sentiment for 'User'.\n
        And return your sentiment analysis only as : Positive, Negative, or Neutral \n
        <</SYS>>
        [INST]
        User:{text}
        [/INST]\n

        Assistant:
    """
    print(input)
    
    sequences = pipeline(
        input,
        do_sample=True,
        top_k=50,
        num_return_sequences=2,
        max_new_tokens=50,
        return_full_text=False,
        temperature=.9,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

get_sentiment_llama("Ok, first assesment of the kindle2 ...it fucking rocks")

I get the following results:

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

        <<SYS>>
        Analyze the text in the content and evaluate the overall sentiment for 'User'.

        And return your sentiment analysis only as : Positive, Negative, or Neutral 

        <</SYS>>
        [INST]
        User:Ok, first assesment of the kindle2 ...it fucking rocks
        [/INST]


        Assistant:
    
Result: 
        Sentiment: Positive

    """
    def __init__(self):
        super().__init__()
        self.sentiment_parser = SentimentParser(model_path=SENTIMENT_PATH
Result: 
        <</SYS>>
        [INST]
        User:It fucking rocks!!
        [/INST]


        Analyzed Sentiment:

        <</SYS>>
        Positive

From here on you can perform post-processing on your output to trim it however you want.

So basically, by playing with your prompt and parameters you can get the model to respond in the desired format, and then you can perform post-processing on the output.
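
For example, a minimal sketch of that post-processing step (extract_sentiment is just an illustrative name, and the label set is taken from your system prompt):

import re

def extract_sentiment(generated_text):
    # Pull the first sentiment label out of whatever the model generated.
    match = re.search(r"\b(Positive|Negative|Neutral)\b", generated_text)
    return match.group(1) if match else None

for seq in sequences:
    print(f"Result: {extract_sentiment(seq['generated_text'])}")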

You can refer to this documentation for the parameters you can use.

I can see that you are using “meta-llama/Llama-2-7b-hf” here. I think you need to go with “meta-llama/Llama-2-7b-chat-hf” instead, as that one is fine-tuned for chat/dialogue. That should give you sensible outputs.
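
In your script that should just be a one-line change, with everything else left as is:

model = "meta-llama/Llama-2-7b-chat-hf"  # chat fine-tuned variant instead of the base model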