Different Summary Outputs Locally vs API for the Same Text

Hi Team,

Whilst using the Inference API to produce summaries of calls with a private model, I sometimes get different outputs compared to when I load the model and tokeniser locally, even though I'm using the exact same parameters.

To generate the summary locally I run:

from torch import cuda
# model and tokenizer already loaded for the private model (e.g. via AutoTokenizer / AutoModelForSeq2SeqLM)
device = 'cuda' if cuda.is_available() else 'cpu'
model = model.to(device)
inputs = tokenizer(txt, return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'].to(device), no_repeat_ngram_size=2, max_length=75, top_k=50, top_p=0.95, early_stopping=True)
summary_3 = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]

To generate the summary using the pipeline / Inference API I run:

output = query({
    "inputs": txt,
    "parameters": {
        "max_length": 75,
        "no_repeat_ngram_size": 2,
        "early_stopping": True,
        "top_k": 50,
        "top_p": 0.95,
    },
})

output[0]['summary_text']
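
where query is essentially the request helper from the Inference API docs, roughly:

import requests

API_URL = "https://api-inference.huggingface.co/models/kaizan/production-bart-large-cnn-samsum"
headers = {"Authorization": f"Bearer {TOKEN}"}

def query(payload):
    # POST the payload to the hosted Inference API and return the parsed JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()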

or

summarizer = pipeline("summarization",  model="kaizan/production-bart-large-cnn-samsum", use_auth_token=TOKEN)
output = summarizer(txt, max_length=75, no_repeat_ngram_size = 2, top_k = 50, top_p = 0.95, early_stopping = True)
output[0]['summary_text']

In the pipeline / Inference API case I get exactly the same output, but when I run it manually I get a different output. This makes me think there is a variable or seed value set somewhere in the pipeline that I'm not handling when running this setup manually. Does anyone know what variable could be causing this difference?

Thanks,

Karim

Hey Karim,

thanks for opening the thread. When you say "Inference API", are you talking about the hosted Inference API (Overview — Api inference documentation) or did you deploy a model to Amazon SageMaker?

Hey @philschmid, thanks for your response. I've experienced this issue with the hosted Inference API, but I suspect it might be the same with Amazon SageMaker as well; they both use the pipeline, if I'm not mistaken?

Pinging @Narsil since he is the master of the Inference API and might know more.


Hi @kmfoda ,

Do you mind sharing why you want to achieve this? It might help us understand the underlying issue. In general, getting 1-1 outputs on non-deterministic generation is hard (especially on GPU, where floating-point errors can creep in too). Internally we check for 1-1 on 1-step generation, but for long generations drift can happen, so we have to focus on the core issue, which is: is the summarization good? (i.e. does it contain the key elements we expect).

The pipeline (and therefore the API) makes no attempt to control the underlying seed, and it can be in an arbitrary state (since the API is a long-running job and might run other code at any given time).
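
If you want to rule the RNG state out on your side, a minimal sketch (the seed value is arbitrary):

from transformers import set_seed

# Pin the python / numpy / torch RNGs before generating, so successive
# manual runs start from the same RNG state.
set_seed(42)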

The only causes of unaccounted-for variance I can think of are model.eval(), which disables any kind of dropout/batch-norm issues, and the fact that the pipeline also runs generation under with torch.inference_mode(), which deactivates gradient calculation at least (I have no clue if this impacts the RNG state).
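
A rough sketch of mirroring those two settings in your manual setup (reusing the model, inputs and device from your snippet; this is not guaranteed to change the output):

import torch

model.eval()  # make sure dropout / batch norm run in inference behaviour
with torch.inference_mode():  # disable gradient tracking, as the pipeline does
    summary_ids = model.generate(
        inputs["input_ids"].to(device),
        no_repeat_ngram_size=2,
        max_length=75,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
    )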

Does this help?

Cheers,
Nicolas

Thanks @Narsil. I'm currently using @philschmid's amazing model via the Inference API to summarise transcripts.

A number of errors in the output were highlighted to me by my team, where the output was hallucinating a 3rd person in a 2-person call, so to debug this I was running the model manually in a Colab notebook using exactly the same parameters. I realised that I was not getting the error in the output when I load the model manually, whereas when I run it in the pipeline I do get the error. I understand that the models generate in a non-deterministic way, but what's confusing to me is that the outputs from the model loaded manually or via the pipeline never change. They're always exactly the same (re-ran 20 times), and the pipeline output always has the hallucination whilst the manual one does not. I'm looking to understand what causes the difference in output and whether there's a variable I had used in my experiments that's overwritten in the API. I tried with and without model.eval() and with torch.inference_mode(), and the manually loaded model's output didn't change.

I can’t share the data but I can mask it and share a redacted version here if that helps:

Manual model loading output (Plausible summary):

Nicolas and Phil are going to give their team access to the AI summarizing some calls next week.

Pipeline model loading output (Improbable summary):

Nicolas, Phil and HuggingFace are going to give Nicolas access to the AI summarizing some calls next week.

Hi @kmfoda ,

You got my curiosity, so I tried to reproduce on a small example.
The issue (at least in my simple test case) is linked to model.config.prefix, which is set to " ". By default the pipeline prefixes the prompt with this prefix (some models require custom prefix prompts, so these can be put in the config to avoid tedious prompting every time).
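
You can inspect it directly on the config, something like (with the same TOKEN as in the script below):

from transformers import AutoConfig

# Load the model config and look at the generation prefix;
# repr() makes the single-space prefix visible.
config = AutoConfig.from_pretrained("kaizan/production-bart-large-cnn-samsum", use_auth_token=TOKEN)
print(repr(config.prefix))  # ' '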

@philschmid Is that prefix intended?

import torch
import os
from transformers import pipeline

TOKEN = os.getenv("HF_API_TOKEN")

txt = "Nicolas and Phil are going to give their team access to the AI summarizing some calls next week."
summarizer = pipeline("summarization", model="kaizan/production-bart-large-cnn-samsum", use_auth_token=TOKEN, device=0)


def naive():
    # Manual generation, matching the local setup from the original post (no prefix added).
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = summarizer.tokenizer
    model = summarizer.model

    inputs = tokenizer(txt, return_tensors="pt")
    summary_ids = model.generate(
        inputs["input_ids"].to(device),
        no_repeat_ngram_size=2,
        max_length=75,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
    )
    print("=" * 10)
    summary_3 = tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(summary_3)
    print("=" * 10)


def naive_prefix():
    # Same manual generation, but with the " " prefix from model.config.prefix prepended.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = summarizer.tokenizer
    model = summarizer.model

    inputs = tokenizer(" " + txt, return_tensors="pt")
    summary_ids = model.generate(
        inputs["input_ids"].to(device),
        no_repeat_ngram_size=2,
        max_length=75,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
    )
    print("+" * 10)
    summary_3 = tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(summary_3)
    print("+" * 10)


def pipe():
    # Generation through the summarization pipeline, which applies the config prefix itself.
    summary_3 = summarizer(
        txt,
        no_repeat_ngram_size=2,
        max_length=75,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
    )
    print("-" * 10)
    print(summary_3[0]["summary_text"])
    print("-" * 10)


naive()
naive_prefix()
pipe()

Thanks so much for that analysis @Narsil. Great spot! I get the same output now if I add the " " prefix. I've found, though, that on average the output seems better without the prefix. I might remove it for my specific use case, but I'm just wondering @philschmid whether there is any rationale as to why we would use the prefix?
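
In case it's useful, the way I'm planning to drop it for my use case (assuming the pipeline reads the prefix from model.config at preprocessing time) is just to clear it on the loaded pipeline:

from transformers import pipeline

summarizer = pipeline("summarization", model="kaizan/production-bart-large-cnn-samsum", use_auth_token=TOKEN)
# Clear the " " prefix so the pipeline tokenises the raw text,
# matching the manually loaded model above.
summarizer.model.config.prefix = ""
output = summarizer(txt, max_length=75, no_repeat_ngram_size=2, top_k=50, top_p=0.95, early_stopping=True)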