Inconsistent BLEU score between test_metric.metrics['test_bleu'] and the predictions written to file from test_metric.predictions

I got a BLEU score of about 11 and wanted to do some error analysis, so I saved the predictions to a file. When I read the predictions, I felt the BLEU score should be much lower than 11, because most tokens in the references are missing from the predictions. So I computed the BLEU score directly by feeding the predictions file and the references file to sacrebleu (the package used as the metric in the training program), and that score is about 2. The predictions and references files are both formatted as one sentence per line, and each predicted sentence has exactly one reference.

Relevant code snippets are attached below:


import numpy as np
import sacrebleu
from datasets import load_metric

metric = load_metric("sacrebleu")

#----------------------------------------------------------#
# Define compute_metrics for trainer
#----------------------------------------------------------#

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    # wrap each label in its own one-element list: the metric expects one list of references per prediction
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

#----------------------------------------------------------#
# Calculate metric for test dataset, get bleu score, and save predictions to file
#----------------------------------------------------------#
test_metric = trainer.predict(test_dataset=tokenized_datasets['test'], metric_key_prefix='test', num_beams=6)
print(test_metric.metrics['test_bleu'])  # prints about 11

detokenized_predictions = tokenizer.batch_decode(test_metric.predictions, skip_special_tokens=True)
with open(path_predictions_file, 'w') as outfile:
    s = '\n'.join(detokenized_predictions)
    outfile.write(s)

#----------------------------------------------------------#
# load previously-saved predictions files and references file to calculate bleu score
#----------------------------------------------------------#
predictions = []
with open(path_predictions_file) as prediction_infile:
    for sentence in prediction_infile:
        predictions.append(sentence.strip())
references = []
with open(path_references_file) as reference_infile:
    for sentence in reference_infile:
        references.append(sentence.strip())

bleu = sacrebleu.corpus_bleu(predictions, [references])
print(bleu.format(score_only=True))  # prints about 2
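
For completeness, one more check that uses only the variables already defined above (just a sketch): score the in-memory detokenized predictions against the same file-loaded references. If that number matches the file-based score, the write/read round-trip is not where the gap comes from.

# Sketch: rule out the file round-trip by scoring the in-memory decoded predictions.
in_memory_predictions = [pred.strip() for pred in detokenized_predictions]
bleu_in_memory = sacrebleu.corpus_bleu(in_memory_predictions, [references])
print(bleu_in_memory.score)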

Thank you very much for reading! I really appreciate any suggestions :hugs:

The post-processing in the first example is not applied to the predictions written to disk.

Hello @sgugger, thank you very much for your response :hugs:

I apply strip() when loading the predictions and references back (as below), which should have the same effect as passing them through postprocess_text() before writing to file.

predictions.append(sentence.strip())
references.append(sentence.strip())
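
For reference, the equivalent fix on the writing side (a sketch reusing the variable names from above) would be to strip the predictions the same way postprocess_text() does before writing them out:

stripped_predictions = [pred.strip() for pred in detokenized_predictions]  # same .strip() as postprocess_text()
with open(path_predictions_file, 'w') as outfile:
    outfile.write('\n'.join(stripped_predictions))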

There is one thing I am still confused about that I would like to bring up. compute_metrics() handles references differently from the way native sacrebleu.corpus_bleu() does. (I have one reference sentence per prediction sentence.) compute_metrics() takes references in the format [[ref1], [ref2], [ref3], …] (one single-element list per prediction), while sacrebleu.corpus_bleu() takes references in the format [[ref1, ref2, ref3]] (one list containing the reference for every prediction). I thought this was why they gave different BLEU scores. However, when I looked at the source code of datasets/metrics/sacrebleu/sacrebleu.py, I found that, after the transformation below, both end up in the format [[ref1, ref2, ref3]] by the time they are fed into sacrebleu.corpus_bleu().

references_per_prediction = len(references[0])
transformed_references = [[refs[i] for refs in references] for i in range(references_per_prediction)]
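
To convince myself, here is a small self-contained check with made-up sentences (a sketch, not data from my run) showing that, for the single-reference case, the two formats end up giving the same corpus BLEU:

import sacrebleu

preds = ["the cat sat on the mat", "a quick brown fox"]
refs = ["the cat sat on the mat", "the quick brown fox"]

# Native sacrebleu format: one reference stream covering all predictions.
score_native = sacrebleu.corpus_bleu(preds, [refs]).score

# datasets-metric format: one single-element list of references per prediction,
# which the metric transposes back into the native format before calling corpus_bleu.
metric_style_refs = [[ref] for ref in refs]
references_per_prediction = len(metric_style_refs[0])
transformed_references = [[r[i] for r in metric_style_refs] for i in range(references_per_prediction)]
score_transposed = sacrebleu.corpus_bleu(preds, transformed_references).score

print(score_native, score_transposed)  # the two numbers are identical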

So I am confused again and have to keep trying to find where this inconsistency comes from…

Please let me know if you have any thoughts. Thank you very much once again! :hugs:

I am not exactly sure whether it is due to adding tokens to the tokenizer, but when I comment out the code below that adds tokens and resizes the embeddings, the two BLEU scores match. I am still trying to debug this. Any suggestions are appreciated :hugs:


import sentencepiece as spm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-fr-ro')
...
tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-fr-ro', padding='max_length')

# Train a SentencePiece model on the dataset; it writes .model and .vocab files to disk.
spm.SentencePieceTrainer.train(input=dataset_path, model_prefix=dataset_name, vocab_size=1000)

# Load the generated .vocab file and extract the tokens (first column of each tab-separated line).
with open(dataset_name + '.vocab') as infile:
    vocab_str = infile.read()
vocab_list = vocab_str.split('\n')
vocab_list = list(map(lambda x: x.split('\t')[0], vocab_list))  # extract tokens
vocab_list.remove('')  # remove the empty string left by the trailing newline
tokenizer.add_tokens(vocab_list)
model.resize_token_embeddings(len(tokenizer.get_vocab()))
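
One thing I plan to check next (a sketch, assuming the tokenizer and test_metric from above are still in scope): whether the generated IDs actually fall in the newly added token range and how they decode, since tokens added with add_tokens() live in the added-tokens vocabulary rather than in the original SentencePiece model.

import numpy as np

# tokenizer.vocab_size is the size of the base vocabulary, without added tokens.
original_vocab_size = tokenizer.vocab_size

preds = test_metric.predictions
if isinstance(preds, tuple):
    preds = preds[0]

# Count how many generated IDs come from the newly added tokens.
added_id_count = int(np.sum(preds >= original_vocab_size))
print('generated ids coming from added tokens:', added_id_count)

# Spot-check how the first few predictions decode.
for row in preds[:3]:
    print(tokenizer.decode(row, skip_special_tokens=True))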