BERTScore with multiple references

Hi,

I’m using the BERTScore implementation from evaluate with multiple references, and I’d like to understand how it handles them: if there are 2 references, does it average the score across them? It seems this is not the case, based on the following (failing) test:

import evaluate
import numpy as np


def test_bert_score_implementations():
    references = [['hi how are you', 'what are you doing']]
    predictions = ['hi how do you do']
    bert_score_metric = evaluate.load("bertscore")

    # Check that computing with both references at once gives the same result
    # as averaging two single-reference computations.
    predicted_result = bert_score_metric.compute(predictions=predictions, references=references, lang='en')['f1'][0]
    expected_result = (bert_score_metric.compute(predictions=predictions, references=[[references[0][0]]], lang='en')['f1'][0] +
                       bert_score_metric.compute(predictions=predictions, references=[[references[0][1]]], lang='en')['f1'][0]) / 2

    # The following assertion fails.
    assert np.isclose(
        predicted_result,  # 0.9408445954322815
        expected_result    # 0.9233997464179993
    )

So I’m wondering how the implementation arrives at a single value when multiple references are provided, if not by taking an arithmetic average.

Hi!

In the BERTScore implementation, when you provide multiple references for a prediction, the metric computes the score against each reference individually and keeps the best one, i.e. the highest F1, as the final result. It does not take an arithmetic average across the references. You can see this in your own numbers: the multi-reference result (0.9408445954322815) is exactly the F1 against the first (higher-scoring) reference, not the mean of the two (0.9233997464179993), which is why your assertion fails.

To reproduce this behaviour, compute the F1 score for each reference separately, take the maximum, and compare it with what the multi-reference call returns:

import numpy as np
import evaluate

references = [['hi how are you', 'what are you doing']]
predictions = ['hi how do you do']
bert_score_metric = evaluate.load("bertscore")

# F1 against each reference on its own.
per_reference_f1 = []
for ref in references[0]:
    f1 = bert_score_metric.compute(predictions=predictions, references=[[ref]], lang='en')['f1'][0]
    per_reference_f1.append(f1)

# F1 with both references passed in a single call.
multi_reference_f1 = bert_score_metric.compute(predictions=predictions, references=references, lang='en')['f1'][0]

# The multi-reference score equals the best single-reference score, not the mean.
assert np.isclose(
    multi_reference_f1,    # 0.9408445954322815
    max(per_reference_f1)  # 0.9408445954322815
)

The maximum of the per-reference F1 scores matches the multi-reference result, which is consistent with BERTScore selecting the highest F1 among the individual references rather than averaging them.
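
If you want to double-check this outside of evaluate, you can also call the bert-score package that the metric builds on. Here is a minimal sketch, assuming bert-score is installed; as far as I know its score() function accepts one list of references per candidate, so treat the multi-reference call as an assumption to verify:

import bert_score

predictions = ['hi how do you do']
references = [['hi how are you', 'what are you doing']]

# Multi-reference call: one list of references per prediction (assumed supported).
_, _, f1_multi = bert_score.score(predictions, references, lang='en')

# Single-reference calls, one per reference.
_, _, f1_first = bert_score.score(predictions, [references[0][0]], lang='en')
_, _, f1_second = bert_score.score(predictions, [references[0][1]], lang='en')

# If the max-selection behaviour described above holds, the multi-reference F1
# should equal the best single-reference F1, not the average of the two.
print(float(f1_multi[0]), max(float(f1_first[0]), float(f1_second[0])))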
