Hi,

I’m using the BERTScore implementation from the `evaluate` library with multiple references, and I’d like to understand how it works: if there are 2 references, does it average the score across them? It seems this is not the case, based on the following (failing) test:

```
import evaluate
import numpy as np


def test_bert_score_implementations():
    references = [['hi how are you', 'what are you doing']]
    predictions = ['hi how do you do']
    bert_score_metric = evaluate.load("bertscore")
    # Check that compute() with the full references list matches the
    # arithmetic mean of calling it once per reference.
    predicted_result = bert_score_metric.compute(
        predictions=predictions, references=references, lang='en'
    )['f1'][0]
    expected_result = (
        bert_score_metric.compute(
            predictions=predictions, references=[[references[0][0]]], lang='en'
        )['f1'][0]
        + bert_score_metric.compute(
            predictions=predictions, references=[[references[0][1]]], lang='en'
        )['f1'][0]
    ) / 2
    # The following assertion fails
    assert np.isclose(
        predicted_result,  # 0.9408445954322815
        expected_result,   # 0.9233997464179993
    )
```

I’m wondering, then, how the implementation arrives at a single value when given multiple references, if not by an arithmetic average.
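For what it’s worth, the numbers look consistent with taking the best (max) score over the references rather than the mean, which I believe is how the underlying bert-score package handles multiple references. A quick arithmetic sketch (the per-reference F1 values below are inferred from the test output, not measured directly: one would have to equal the multi-reference result if max is used, and the other then follows from the known mean):

```python
# Inferred per-reference F1 scores (illustrative, not actual BERTScore outputs).
per_ref_f1 = [0.9408, 0.9060]

mean_f1 = sum(per_ref_f1) / len(per_ref_f1)
max_f1 = max(per_ref_f1)

print(round(mean_f1, 4))  # 0.9234, matches expected_result (the mean)
print(round(max_f1, 4))   # 0.9408, matches predicted_result (multi-ref compute)
```

So the mean reproduces `expected_result` while the max reproduces `predicted_result`, but I’d appreciate confirmation of what the implementation actually does.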