Question answering evaluation strategy

Hi there, I’m a beginner at Hugging Face (and the ML field in general). I took it up as a challenge to learn, and I’m focusing on question answering models for now.

Now, the training part has been straightforward so far: I take my dataset, tokenize it via the map function, embedding the start and end token positions of the correct answer in the process, so my model can compute the loss that I use for backpropagation. Done.
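For context, my training tokenization goes roughly along these lines (a simplified sketch in the spirit of the official course, not my exact code; I’m treating start_positions as the answer’s character start and text as the answer string, the same columns I use in the validation function further down, so the exact names and indexing are placeholders):

def tokenize_question_answering_train(element):
  questions = [q.strip() for q in element["question"]]
  inputs = tokenizer(
    questions,
    element["context"],
    max_length=None,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=128,
    padding="max_length"
  )

  offset_mapping = inputs.pop("offset_mapping")
  sample_mapping = inputs.pop("overflow_to_sample_mapping")
  start_positions, end_positions = [], []

  for i, offsets in enumerate(offset_mapping):
    sample_idx = sample_mapping[i]
    # character span of the gold answer in the original context (placeholder columns)
    answer_start = element["start_positions"][sample_idx]
    answer_end = answer_start + len(element["text"][sample_idx])

    # find where the context tokens begin and end in this feature
    sequence_ids = inputs.sequence_ids(i)
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    if offsets[context_start][0] > answer_start or offsets[context_end][1] < answer_end:
      # the answer is not fully inside this feature, so label it (0, 0)
      start_positions.append(0)
      end_positions.append(0)
    else:
      # otherwise walk the offsets to find the start and end token indices
      idx = context_start
      while idx <= context_end and offsets[idx][0] <= answer_start:
        idx += 1
      start_positions.append(idx - 1)

      idx = context_end
      while idx >= context_start and offsets[idx][1] >= answer_end:
        idx -= 1
      end_positions.append(idx + 1)

  inputs["start_positions"] = start_positions
  inputs["end_positions"] = end_positions
  return inputs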

But I am confused when it comes to evaluation: there I don’t need to compute start and end positions, I just need three things for my model (input_ids, token_type_ids and attention_mask), all of which I get by simply passing my inputs to the tokenizer.
Now, when I looked up a metric to evaluate with, I found this one: SQuAD v2 - a Hugging Face Space by evaluate-metric

I followed the instructions; they make it clear what format the references need to be in, and I’ve handled all of that in my tokenization function:

def tokenize_question_answering_validation(element):
  question = [q.strip() for q in element["question"]]
  inputs = tokenizer(
    question,
    element["context"],
    max_length=None,  # automatically determined by the model
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=128,
    padding="max_length"
  )

  inputs["additional_params"] = []  # this is the "references" list
  overflow_to_sample_mapping = inputs.pop("overflow_to_sample_mapping")
  # since I can have multiple features for one sample
  # (due to the context being longer than my model supports),
  # I have to use the overflow mapping to get back to the original sample
  for index, value in enumerate(overflow_to_sample_mapping):
    # this is the "references" format I found on the link; I append it to my
    # inputs and return everything together, afterwards I pop it out into a
    # global variable (I know, that's bad) and remove the column from my
    # tokenized dataset (because otherwise it throws an error later on for
    # obvious reasons - can't convert to tensor)
    inputs["additional_params"].append({
      "answers": {
        "answer_start": element["start_positions"][value],
        "text": element["text"][value]
      },
      "id": element["id"][value]
    })
  return inputs
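Then I apply it with map and pull the references out afterwards, roughly like this (a sketch; raw_validation_dataset is just a placeholder name for my validation split):

validation_features = raw_validation_dataset.map(
  tokenize_question_answering_validation,
  batched=True,
  remove_columns=raw_validation_dataset.column_names
)

# pop the references out into a variable (the "global" I mentioned in the comments),
# then drop the column so the rest of the dataset can be converted to tensors
additional_params_val = validation_features["additional_params"]
validation_features = validation_features.remove_columns(["additional_params"])
validation_features.set_format("torch")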

Okay, so that part is done. Now, to compute the metric, all I need is the predictions from my model, right?
Since we’re doing batches, I figure that for each batch I’d have as many logit “arrays” as there are rows in the batch (let’s say 4, for both the start and the end logits), and each of those arrays holds as many logits (floats) as my model’s maximum sequence length (512 for example, which I believe is the default for BERT base).
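Just to make sure I’m picturing the shapes right, this is what I’d expect to see (assuming a batch size of 4 and a padded length of 512):

import torch

batch = next(iter(eval_dataloader))
with torch.no_grad():
  outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    token_type_ids=batch["token_type_ids"]
  )
print(outputs.start_logits.shape)  # torch.Size([4, 512]) - one row of logits per example in the batch
print(outputs.end_logits.shape)    # torch.Size([4, 512])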
My initial thought was to do something along these lines:

  for step, batch in enumerate(eval_dataloader):  # let's imagine I passed my dataloader thru a func.
    with torch.no_grad():
      outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        token_type_ids=batch["token_type_ids"]
      )
    # I gather my logits (start + end) and
    # find the index of the largest logit in each "array"
    start_logits = accelerator.gather(outputs.start_logits).argmax(dim=-1)
    end_logits = accelerator.gather(outputs.end_logits).argmax(dim=-1)
    for i in range(len(start_logits)):
      # then, using those indices, I decode the answer my model spat out
      # to get the text and append it to my predictions, which I use later on
      solution = tokenizer.decode(batch["input_ids"][i][start_logits[i].item():end_logits[i].item() + 1])
      predictions.append({
        "prediction_text": solution,
        # "start" is a running counter over all features, kept outside this loop
        "id": additional_params_val[start]["id"] if isCalledFromTraining else additional_params_test[start]["id"],
        "no_answer_probability": 0.
      })
      start += 1

So, basically, I’d take the index of the largest logit in each “subarray” (hence the argmax(…)), which means I get a total of batch_size outputs each for the start and the end positions (4, in the case mentioned above). In a way, I kinda picked the “best” one (or did I?) for each “row” in the batch.
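And then I just hand the predictions and references to the metric, something like this (a sketch; additional_params_val / additional_params_test are the reference lists I pulled out earlier, and isCalledFromTraining is my own flag):

import evaluate

squad_v2_metric = evaluate.load("squad_v2")
references = additional_params_val if isCalledFromTraining else additional_params_test
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results)  # exact, f1, etc. as reported by the metric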

Now, all of this “works” and the metric gets computed, but is it right to do it this way?
I saw that the official course handles it differently: it “scores” candidate spans by summing their start and end logits and picks the span with the biggest sum (Question answering - Hugging Face Course), but in the end it still keeps only the “best” one out of all the candidates that were scored. Would it even be possible to have multiple answers for one exact sample, considering prediction ids must be unique?
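For reference, my understanding of the course’s approach boils down to something like the sketch below (not their exact code; I’m assuming the validation features keep an example_id column and an offset_mapping where positions outside the context were set to None, the way the course prepares them):

import collections
import numpy as np

n_best = 20
max_answer_length = 30

def pick_best_answers(start_logits, end_logits, features, examples):
  # start_logits / end_logits: numpy arrays of shape (num_features, seq_len)
  # features: tokenized rows (with "offset_mapping" and "example_id")
  # examples: original rows (with "id" and "context")
  features_per_example = collections.defaultdict(list)
  for idx, feature in enumerate(features):
    features_per_example[feature["example_id"]].append(idx)

  predictions = []
  for example in examples:
    candidates = []
    for feature_idx in features_per_example[example["id"]]:
      start_logit = start_logits[feature_idx]
      end_logit = end_logits[feature_idx]
      offsets = features[feature_idx]["offset_mapping"]

      # look at the n_best start/end indices instead of a single argmax
      start_indexes = np.argsort(start_logit)[-n_best:][::-1]
      end_indexes = np.argsort(end_logit)[-n_best:][::-1]
      for s in start_indexes:
        for e in end_indexes:
          # skip spans that fall outside the context or are malformed
          if offsets[s] is None or offsets[e] is None:
            continue
          if e < s or e - s + 1 > max_answer_length:
            continue
          candidates.append({
            "text": example["context"][offsets[s][0]:offsets[e][1]],
            "score": start_logit[s] + end_logit[e],  # "score" = sum of the two logits
          })

    best = max(candidates, key=lambda c: c["score"]) if candidates else {"text": ""}
    predictions.append({
      "prediction_text": best["text"],
      "id": example["id"],
      "no_answer_probability": 0.
    })
  return predictions

So every feature of a sample contributes candidates, but in the end only one prediction per id survives, which is what made me wonder about the multiple-answers question above.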

Is my approach bad? If yes, can someone please tell me why? I’m lost; I’ve been thinking this through for a couple of days and it still feels like I’m doing something wrong…