Unit of max_answer_length in run_qa.py script?

I’m currently trying to finetune my BERT model on a question answering task using the run_qa.py script. I’m curious what the unit of the max_answer_length argument is: is it the length in characters, or in tokens after tokenization?

Also, is there any guidance on which value works best? Does it make sense to set it to the length of the longest answer in the dataset (which seems risky when one answer is very long and the rest are short), or to use something like the average answer length instead?

Thanks in advance!

It’s in tokens. We usually use the same default as the original Google script, which works quite well in practice.
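To illustrate how a token-based max_answer_length is typically applied, here is a minimal sketch of the span-filtering step done during QA post-processing. The function name and inputs are illustrative, not the exact run_qa.py code: candidate (start, end) token-index pairs whose span exceeds max_answer_length tokens are simply discarded.

```python
def filter_spans(start_indices, end_indices, max_answer_length):
    """Keep (start, end) token-index pairs whose span length is valid."""
    candidates = []
    for start in start_indices:
        for end in end_indices:
            # Span length is measured in tokens, inclusive of both ends.
            length = end - start + 1
            if end < start or length > max_answer_length:
                continue
            candidates.append((start, end))
    return candidates

# Example: the span 3..40 (38 tokens) is dropped, 3..5 (3 tokens) is kept.
spans = filter_spans([3, 10], [5, 40], max_answer_length=30)
print(spans)  # [(3, 5)]
```

Because the limit is enforced on token indices like this, a very large max_answer_length mainly increases the number of candidate spans to score rather than improving accuracy, which is why the default tends to work well.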