IndexError on Evaluator (Token classification)

Hi all,

I’m trying to evaluate NER datasets. Evaluating on CoNLL-2003 English and CoNLL-2002 Spanish works fine, but I get an IndexError when evaluating on CoNLL-2002 Dutch.

Here is my code:

from evaluate import evaluator
from datasets import load_dataset
task_evaluator = evaluator("token-classification")

data = load_dataset("conll2002",'nl',split="test")

results = task_evaluator.compute(
    model_or_pipeline="xlm-roberta-large-finetuned-conll03-english",
    data=data,
    metric="seqeval",
)
print(results)

And here is the error:

Cell In [5], line 1
----> 1 results = task_evaluator.compute(
      2     model_or_pipeline="xlm-roberta-large-finetuned-conll03-english",
      3     data=data,
      4     metric="seqeval",
      5 )
      6 print(results)

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/evaluate/evaluator/token_classification.py:253, in TokenClassificationEvaluator.compute(self, model_or_pipeline, data, subset, split, metric, tokenizer, strategy, confidence_level, n_resamples, device, random_state, input_column, label_column, join_by)
    251 # Compute predictions
    252 predictions, perf_results = self.call_pipeline(pipe, pipe_inputs)
--> 253 predictions = self.predictions_processor(predictions, data[input_column], join_by)
    254 metric_inputs.update(predictions)
    256 # Compute metrics from references and predictions

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/evaluate/evaluator/token_classification.py:125, in TokenClassificationEvaluator.predictions_processor(self, predictions, words, join_by)
    122 token_index = 0
    123 for word_offset in words_offsets:
    124     # for each word, we may keep only the predicted label for the first token, discard the others
--> 125     while prediction[token_index]["start"] < word_offset[0]:
    126         token_index += 1
    128     if prediction[token_index]["start"] > word_offset[0]:  # bad indexing

IndexError: list index out of range
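For context, the failing loop can be reproduced in isolation. The function below is a simplified sketch of what `predictions_processor` does (names and structure are my assumption, not the actual library code): it walks the pipeline's per-subtoken predictions to keep only the first subtoken's label for each word. If the pipeline returns fewer subtokens than the words require (for example because the input was truncated at the model's maximum length, or a word produced no subtokens), the inner while-loop walks past the end of the prediction list, which matches the IndexError above.

```python
# Simplified sketch of evaluate's first-subtoken alignment loop
# (my reconstruction, not the library's actual code).

def align_first_subtoken(prediction, words_offsets):
    """For each word, keep only the prediction of its first sub-token."""
    labels = []
    token_index = 0
    for word_offset in words_offsets:
        # Advance to the first sub-token starting at this word's offset.
        # If the predictions stop early (e.g. truncated input), this
        # indexes past the end -> IndexError: list index out of range.
        while prediction[token_index]["start"] < word_offset[0]:
            token_index += 1
        labels.append(prediction[token_index]["entity"])
    return labels

# Normal case: one prediction entry per sub-token, covering every word.
preds = [{"start": 0, "entity": "O"}, {"start": 4, "entity": "B-PER"}]
offsets = [(0, 3), (4, 8)]
print(align_first_subtoken(preds, offsets))  # ['O', 'B-PER']

# Truncated case: the second word has no sub-token prediction left,
# so the while-loop runs off the end of the list.
truncated = [{"start": 0, "entity": "O"}]
try:
    align_first_subtoken(truncated, offsets)
except IndexError as e:
    print("IndexError:", e)
```

If that is what's happening here, it would mean some Dutch test sentences produce a prediction list shorter than the word list — but that is my guess at the mechanism, not a confirmed diagnosis.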

Python 3.9
Transformers 4.24.0.dev0
Evaluate 0.3.0
Torch 1.12.1

Thank you.

Struggling with the same issue (RoBERTa-base fine-tuned on a custom dataset).
Evaluating on the test set using trainer.evaluate() works fine.
Any progress toward a solution?