AFAICT the mean_iou function from the Evaluate library does not actually compute IoU; it computes recall (sensitivity) instead. I only discovered this because I wrote my own function to compute Sorensen-Dice (a.k.a. the F1-score); my code first counts TP, TN, FP, and FN, and from those it computes Dice, precision, recall, and also (on a whim) IoU.
I noticed that my IoU differed from the library's IoU, but that the library's IoU matched my recall. Checking the library code, it does indeed seem to compute recall (sensitivity) instead of IoU. I’ve filed an issue here:
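A minimal sketch of that kind of computation, for the single-class case only and assuming the prediction and ground truth are binary masks stored as NumPy arrays. The function name binary_seg_metrics is hypothetical, and this is neither the Evaluate library's code nor the exact code described above:

```python
import numpy as np


def binary_seg_metrics(pred, target):
    """Confusion counts and derived metrics for two binary masks.

    Hypothetical helper for illustration; foreground is any nonzero pixel.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    tp = int(np.sum(pred & target))    # predicted foreground, truly foreground
    tn = int(np.sum(~pred & ~target))  # predicted background, truly background
    fp = int(np.sum(pred & ~target))   # predicted foreground, truly background
    fn = int(np.sum(~pred & target))   # predicted background, truly foreground

    def safe_div(num, den):
        # Return NaN instead of raising when a metric is undefined.
        return num / den if den > 0 else float("nan")

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
    }


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
m = binary_seg_metrics(pred, target)
print(m["recall"], m["iou"])
```

On these toy masks recall is 2/3 while IoU is 1/2, so the two are easy to tell apart whenever there is at least one false positive.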
Please check my findings and let me know what you think.
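If this is correct, the impact of the bug is substantial. Recall is never smaller than IoU, and it is strictly larger whenever there are both true positives and false positives, so reporting recall as IoU overestimates model performance.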
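To spell out why, using the standard definitions of both metrics in terms of the confusion counts (these are the textbook formulas, not something taken from the library code):

```latex
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

\mathrm{Recall} - \mathrm{IoU}
  = \frac{TP \cdot FP}{(TP + FN)\,(TP + FP + FN)} \;\ge\; 0
```

The difference is zero only when FP = 0 or TP = 0, so the gap grows with the number of false positives, which recall ignores entirely.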
I’ve only run tests with SegFormer, on a dataset with a single class (plus background). I have not tested multiclass segmentation with my code.