Evaluating quantized models yields identical results across bit precisions

I have been benchmarking some models across different bit precisions, using Optimum Quanto and BitsAndBytes for quantization.
However, when I call trainer.evaluate(), the metrics are nearly identical between the quantized models and the float models (even at 2-bit integer precision), which seems very unlikely to me. Fine-tuning the quantized models does yield different results. My hunch is that trainer.evaluate() is not actually using the quantized layers, or something along those lines, since the same behavior shows up across different models and different PTQ methods.
To quantize, I use optimum.quanto.quantize(model) and optimum.quanto.freeze(model).
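
For context, a stripped-down sketch of that quantization step (the model name, the qint2 weight type, and the dummy inputs below are just placeholders for illustration; the comparison against a float copy is only meant as a quick check that the quantized layers actually change the forward pass):

```python
import copy
import torch
from transformers import AutoModelForSequenceClassification
from optimum.quanto import quantize, freeze, qint2

# Placeholder model; in the benchmarks this is swapped for the models under test.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
float_model = copy.deepcopy(model)  # keep a float copy for comparison

# Replace supported layers with quantized equivalents (weights-only here),
# then freeze() to materialize the integer weights.
quantize(model, weights=qint2)  # qint4 / qint8 for the other precisions
freeze(model)

# Heuristic check: the quantized modules (QLinear, etc.) should be present.
n_q = sum(1 for m in model.modules() if type(m).__name__.startswith("Q"))
print(f"quantized modules: {n_q}")

# A single forward pass at 2-bit precision should differ noticeably from the
# float copy (dummy token ids, just for illustration).
dummy = {"input_ids": torch.randint(0, 1000, (1, 16)),
         "attention_mask": torch.ones(1, 16, dtype=torch.long)}
with torch.no_grad():
    q_logits = model(**dummy).logits
    f_logits = float_model(**dummy).logits
print("max abs diff in logits:", (q_logits - f_logits).abs().max().item())
```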

Does anyone have an idea why this happens?

As a side note, to be able to call trainer.evaluate() on quantized models at all, I created a subclass of the Trainer class to circumvent training-related errors (sketched below).
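
Roughly, the idea of that subclass is to skip the training-specific setup so that only evaluate() is exercised. A simplified sketch (not the exact class; the overrides actually needed may differ depending on the transformers version and which training-related check fails):

```python
from transformers import Trainer

class EvalOnlyTrainer(Trainer):
    """Evaluation-only Trainer: skips optimizer/scheduler creation so that
    quantized models can be evaluated without training-specific setup."""

    def create_optimizer(self):
        # No optimizer is needed, since this trainer is only used for evaluate().
        self.optimizer = None
        return self.optimizer

    def create_scheduler(self, num_training_steps, optimizer=None):
        # Likewise, no learning-rate scheduler is needed.
        self.lr_scheduler = None
        return self.lr_scheduler

# Usage is the same as with the regular Trainer, e.g.:
# trainer = EvalOnlyTrainer(model=model, args=eval_args, eval_dataset=eval_ds)
# metrics = trainer.evaluate()
```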

Figure with results for reference: