Pipelines for multiple inputs don't produce reliable results

I am using a text classification pipeline (‘sentiment-analysis’) with a fine-tuned ELECTRA model and transformers version 4.5.1.
For some reason, calling the pipeline on a list of inputs gives different outputs for each input than applying the pipeline to each input individually. Why is that? I went through the patch notes but couldn’t find a fix for this issue, so I’m not sure whether it still persists in recent versions.
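
To put a number on the discrepancy, here is a minimal sketch. The helper `max_score_gap` is hypothetical (it is not part of transformers), and it assumes `classify` is any callable that takes a list of strings and returns one dict with a `score` key per input, which is the shape a text classification pipeline returns for a list. The model path in the usage comment is a placeholder.

```python
def max_score_gap(classify, texts):
    """Largest absolute difference between the scores from one batched call
    and the scores from calling `classify` on each text separately.

    Assumption: `classify(list_of_str)` returns a list of dicts that each
    contain a float under the key "score" (the top-label score).
    """
    batched = classify(texts)
    one_by_one = [classify([text])[0] for text in texts]
    return max(abs(b["score"] - s["score"]) for b, s in zip(batched, one_by_one))


# Usage with a real pipeline (placeholder model path, not run here):
# from transformers import pipeline
# pipe = pipeline("sentiment-analysis", model="path/to/fine-tuned-electra")
# print(max_score_gap(pipe, ["first sentence", "a second, longer sentence"]))
```

A gap of exactly 0.0 would mean batching had no effect. Note that this only compares the top-label scores, so it would not flag a case where the predicted label itself flips.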

Okay, so I narrowed this down: the issue is not related to pipelines, but to ELECTRA. The model’s outputs change with the batch size.

I found out that this is not related to the transformers package, but is probably due to PyTorch optimisations: operations may happen in a different order depending on the input tensor. Because floating-point arithmetic is not exact, a different order of operations can give slightly different results, so the same input sentence may produce different outputs when it is batched with other sentences. As this depends not (only) on the shape of the tensor but also on its content, the only safe way to get exactly the same output is to apply the pipeline to one input sentence at a time.
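
The underlying effect can be shown without any model at all. This is only an illustration in plain Python, not what PyTorch does internally: summing the same three numbers in a different order gives results that differ in the last bits.

```python
import math

values = [0.1, 0.2, 0.3]

# Same numbers, same operation, different grouping.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)  # False
print(abs(left_to_right - right_to_left))  # tiny, but not zero

# The two results still agree within a small tolerance.
print(math.isclose(left_to_right, right_to_left))
```

A batched matrix multiplication performs many such additions, so any change in how they are grouped can shift the final probabilities slightly.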
For most use cases, class probabilities that change by around 10^-6 don’t matter, but if you require exact results, be aware of this issue!
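
If approximate agreement is enough, for example in a test that compares batched and single-input scores, one option is to compare with a tolerance instead of `==`. This is a sketch: `scores_match` is a hypothetical helper, and the default tolerance of 1e-5 is an assumption chosen to sit just above the differences of around 10^-6 described here.

```python
import math


def scores_match(scores_a, scores_b, abs_tol=1e-5):
    """True if both lists have the same length and every pair of scores
    differs by at most `abs_tol` (an assumed tolerance; adjust as needed)."""
    if len(scores_a) != len(scores_b):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
        for a, b in zip(scores_a, scores_b)
    )
```

For instance, `scores_match([0.912345], [0.912346])` passes, while a real disagreement such as 0.9 versus 0.8 does not.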