Is BART guaranteed not to mess up unmasked tokens during text infilling?

Hi all, I am following this example in the documentation to do text infilling with BART. Given the input below, the model is expected to produce the following output:

Input:  UN Chief Says There Is No **<mask>** in Syria
Output: UN Chief Says There Is No **Plan to Stop Chemical Weapons** in Syria
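For reference, this is roughly the code I am running, adapted from the docs example (the checkpoint name and the generation arguments here are just what I happen to be using, not necessarily the exact values from the docs):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Load a BART checkpoint and its tokenizer (checkpoint choice is mine)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

text = "UN Chief Says There Is No <mask> in Syria"
inputs = tokenizer([text], return_tensors="pt")

# generate() decodes the whole output sentence left to right with beam search
generated_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=25)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```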

When I look into the generate() method, it appears that the output sentence is generated token by token, from the first token (UN) to the last (Syria), via beam search.
So I am not sure how the unmasked tokens are kept unchanged during decoding. For example, how does it prevent the following from happening?

Input:  UN Chief Says There Is No **<mask>** in Syria
Output: UN Chief Says There Is No **Plan to Stop Chemical Weapons** in **Iraq**
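To make my confusion concrete, here is my (possibly wrong) mental picture of what generate() is doing under the hood, written as a plain greedy loop and ignoring the beam-search bookkeeping. As far as I can tell, nothing in this loop explicitly forces the decoder to copy the unmasked input tokens:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

inputs = tokenizer(["UN Chief Says There Is No <mask> in Syria"], return_tensors="pt")

# Greedy sketch of autoregressive decoding (my mental model of generate(),
# not its actual internals, which use beam search). The decoder predicts one
# token at a time, conditioned only on the encoder output and what it has
# generated so far -- nothing here pins "Syria" to stay "Syria".
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    for _ in range(30):
        out = model(input_ids=inputs["input_ids"], decoder_input_ids=decoder_ids)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```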

Is it possible that BART can mess up unmasked tokens (with low probability)? If not, what does it do to prevent that?

Thanks.