Vision Transformer embeddings interpolation

When fine-tuning a Vision Transformer on larger image sizes, there is a big discrepancy between how Hugging Face interpolates the position embeddings and how the timm library does it.

The timm library interpolates the position embeddings once, resizing them to the new resolution right before fine-tuning starts, whereas Hugging Face keeps the original embeddings and interpolates them on every forward pass during fine-tuning. Have both approaches been shown not to affect performance much? If not, which one is the correct way?
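To make the two strategies concrete, here is a minimal PyTorch sketch of what I mean. It is not the actual code from either library: `resize_pos_embed` and `forward_pos_embed` are hypothetical helper names, and the 14x14 to 24x24 grid, the 768-dim embedding, the single class token, and the bicubic mode are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, new_grid, num_prefix_tokens=1):
    """Resample (1, prefix + old_grid**2, dim) position embeddings to a new square grid."""
    # Class/prefix tokens have no spatial position, so they are passed through unchanged.
    prefix = pos_embed[:, :num_prefix_tokens]
    patches = pos_embed[:, num_prefix_tokens:]
    old_grid = int(round(patches.shape[1] ** 0.5))
    dim = patches.shape[2]
    # (1, N, D) -> (1, D, H, W), the layout F.interpolate expects for 2D resampling.
    patches = patches.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patches = F.interpolate(
        patches, size=(new_grid, new_grid), mode="bicubic", align_corners=False
    )
    # (1, D, H', W') -> (1, N', D)
    patches = patches.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([prefix, patches], dim=1)


# Pretrained embeddings for a 14x14 patch grid plus one class token (random stand-in).
pos_embed = torch.nn.Parameter(torch.randn(1, 1 + 14 * 14, 768))

# Strategy A (interpolate once): the resized tensor becomes the new trainable parameter.
resized_once = torch.nn.Parameter(resize_pos_embed(pos_embed.data, new_grid=24))


# Strategy B (interpolate every step): the parameter keeps its original size and
# gradients flow back through the interpolation on each forward pass.
def forward_pos_embed(pos_embed, new_grid):
    return resize_pos_embed(pos_embed, new_grid)
```

In this sketch, strategy A ends up training 1 + 24*24 position vectors directly, while strategy B still trains only the original 1 + 14*14 vectors and resamples them each step, so the two are not the same parameterization during fine-tuning.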