I have been struggling to get padding to work properly with LLaMA-based models. It seems that LLaMA by default does not use a pad token. Does this mean that you simply can't have batch_size > 1?
Some suggestions on GitHub are to set pad_token = eos_token. The issue with that is that pad_token_id is already set in the generation config (generation_config.json · lmsys/vicuna-13b-delta-v1.1 at main), where it is 0. Why is there this inconsistency? Does it matter, and should I also set pad_token_id to eos_token_id?
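For context, this is roughly how I'm inspecting the tokenizer and the model's generation config (the model path is just a placeholder for my locally merged Vicuna checkpoint):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# placeholder path to a locally merged Vicuna-13B v1.1 checkpoint (delta already applied)
model_path = "path/to/vicuna-13b-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

print(tokenizer.pad_token, tokenizer.pad_token_id)   # None None -- LLaMA tokenizer has no pad token
print(tokenizer.eos_token, tokenizer.eos_token_id)   # </s> 2
print(model.generation_config.pad_token_id)          # 0, from generation_config.json
```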
Currently I simply set pad_token = eos_token but keep pad_token_id as 0, and I am noticing poorer performance with batched inference compared to single-example inference.
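My batched setup looks roughly like this (paths and prompts are placeholders, not my exact code):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "path/to/vicuna-13b-v1.1"  # same merged checkpoint as above

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # workaround from GitHub; generation_config.pad_token_id is still 0

model = AutoModelForCausalLM.from_pretrained(model_path)

prompts = [
    "What is the capital of France?",
    "Explain the difference between a list and a tuple in Python.",
]
# pad the shorter prompt so both fit in one batch
batch = tokenizer(prompts, return_tensors="pt", padding=True)

out = model.generate(**batch, max_new_tokens=64)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```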