How can I pad the vocab to a set multiple?

Probably an easy one, but not having any luck in finding the solution, so thought I’d make a post.

To use tensor cores effectively with mixed precision training, an NVIDIA guide recommends you “pad vocabulary to be a multiple of 8”.

I’ve searched the tokenizers documentation for answers but haven’t had much luck. The closest I could find is the pp_tokenizer.vocab_size attribute, which returns the current vocab size, but I can’t assign it a new value.

Any idea how I can do this?


You can provide the argument pad_to_multiple_of to a tokenizer in Transformers (this is supported for both fast and slow tokenizers):

pad_to_multiple_of: (optional) Integer, if set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
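To make the effect concrete, here is a minimal sketch of what that kind of padding does to a single sequence of token ids. The helper function and the pad_token_id of 0 are illustrative assumptions, not the library's internals; in Transformers itself you would simply pass pad_to_multiple_of=8 (together with padding=True) when calling the tokenizer.

```python
# Illustrative sketch of padding a sequence length up to a multiple of 8.
# The function name and pad_token_id=0 are assumptions for this example,
# not the actual Transformers implementation.
def pad_to_multiple(ids, multiple, pad_token_id=0):
    remainder = len(ids) % multiple
    if remainder:
        # Append pad tokens until the length is divisible by `multiple`.
        ids = ids + [pad_token_id] * (multiple - remainder)
    return ids

# A 6-token sequence gets padded to length 8.
padded = pad_to_multiple([101, 2023, 2003, 1037, 7099, 102], 8)
print(len(padded))  # 8
```

A length that is already a multiple of 8 is left untouched, which is the same behaviour you would want from the tokenizer argument.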

Thanks for the reply. This method is useful for a sequence input, but I was looking more for something that could resize the matrices that depend on the vocab_size of a transformer.

e.g. the embedding matrix of a transformer usually takes on dimensions something like (vocab_size, 1024), where the vocab_size might be something like 52153. A matrix of this size isn’t efficient to pass onto the tensor cores, so I was looking for a way to pad it so it was a multiple of 8 (e.g. to 52160).

You can resize the embedding matrix of a Transformer model using the resize_token_embeddings method (see docs).
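The round-up arithmetic for the new vocab size can be sketched as follows. The helper name padded_vocab_size is an assumption for this example; the resize_token_embeddings call on the last (commented) line is the real Transformers method, shown here without a concrete model.

```python
# Round a vocab size up to the nearest multiple (8 for tensor cores).
# `padded_vocab_size` is a hypothetical helper, not a Transformers API.
def padded_vocab_size(vocab_size, multiple=8):
    return ((vocab_size + multiple - 1) // multiple) * multiple

# The example from the question: 52153 rounds up to 52160.
print(padded_vocab_size(52153))  # 52160

# With a loaded Transformers model you would then call, e.g.:
# model.resize_token_embeddings(padded_vocab_size(model.config.vocab_size))
```

resize_token_embeddings pads the embedding matrix (and the tied output layer, if any) with newly initialized rows, so the extra slots are simply unused token ids.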

Thanks! That looks like exactly what I’m looking for.