How can I pad the vocab to a set multiple?

Probably an easy one, but I’m not having any luck finding a solution, so I thought I’d make a post.

To use Tensor Cores effectively with mixed-precision training, an NVIDIA guide recommends that you “pad vocabulary to be a multiple of 8”.

I’ve searched the tokenizers documentation but haven’t had much luck. The closest I could find is the pp_tokenizer.vocab_size attribute, which returns the current vocab size, but I can’t assign it a new value.

Any idea how I can do this?

Hi,

You can pass the pad_to_multiple_of argument to a tokenizer in Transformers (this is supported for both fast and slow tokenizers):

pad_to_multiple_of (optional, int): if set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
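
For example (using bert-base-uncased purely as an illustrative checkpoint), padding each batch to a multiple of 8 looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any checkpoint works the same way

# Pad the batch so the sequence dimension is a multiple of 8,
# which lines up nicely with Tensor Core tile sizes.
batch = tokenizer(
    ["a short sentence", "a slightly longer example sentence"],
    padding=True,
    pad_to_multiple_of=8,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # last dimension is a multiple of 8
```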

Thanks for the reply. That argument is useful for padding sequence inputs, but I was looking for something that could resize the matrices that depend on the vocab_size of a transformer.

For example, the embedding matrix of a transformer typically has a shape like (vocab_size, 1024), where vocab_size might be something like 52153. A matrix of that size isn’t efficient to pass to the Tensor Cores, so I was looking for a way to pad vocab_size up to a multiple of 8 (e.g. to 52160).
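
In other words, I’d want to round the vocab size up to the next multiple of 8, something like:

```python
vocab_size = 52153
padded_vocab_size = ((vocab_size + 7) // 8) * 8  # 52160
```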

You can resize the embedding matrix of a Transformer model using the resize_token_embeddings method (see docs).
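
A minimal sketch of what that could look like (using gpt2 purely as an illustrative checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Round the vocabulary size up to the next multiple of 8 and resize the
# input embeddings (and any tied output layer) to that size.
padded_size = ((len(tokenizer) + 7) // 8) * 8
model.resize_token_embeddings(padded_size)

print(model.get_input_embeddings().weight.shape)  # first dim is now a multiple of 8
```

Depending on your version of Transformers, resize_token_embeddings may also accept a pad_to_multiple_of argument that does this rounding for you.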

Thanks! That looks like exactly what I was looking for.