T5 tokenizer / ideal method of calculating max_sequence_length?

jrandel · January 27, 2022, 9:08pm

Hi!

So I’ve developed an incremental fine-tuning pipeline based on T5-large, and it’s somewhat vexing in terms of OOM issues and whatnot, even on a V100-class GPU with 16GB of contiguous memory. The dataset is constantly changing, so I’m attempting to establish ideal hyperparams for each training run, for example by calculating max_sequence_length dynamically:

"max_seq_length": len(tokenizer(df.loc[df.input_text.astype(str).map(len).argmax(), 'input_text'])['input_ids']),
"max_source_length": len(tokenizer(df.loc[df.input_text.astype(str).map(len).argmax(), 'input_text'])['input_ids']),
"max_target_length": len(tokenizer(df.loc[df.target_text.astype(str).map(len).argmax(), 'target_text'])['input_ids'])

Is this a reasonable approach to keep memory consumption down? And do I need to add any padding to allow for tokens that get added programmatically during fine-tuning?
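
One caveat with the snippet above: it tokenizes only the row with the most characters, and the longest string by character count is not necessarily the longest by token count. Below is a minimal sketch of measuring token lengths over every row instead, with an optional percentile cap so one outlier does not set the length for the whole run. The names `token_length_cap` and `toy_encode` are made up for illustration, and `toy_encode` is only a stand-in so the example runs without downloading a model.

```python
import math
from typing import Callable, Iterable, List


def token_length_cap(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    percentile: float = 100.0,
) -> int:
    """Return the token length at the given percentile over all texts."""
    # Measure every row in tokens rather than in characters.
    lengths = sorted(len(encode(str(t))) for t in texts)
    if not lengths:
        raise ValueError("no texts given")
    # Nearest-rank percentile: smallest length that covers `percentile` % of rows.
    rank = max(1, math.ceil(percentile / 100.0 * len(lengths)))
    return lengths[rank - 1]


def toy_encode(text: str) -> List[int]:
    # Hypothetical stand-in encoder: one "token" per whitespace-separated word.
    return list(range(len(text.split())))


texts = ["a b c d e f", "supercalifragilistic word", "x y z"]

# The second string has the most characters but the fewest tokens here.
max_len = token_length_cap(texts, toy_encode)        # longest row in tokens
p50_len = token_length_cap(texts, toy_encode, 50.0)  # median row in tokens
```

With the real tokenizer, the encoder could presumably be passed as `lambda t: tokenizer(t)['input_ids']`, and the texts as `df.input_text` or `df.target_text`. As far as I know, the T5 tokenizer appends its end-of-sequence token when called this way, so that token would already be included in the count. Anything below 100 for `percentile` assumes truncation is enabled for the rows that exceed the cap.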

TIA!

saireddy · May 22, 2024, 7:26pm

Have you found a solution for this?
