Data collation: cannot understand the logic of the API

I am a bit confused about the logic behind the data collators for seq2seq models in the Hugging Face library.

My training set is composed of N subsets of pairs (source_text, target_text), and the goal is to train a sequence-to-sequence model (specifically from the T5 family) to generate the target_text from the input source_text.

The experiments are based on a cross-validation where N-1 subsets are used for training and the remaining one for testing.

I do not “see” how to correctly perform the encoding (tokenisation + truncation + padding) for training the model.

I want the model to be trained on input sequences of length N_in_train to output sequences of length N_out_train, i.e., I want to compute the two lengths N_in_train and N_out_train from the data in the training set.

Since the encoding (tokenisation) takes a long time (several minutes), this is what I did:

  • pre-encode the entire dataset
  • save the encoded dataset to disk (roughly as in the sketch below).
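Concretely, the pre-encoding step looks roughly like this (a minimal sketch: the column names source_text/target_text, the file paths and the t5-base checkpoint are placeholders for my actual setup):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # placeholder checkpoint

def encode(batch):
    # No truncation and no padding here: the lengths are decided later, per fold.
    model_inputs = tokenizer(batch["source_text"])
    labels = tokenizer(text_target=batch["target_text"])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw_dataset = load_dataset("json", data_files="pairs.json")["train"]  # placeholder path
encoded_dataset = raw_dataset.map(encode, batched=True,
                                  remove_columns=raw_dataset.column_names)
encoded_dataset.save_to_disk("encoded_dataset")  # reload later with datasets.load_from_disk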

When training the model, for each iteration of the cross-validation I want to compute the length of the input and output sequences:

  • compute the max length of the sources and the max length of the targets.
  • From these two values, I then set N_in_train and N_out_train (see the sketch after this list).
  • Then I want to train and test the model using the two values above. I need to truncate and pad sequences based on N_in_train and N_out_train.
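In code, the length computation is simply something like this (where train_split stands for the concatenation of the N-1 pre-encoded training subsets of the current fold):

# train_split: pre-encoded training subsets of the current CV iteration
N_in_train = max(len(example["input_ids"]) for example in train_split)
N_out_train = max(len(example["labels"]) for example in train_split)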

Now, I would like to use a data collator to truncate and pad the batches in both the training and the test set. For example, I wrote this (where max_source_length and max_target_length correspond to, respectively, N_in_train and N_out_train):

import torch
from transformers import DataCollatorForSeq2Seq


class MyCollator(DataCollatorForSeq2Seq):
    def __init__(self, tokenizer, max_source_length, max_target_length, model=None, padding=True, ignore_label=-100):
        # Forward only the arguments the parent collator knows about.
        super().__init__(tokenizer=tokenizer, model=model, padding=padding, label_pad_token_id=ignore_label)
        self.max_source_len = max_source_length
        self.max_target_len = max_target_length
        self.ignore_label = ignore_label

    def __call__(self, features):
        for feature in features:
            # Truncate to the per-fold lengths.
            feature['input_ids'] = feature['input_ids'][:self.max_source_len]
            feature['attention_mask'] = feature['attention_mask'][:self.max_source_len]
            feature['labels'] = feature['labels'][:self.max_target_len]

            # Pad everything to exactly max_source_len / max_target_len.
            feature['input_ids'] = torch.tensor(
                feature['input_ids'] + [self.tokenizer.pad_token_id] * (self.max_source_len - len(feature['input_ids'])),
                dtype=torch.long
            )
            feature['attention_mask'] = torch.tensor(
                feature['attention_mask'] + [0] * (self.max_source_len - len(feature['attention_mask'])),
                dtype=torch.long
            )
            feature['labels'] = torch.tensor(
                feature['labels'] + [self.ignore_label] * (self.max_target_len - len(feature['labels'])),
                dtype=torch.long
            )

        # The parent collator builds the batch and, if a model is given,
        # prepares decoder_input_ids from the labels.
        return super().__call__(features)
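For completeness, this is roughly how I then plug the collator into the trainer (model, train_split and test_split stand for my actual model and CV splits; the training arguments are only illustrative):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

collator = MyCollator(tokenizer, N_in_train, N_out_train, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=train_split,
    eval_dataset=test_split,
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()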

However, it seems too clumsy for a very basic pre-processing step. Additionally, looking at the docs here: Data Collator

I see only one parameter, max_length, and not two for, respectively, the input and the output sequences. Why?

What is the best way to pre-tokenise a dataset (without truncation and padding) and then specify the lengths of the various sequences only at training time?