Finetuning T5 series models with my own data

Hello, I’m quite new to fine-tuning models with my own data.

import json
from transformers import T5Tokenizer

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)

# Path to your data file
combined_path = 'mydata_path'

# Load your dataset
with open(combined_path, 'r') as file:
    data = json.load(file)

input_lengths = []
output_lengths = []

for entry in data:
    src_text = entry['src'] + " " + entry['hist'] + " " + entry['index']
    tgt_text = entry['tgt']
    
    # Tokenize and get lengths
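    # Note: tokenize() does not add the </s> EOS token, so the actual encoded
    # lengths are one token longer than counted here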
    input_lengths.append(len(tokenizer.tokenize(src_text)))
    output_lengths.append(len(tokenizer.tokenize(tgt_text)))

max_input_length = max(input_lengths)
max_output_length = max(output_lengths)

print(f"Max input length: {max_input_length}")
print(f"Max output length: {max_output_length}")

I analyzed my data with the tokenizer as shown above, and these were the results:

Max input length: 30052
Max output length: 1012

I also found out that the maximum input length for T5-base is 512 tokens.
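
For example, if I just let the tokenizer truncate at 512 tokens (a rough sketch below; long_src is simply the first sample from my data), almost all of the input gets thrown away:

# Rough sketch: encoding one long sample with truncation at the 512-token limit.
# Everything past the first 512 tokens is silently dropped.
long_src = data[0]['src'] + " " + data[0]['hist'] + " " + data[0]['index']
encoded = tokenizer(
    long_src,
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512]) for inputs longer than 512 tokens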

Is there a good way to handle this before fine-tuning T5-base on my dataset? How should I deal with such long inputs?
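
One idea I had was to split each long input into overlapping token windows, roughly like the sketch below (chunk_size and stride are just values I picked, and chunk_tokens is my own helper, not something from the library), but I’m not sure whether this is a sensible approach for a seq2seq task:

def chunk_tokens(text, chunk_size=512, stride=384):
    # Split a long text into overlapping windows of at most chunk_size tokens.
    # stride < chunk_size, so consecutive windows overlap by chunk_size - stride tokens.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = []
    for start in range(0, len(ids), stride):
        window = ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(ids):
            break
    return chunks

print(len(chunk_tokens(long_src)))  # number of windows for the long sample from above

But I don’t know how I would map the target text to each window, so I’d appreciate any guidance.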

Thank you.