Hello, I’m quite new to fine-tuning a model with my own data.
import json
from transformers import T5Tokenizer
# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
# Paths to your data files
combined_path = 'mydata_path'
# Load your dataset
with open(combined_path, 'r') as file:
    data = json.load(file)
input_lengths = []
output_lengths = []
for entry in data:
    src_text = entry['src'] + " " + entry['hist'] + " " + entry['index']
    tgt_text = entry['tgt']
    # Tokenize and get lengths
    input_lengths.append(len(tokenizer.tokenize(src_text)))
    output_lengths.append(len(tokenizer.tokenize(tgt_text)))
max_input_length = max(input_lengths)
max_output_length = max(output_lengths)
print(f"Max input length: {max_input_length}")
print(f"Max output length: {max_output_length}")
I analyzed my data with the tokenizer, and the results were as follows.
Max input length: 30052
Max output length: 1012
However, I found out that the maximum input length for T5-base is 512 tokens.
Is there a good way to solve this problem before fine-tuning T5-base on my dataset? How should I deal with such large inputs?
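One workaround I’m considering is simply truncating the inputs (and targets) to 512 tokens when tokenizing, roughly like the sketch below. This is just my own assumption about how to handle it, not something I’ve validated, and I’m worried it would throw away most of the context.

# Rough sketch (my assumption): truncate source and target to the model's limit
model_inputs = tokenizer(
    src_text,
    max_length=512,      # T5-base's default maximum input length
    truncation=True,
    return_tensors="pt",
)
labels = tokenizer(
    text_target=tgt_text,
    max_length=512,
    truncation=True,
    return_tensors="pt",
)

Or would it be better to split each long input into multiple 512-token chunks instead of truncating?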
Thank you.