Hello, I’m quite new to fine-tuning a model with my own data.
import json
from transformers import T5Tokenizer
# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
# Paths to your data files
combined_path = 'mydata_path'
# Load your dataset
with open(combined_path, 'r') as file:
    data = json.load(file)
input_lengths = []
output_lengths = []
for entry in data:
    src_text = entry['src'] + " " + entry['hist'] + " " + entry['index']
    tgt_text = entry['tgt']
    # Tokenize and get lengths
    input_lengths.append(len(tokenizer.tokenize(src_text)))
    output_lengths.append(len(tokenizer.tokenize(tgt_text)))
max_input_length = max(input_lengths)
max_output_length = max(output_lengths)
print(f"Max input length: {max_input_length}")
print(f"Max output length: {max_output_length}")
I analyzed my data with the tokenizer, and the results were as follows.
Max input length: 30052
Max output length: 1012
However, I found out that the maximum input length for T5-base is 512 tokens.
Is there a good way to solve this problem before fine-tuning T5-base on my dataset? How should I deal with such large inputs?
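One workaround I’m considering is simply truncating the inputs (and targets) to 512 tokens when tokenizing, roughly like the sketch below. This is just my own assumption about how to handle it, not something I’ve validated, and I’m worried it would throw away most of the context.

# Rough sketch (my assumption): truncate source and target to the model's limit
model_inputs = tokenizer(
    src_text,
    max_length=512,      # T5-base's default maximum input length
    truncation=True,
    return_tensors="pt",
)
labels = tokenizer(
    text_target=tgt_text,
    max_length=512,
    truncation=True,
    return_tensors="pt",
)

Or would it be better to split each long input into multiple 512-token chunks instead of truncating?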
Thank you.