Train tokenizer for seq2seq model

Hi all,

I am trying to train a seq2seq model using T5 v1.1.
My task is simple: I want to map text from format A to format B,
for example, 19 July 2020 → 19/07/2020. This is not my real data, just an example of what I am doing.

I want to train the tokenizer for this data, but for the seq2seq model, the tokenizer needs to tokenize both input data and label, right?

So, I am a bit confused about arranging the data to train the tokenizer.
Should I concatenate the input data and labels together and then pass them to old_tokenizer.train_new_from_iterator?
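For context, something like this is what I had in mind (the inputs/targets lists are dummy stand-ins for my data, and the checkpoint name in the comment is just a guess):

```python
# Sketch: build one iterator over BOTH source and target strings, since the
# tokenizer only needs to see raw text, not (input, label) pairs.
inputs = ["19 July 2020", "1 January 1999"]   # format A (placeholder data)
targets = ["19/07/2020", "01/01/1999"]        # format B (placeholder data)

def corpus_iterator(batch_size=1000):
    """Yield batches of raw text drawn from both sides of the pairs."""
    corpus = inputs + targets
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Then (assuming a T5 v1.1 checkpoint) the call from the NLP Course would be:
# from transformers import AutoTokenizer
# old_tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small")
# new_tokenizer = old_tokenizer.train_new_from_iterator(
#     corpus_iterator(), vocab_size=32000
# )
```

Is this the right way to arrange the data, or should the two sides be kept separate somehow?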

I have read the doc *Training a new tokenizer from an old one* in the Hugging Face NLP Course,
but I am still confused about the seq2seq setting.