Train T5/BART to convert a string into multiple strings

mph · December 3, 2022, 6:08pm

Is it possible to train a seq2seq model like T5 or BART to convert a string into a list of strings? On my first attempt, the tokenizer complained that my 2D list of labels isn’t the correct data type:

File "/home/matt/miniconda3/envs/nlp/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 429, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I suppose I could concatenate the multiple strings in each of my training examples, but then I’d have to use a potentially error-prone splitter to split them up again. Maybe using a special character as a delimiter is the answer here?

It’s not super relevant, but here’s how I’m invoking the tokenizer, using a subclass of torch.utils.data.Dataset:

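from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(args.model_name)
# texts: list of strings (one source string per example)
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')
# labels: 2D list of strings here, which is what triggers the TypeError above
decodings = tokenizer(labels, truncation=True, padding=True, return_tensors='pt')
dataset_tokenized = Dataset(encodings, decodings)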

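What is relevant is that my texts variable is a list of strings, and my labels variable is a 2D list of strings. The tokenizer rejects the latter because it expects each batch item to be a single string (or a pair of strings), not a list of several.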

mph · December 10, 2022, 7:24am

Using a special delimiter worked great! I chose the pipe character.

pairs = [(source, ' | '.join(target)) for source, target in pairs]
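
For anyone landing here later, a minimal sketch of the full round trip (joining the targets before training, splitting the decoded output afterwards) might look like the following. The helper names join_targets and split_prediction and the sample data are hypothetical, not part of transformers, and the guard assumes the pipe never appears inside a real target string. It is also worth checking that the chosen delimiter survives your tokenizer's encode/decode cycle unchanged.

```python
DELIMITER = " | "  # same separator as in the one-liner above


def join_targets(targets):
    """Flatten a list of target strings into a single label string."""
    for target in targets:
        # Assumption: a pipe inside a target would make the split ambiguous.
        if "|" in target:
            raise ValueError(f"target contains the delimiter: {target!r}")
    return DELIMITER.join(targets)


def split_prediction(text):
    """Split decoded model output back into a list of strings.

    Splits on the bare pipe and strips whitespace, so output with
    uneven spacing around the delimiter still parses; empty pieces
    (e.g. from a trailing pipe) are dropped.
    """
    return [part.strip() for part in text.split("|") if part.strip()]


# Hypothetical training pairs: one source string, several target strings.
pairs = [
    ("red apple and green pear", ["red apple", "green pear"]),
    ("one two", ["one", "two"]),
]
flat_pairs = [(source, join_targets(targets)) for source, targets in pairs]
```

After generation, something like split_prediction(tokenizer.decode(output_ids, skip_special_tokens=True)) would recover the list, with output_ids standing in for whatever your generate call returns.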
