Is it possible to train a seq2seq model like T5 or BART to convert a string into a list of strings? On my first attempt, the tokenizer complained that my 2D list of labels isn’t the correct data type:
```
File "/home/matt/miniconda3/envs/nlp/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 429, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```
I suppose I could concatenate the strings in each training example into one target string, but then I'd need a potentially error-prone splitter to break the model's output apart again. Maybe using a special character as a delimiter is the answer here?
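To make the delimiter idea concrete, here's a minimal sketch of the join/split round trip. The `<sep>` marker is my own arbitrary choice (not anything T5 or BART defines), and this assumes it never occurs inside a label string:

```python
# Hypothetical delimiter; assumes "<sep>" never appears inside any label.
SEP = " <sep> "

def join_labels(label_list):
    """Collapse a list of target strings into a single string for the tokenizer."""
    return SEP.join(label_list)

def split_prediction(decoded):
    """Recover the list of strings from a decoded model output."""
    return [part.strip() for part in decoded.split("<sep>") if part.strip()]

labels = [["red", "green"], ["blue"]]
flat = [join_labels(lbls) for lbls in labels]
assert flat == ["red <sep> green", "blue"]
assert [split_prediction(s) for s in flat] == labels
```

If you go this route, you'd probably also want to register the marker with the tokenizer (e.g. via `tokenizer.add_tokens(["<sep>"])`) so it's encoded as a single token rather than shredded into subwords.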
It’s not super relevant, but here’s how I’m invoking the tokenizer, using a subclass of

```python
tokenizer = AutoTokenizer.from_pretrained(args.model_name)
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')
decodings = tokenizer(labels, truncation=True, padding=True, return_tensors='pt')
dataset_tokenized = Dataset(encodings, decodings)
```
What is relevant is that my `texts` variable is a list of strings, while my `labels` variable is a 2D list of strings, which obviously isn’t allowed.