Subclassing DataCollator to pad additional inputs

I’m trying to use a model for token classification, but since the text comes from a .docx file, I want to add an extra input that represents each token’s formatting. I’ve collected the formatting and built a (short) formatting vector per token, which I concatenate to the model’s output vectors. However, DataCollatorForTokenClassification doesn’t pad these extra vectors, so I get an error during training (e.g. “ValueError: expected sequence of length 1024 at dim 1 (got 882)”). What is the best/easiest way to override the collator’s behavior to avoid this?
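For context, here is the kind of approach I’ve been considering: subclass the collator and override `torch_call`, popping the extra feature out before the parent pads the standard keys, then padding it to the same sequence length by hand. A minimal sketch, assuming the extra feature is stored per example under a key like `"formatting"` (the key name and zero-padding choice are my own assumptions):

```python
import torch
from transformers import DataCollatorForTokenClassification


class DataCollatorWithFormatting(DataCollatorForTokenClassification):
    """Pads a per-token "formatting" feature (hypothetical key name)
    alongside the input_ids/labels handled by the parent collator."""

    def torch_call(self, features):
        # Pop the extra feature first: the parent collator (and
        # tokenizer.pad) would fail on a ragged key it doesn't know about.
        formatting = [f.pop("formatting") for f in features]
        # Parent pads input_ids, attention_mask, and labels as usual.
        batch = super().torch_call(features)

        seq_len = batch["input_ids"].shape[1]
        fmt_dim = len(formatting[0][0])
        # Zero vectors for padding positions (an assumption; pick whatever
        # neutral value your model expects).
        padded = torch.zeros(len(formatting), seq_len, fmt_dim)
        for i, vecs in enumerate(formatting):
            t = torch.as_tensor(vecs, dtype=torch.float)
            if self.tokenizer.padding_side == "right":
                padded[i, : t.shape[0]] = t
            else:  # left padding
                padded[i, seq_len - t.shape[0] :] = t
        batch["formatting"] = padded
        return batch
```

The model’s `forward` would then pick up `batch["formatting"]` and concatenate it to the encoder outputs before the classification head. Is subclassing like this the intended extension point, or is there a cleaner way?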