No labels column for tokenized data

ablam · June 23, 2022, 4:55pm

I’m tokenizing a custom dataset to fine-tune a model for code generation. My tokenized dataset has the following columns: ['text', 'input_ids', 'attention_mask', 'token_type_ids']. However, the fine-tuning step seems to expect a ['label'] or target column. Since my dataset has no such column, my backward() call in training keeps failing.

Can someone help me clarify whether these features (label, target…) are task-dependent? And if so, how would one go about creating them during tokenization?

courtneysprouse131 · June 27, 2022, 3:08pm

This depends on whether you want a supervised or unsupervised model. Most models assume supervised training, i.e. for each sample of input data you have the correct answer (the label column). It sounds like you may have an unsupervised (unlabeled) dataset. So for your training to work, you either need an unsupervised model or you need to supply labels for your dataset (in your case, what the generated code should look like for a given set of inputs).
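For a causal (left-to-right) language model, "supplying the labels" usually means copying input_ids into a labels column, because Hugging Face causal LM heads shift the labels by one position internally when computing the loss. Below is a minimal sketch of that step in plain Python, assuming the batch is a dict of lists like the one a tokenizer returns. The function name add_causal_lm_labels and the example token IDs are hypothetical, and padded positions are set to -100 so the loss ignores them.

```python
# Sketch only: add a "labels" column for causal language modeling.
# Assumption: `batch` is a dict of lists with "input_ids" and "attention_mask".

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def add_causal_lm_labels(batch):
    """Copy input_ids into labels, replacing padded positions with -100."""
    labels = []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        # Keep the token ID where attention_mask is 1, ignore it elsewhere.
        labels.append([tok if m == 1 else IGNORE_INDEX for tok, m in zip(ids, mask)])
    batch["labels"] = labels
    return batch


# Made-up token IDs, padded with 0 to a common length.
example = {
    "input_ids": [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]],
}
example = add_causal_lm_labels(example)
print(example["labels"])
```

With a 🤗 Datasets object, a function of this shape could be applied with something like tokenized.map(add_causal_lm_labels, batched=True). Note that the model's forward pass expects the column to be named labels (plural).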

ablam · June 27, 2022, 4:33pm

Oh, I see. Yes, I duplicated my input_ids column to create the labels column, but I’m not sure that gives the model the target it needs, unless the model masks certain tokens in the input during training and then learns to generate outputs that match the labels column I duplicated. I’ll check what kind of model it is and go from there. Thank you.
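The masking idea mentioned here corresponds to masked language modeling (the BERT-style objective). In that setup the masking is normally applied to the inputs by a data collator (in 🤗 Transformers, DataCollatorForLanguageModeling with mlm=True), not by the model itself, and the labels hold the original token only at masked positions. The following is a simplified sketch of that idea in plain Python, not the library's implementation: the function name mask_tokens, the mask token ID 103, and the token IDs are hypothetical, and it skips refinements such as protecting special tokens or occasionally substituting random tokens.

```python
import random

# Sketch only: simplified masked-language-modeling targets.
IGNORE_INDEX = -100  # positions with this label are ignored by the loss


def mask_tokens(input_ids, attention_mask, mask_token_id, mask_prob=0.15, seed=0):
    """Return (masked_inputs, labels) for one tokenized sequence."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked_inputs, labels = [], []
    for tok, m in zip(input_ids, attention_mask):
        if m == 1 and rng.random() < mask_prob:
            masked_inputs.append(mask_token_id)  # hide the token from the model
            labels.append(tok)                   # ...and ask it to predict it
        else:
            masked_inputs.append(tok)            # token stays visible
            labels.append(IGNORE_INDEX)          # no loss at this position
    return masked_inputs, labels


ids = [11, 12, 13, 14, 15, 16, 0, 0]   # made-up IDs, padded with 0
attn = [1, 1, 1, 1, 1, 1, 0, 0]
masked, labels = mask_tokens(ids, attn, mask_token_id=103, mask_prob=0.5)
print(masked)
print(labels)
```

For a causal model, by contrast, no masking of this kind is involved, and a plain copy of input_ids as labels is the usual setup.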
