I’m tokenizing a custom dataset to fine-tune a model for code generation. My tokenized dataset has the columns ['text', 'input_ids', 'attention_mask', 'token_type_ids'], but the fine-tuning step expects a 'labels' (target) column. Since no such column exists, my backward() call keeps failing during training.
Can someone clarify whether these features (label, target…) are task-dependent? And if so, how would one add them during tokenization?
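For context, my preprocessing looks roughly like this (the checkpoint name, file path and sequence length below are placeholders, not my real setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset path and checkpoint
raw = load_dataset("json", data_files="my_code_dataset.json")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Produces input_ids, attention_mask and token_type_ids -- but no labels column
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = raw.map(tokenize, batched=True)
print(tokenized["train"].column_names)
# ['text', 'input_ids', 'attention_mask', 'token_type_ids']
```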
That depends on whether you want supervised or unsupervised training. Most models assume a supervised setup, i.e. for a given sample of input data you have the correct answer (a labels column). It sounds like you may have an unlabelled dataset. So for your training to work, you either need an unsupervised objective or you need to supply labels for your dataset (in your case, what the generated code should look like for a given set of inputs).
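If what you're after is plain next-token prediction on code (a causal language model), a common way to supply the labels is simply to copy input_ids into a labels column during tokenization; the model shifts them internally when computing the loss. A sketch, reusing the tokenizer and column names assumed from your description:

```python
def tokenize_with_labels(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
    # For causal LM training the targets are the inputs themselves;
    # the model handles the one-position shift when computing the loss.
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = raw.map(tokenize_with_labels, batched=True)
```

Note that padding tokens in labels should usually be ignored by the loss (set to -100); a data collator such as DataCollatorForLanguageModeling with mlm=False handles that for you.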
Oh, I see. Yes, I duplicated my input_ids column to create the labels column, but I’m not sure that gives the model the targets it needs, unless the model masks certain tokens from the input and learns to reproduce the labels column I duplicated during training. I’ll check what kind of model it is and go from there. Thank you.
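For my own reference, this is how I currently understand the two cases (just a sketch, not something I've verified against my model yet):

```python
from transformers import DataCollatorForLanguageModeling

# Decoder-only (GPT-style) model: duplicating input_ids as labels is the standard
# setup; with mlm=False the collator also replaces padding tokens in the labels
# with -100 so they are ignored by the loss. (A GPT-2 tokenizer needs a pad token,
# e.g. tokenizer.pad_token = tokenizer.eos_token.)
causal_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Encoder (BERT-style) model: the collator masks random input tokens itself, and
# only those masked positions contribute to the loss.
masked_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```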