I’m working on a project where I need to create a fine-tuned model that can take a sentence as input and output scores for a group of labels. I have a labeled dataset with 10,000 records, but I’m unsure how to handle the label columns. Specifically, I need guidance on how to convert these labels into numerical format and which type of model would be suitable for this task.
Any help or suggestions would be greatly appreciated. Thank you!
Thank you, that was really helpful! However, my dataset contains around 255 unique labels. Converting all of them to numerical values isn’t practical. How can I address this issue?
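With a few hundred unique labels, the label-to-number mapping doesn’t have to be written by hand; it can be built from the label column itself. Below is a minimal sketch in plain Python, assuming each record has exactly one label. The helper name `build_label_maps` and the sample rows are made up for illustration. The resulting dictionaries are the kind of thing you would usually pass as `label2id` / `id2label` (together with `num_labels`) when loading a sequence classification model in `transformers`.

```python
def build_label_maps(labels):
    """Build label <-> id mappings from a column of string labels.

    Sorting the unique labels keeps the mapping stable across runs.
    """
    unique = sorted(set(labels))
    label2id = {name: idx for idx, name in enumerate(unique)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


# Hypothetical sample rows: (sentence, label)
rows = [
    ("The battery died after a week", "hardware"),
    ("I was charged twice this month", "billing"),
    ("The screen cracked on arrival", "hardware"),
    ("My parcel never showed up", "shipping"),
]

label2id, id2label = build_label_maps(label for _, label in rows)

# Numeric version of the label column, in the original row order
label_ids = [label2id[label] for _, label in rows]
```

The same code works unchanged for 255 labels, since the mapping grows with the number of unique values in the column.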
I think this is a zero-shot classification task, but I’m not sure.
Thanks, I’ve converted the labels to numerical values. Now I’m wondering whether it’s possible to fine-tune a zero-shot classification model. If so, could you please share a link or guide me on how to do it? Or must I use the pipeline directly, without fine-tuning?
Thank you for the information. I believe the first link is for text classification and isn’t specifically designed for zero-shot classification. Can you clarify whether there’s any difference in how the two are fine-tuned? I want a zero-shot model.
Also, when fine-tuning a zero-shot classification model, what should the dataset format be? Should it be a labeled dataset with specific labels, or should it follow a specific structure, such as having two columns (premise and hypothesis) with labels like 1 (neutral), 0 (entailment), and 2 (contradiction)?
https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681/14#:~:text=Thanks%20for%20the,add%20a%20bit%3A
The link above describes how to fine-tune zero-shot models. The difference between fine-tuning a sequence classifier and fine-tuning an NLI model is that the former uses a custom number of labels and takes only the sequence to classify, whereas the latter uses three labels in a specific order and takes the sequence to classify followed by the putative entailment (the hypothesis), as specified in the link above. That link also suggests using zero-shot classification only if you don’t have enough labeled data.
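To make the premise/hypothesis format concrete, here is a rough sketch of how an ordinary labeled dataset could be turned into NLI-style pairs: each sentence yields one entailment pair using its true label and one contradiction pair using a randomly chosen wrong label. Several things here are assumptions: the helper name `to_nli_pairs`, the sample data, the hypothesis template, and especially the numeric ids for entailment/neutral/contradiction, which are taken from the question above. The actual ids depend on the NLI checkpoint you start from, so check its `config.label2id` before training.

```python
import random

# Assumed label ids (as in the question above); verify them against
# the label2id mapping of the NLI checkpoint you fine-tune.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def to_nli_pairs(records, all_labels, template="This example is {}.", seed=0):
    """Convert (text, label) records into premise/hypothesis rows.

    For every record, emit one entailment row (true label) and one
    contradiction row (a randomly sampled different label).
    """
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    pairs = []
    for text, true_label in records:
        pairs.append({
            "premise": text,
            "hypothesis": template.format(true_label),
            "label": ENTAILMENT,
        })
        wrong_label = rng.choice([l for l in all_labels if l != true_label])
        pairs.append({
            "premise": text,
            "hypothesis": template.format(wrong_label),
            "label": CONTRADICTION,
        })
    return pairs


# Hypothetical sample data
records = [
    ("I was charged twice this month", "billing"),
    ("My parcel never showed up", "shipping"),
]
all_labels = ["billing", "hardware", "shipping"]

pairs = to_nli_pairs(records, all_labels)
```

Each row in `pairs` can then be tokenized as a sentence pair (premise first, hypothesis second). No neutral rows are generated in this sketch, so `NEUTRAL` is only defined to show the full three-label scheme.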