How to specify labels-column in BERT

fogx · January 13, 2022, 11:46am

Hi,
I’m trying to follow the Hugging Face datasets tutorial to fine-tune a BERT model on a custom dataset for sentiment analysis. The quicktour states:

rename our label column in labels which is the expected input name for labels in BertForSequenceClassification.

In the docs for to_tf_dataset it states:

label_cols – Dataset column(s) to load as labels. Note that many models compute loss internally rather than letting Keras do it, in which case it is not necessary to actually pass the labels here, as long as they’re in the input columns.

I am uncertain whether label_cols can be used to specify labels for differently named columns, or whether it is only possible to pass labels with a column named label inside the columns parameter?

mariosasko · January 19, 2022, 3:55pm

Hi! It seems like you are following the old documentation. The new docs are available here: https://huggingface.co/docs/datasets/master (this is an example with to_tf_dataset: https://huggingface.co/docs/datasets/master/use_dataset.html#tensorflow).

I am uncertain whether label_cols can be used to specify labels for differently named columns, or whether it is only possible to pass labels with a column named label inside the columns parameter?

Yes, label_cols supports columns not necessarily named label.

cc @Rocketknight1

fogx · January 20, 2022, 8:54am

Thanks for your reply!
The new documentation also has this quote where label is renamed to labels. So if I have a dataset where the label column is not called labels, I have to specify it with label_cols, right?

fogx · January 20, 2022, 9:14am

I looked at this tutorial as well, where the label column is called label and label_cols is not specified. So these two tutorials seem to show conflicting information:
https://huggingface.co/docs/transformers/master/custom_datasets

Rocketknight1 · January 20, 2022, 1:05pm

Hi @fogx, this is a good question! Here’s what’s happening in to_tf_dataset: columns specifies the list of columns to be passed as the input to the model, and label_cols specifies the list of columns to be passed to Keras as the label. For most tasks (including sentiment analysis), you will usually want only one column to be passed here. In that case it doesn’t really matter what the column is called, because to_tf_dataset will only make the labels a dict when there are multiple label columns.
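
To make the split between columns and label_cols more concrete, here is a minimal plain-Python sketch of the behaviour described above. It is not the actual to_tf_dataset implementation: split_batch, the example batch, and the column name sentiment are all hypothetical, and the features are always kept as a dict here for simplicity.

```python
# Illustrative sketch only: this is NOT the datasets implementation of
# to_tf_dataset, just a plain-Python model of the behaviour described above.


def split_batch(batch, columns, label_cols):
    """Split a batch (dict of column name -> values) into (features, labels)."""
    # Columns listed in `columns` become the model's input dict
    # (kept as a dict here for simplicity).
    features = {name: batch[name] for name in columns}
    if not label_cols:
        # No separate labels: the model is expected to compute its loss
        # internally from a label column that is included in `columns`.
        return features, None
    if len(label_cols) == 1:
        # One label column: return the bare values, so its name is irrelevant.
        return features, batch[label_cols[0]]
    # Several label columns: labels stay a dict keyed by column name.
    return features, {name: batch[name] for name in label_cols}


# Hypothetical batch whose label column is called "sentiment", not "labels".
batch = {
    "input_ids": [[101, 2023, 102], [101, 2919, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "sentiment": [1, 0],
}

features, labels = split_batch(
    batch, columns=["input_ids", "attention_mask"], label_cols=["sentiment"]
)
print(sorted(features))  # names of the feature columns only
print(labels)            # bare list of label values, no column name attached
```

With the real library, the corresponding call would presumably pass columns=["input_ids", "attention_mask"] and label_cols=["sentiment"] to to_tf_dataset, alongside the other arguments your installed version requires (such as batch size and shuffling); please check the signature in the docs for your version.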

Sentiment analysis is an example of a ‘text classification’ task, so if you want a tutorial on that specifically, please take a look at this notebook or the colab link.
