How to specify labels-column in BERT

Hi,
I’m trying to follow the Hugging Face Datasets tutorial to fine-tune a BERT model on a custom dataset for sentiment analysis. The quicktour states:

rename our label column in labels which is the expected input name for labels in BertForSequenceClassification.

In the docs for to_tf_dataset it states:

label_cols – Dataset column(s) to load as labels. Note that many models compute loss internally rather than letting Keras do it, in which case it is not necessary to actually pass the labels here, as long as they’re in the input columns.

I am uncertain whether label_cols can be used to specify labels for differently named columns, or whether it is only possible to pass labels with a column named label inside the columns parameter?

Hi! It seems like you are following the old documentation. The new docs are available here: Datasets — datasets 1.17.0 documentation (this is an example with to_tf_dataset: Train with 🤗 Datasets — datasets 1.17.0 documentation).

I am uncertain whether label_cols can be used to specify labels for differently named columns, or whether it is only possible to pass labels with a column named label inside the columns parameter?

Yes, label_cols supports columns not necessarily named label.

cc @Rocketknight1

Thanks for your reply!
The new documentation also has this quote where label is renamed to labels. So if I have a dataset where the label column is not called labels, I have to specify it with label_cols, right?

I looked at this tutorial as well, where the label column is called label, and label_cols is not specified here. So these two tutorials seem to show conflicting information :confused:

Hi @fogx, this is a good question! Here’s what’s happening in to_tf_dataset: columns specifies the list of columns to be passed as the input to the model, and label_cols specifies the list of columns to be passed to Keras as the label. For most tasks (including sentiment analysis), you will usually only want one column to be passed here, in which case it doesn’t really matter what it’s called, because to_tf_dataset will only make the labels a dict when there are multiple label columns.
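
To make the single-column case concrete, here is a rough plain-Python sketch of the behaviour described above. It is only a model of the idea, not the actual to_tf_dataset implementation: the helper split_features_and_labels, the column name sentiment and the token ids are all made up for illustration, and keeping the features as a dict is a simplifying assumption.

```python
# Illustrative only: a plain-Python model of the behaviour described above,
# NOT the real implementation of Dataset.to_tf_dataset.


def split_features_and_labels(batch, columns, label_cols):
    """Split one batch (dict of column name -> values) into (features, labels).

    `split_features_and_labels` is a hypothetical helper for this example.
    """
    # Columns listed in `columns` become the model inputs (kept as a dict here).
    features = {name: batch[name] for name in columns}
    # Columns listed in `label_cols` become the labels handed to Keras.
    labels = {name: batch[name] for name in label_cols}
    if len(labels) == 1:
        # A single label column is unwrapped, so its name no longer matters.
        labels = next(iter(labels.values()))
    return features, labels


# Hypothetical batch whose label column is called "sentiment".
batch = {
    "input_ids": [[101, 2023, 102], [101, 2919, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "sentiment": [1, 0],
}

features, labels = split_features_and_labels(
    batch,
    columns=["input_ids", "attention_mask"],
    label_cols=["sentiment"],
)
print(sorted(features))
print(labels)
```

With one entry in label_cols the labels come out as a bare list of values, so the column could just as well be called label, labels or anything else. Only with two or more label columns would the names survive as dict keys.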

Sentiment analysis is an example of a ‘text classification’ task, so if you want a tutorial on that specifically, please take a look at this notebook or the colab link.