`run_glue.py` with my own dataset of one-sentence input

Hello,

This post is related to two GitHub issues on huggingface/transformers: #9393 (`run_glue.py` fails when using my own dataset for a regression task) and #9442 ([examples/text-classification] `do_predict` for the test set of local datasets).

While I was writing the text to open an issue, I realized it was probably a simple mistake on my part.
If anyone can explain the details, I would appreciate your comments.

Information

Model I am using (Bert, XLNet …): Bert

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

My script is almost the same as run_glue.py, but with some modifications to the evaluation metrics and support for a test set.

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

It seems that an error occurs when I use run_glue.py with my own dataset for a regression task.

# --test_file and --do_predict are added for issue #9442
CUDA_VISIBLE_DEVICES=0 python <my_modified_run_glue.py> \
  --model_name_or_path bert-base-cased \
  --train_file data/****.csv \
  --validation_file data/****.csv \
  --test_file data/****.csv \
  --do_train \
  --do_eval \
  --do_predict \
  --max_seq_length 64 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 10.0 \
  --load_best_model_at_end \
  --evaluation_strategy epoch \
  --metric_for_best_model eval_pearson \
  --output_dir **** \
  --overwrite_output_dir

An example of the train/valid CSV file is as below:

id,label,sentence1
__id_as_string__,3.0,__string__
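To make the expected layout concrete, here is a stdlib-only sketch (with made-up example values, since the actual file contents are elided above) of how such a CSV parses; run_glue.py itself loads the file via the `datasets` library, which takes the header row as the column names:

```python
import csv
import io

# Hypothetical CSV matching the layout above (id, label, sentence1).
csv_text = "id,label,sentence1\nexample-001,3.0,This is an example sentence.\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(list(rows[0].keys()))     # column names the loader sees
print(float(rows[0]["label"]))  # regression label parsed as a float
```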

Then, the trainer gives me the information below.

[INFO|trainer.py:387] 2021-01-07 12:52:02,202 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence1.
[INFO|trainer.py:387] 2021-01-07 12:52:02,204 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence1.

Expected behavior

It is natural that the id column is ignored, but I don't understand why sentence1 is ignored.

I checked again the task_to_keys in the original script:

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Should I use "sentence" instead of "sentence1" if there is only one sentence in the input (in other words, if sentence2 is None)?

Thank you in advance.

I’ve changed sentence1 to sentence, but almost the same info appears:

[INFO|trainer.py:387] 2021-01-07 13:22:18,233 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, id.
[INFO|trainer.py:387] 2021-01-07 13:22:18,233 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, id.

Is it related to the code snippet below?

    # Preprocessing the datasets
    if data_args.task_name is not None:
        sentence1_key, sentence2_key = task_to_keys[data_args.task_name]
    else:
        # Again, we try to have some nice defaults but don't hesitate to tweak to your use case.
        non_label_column_names = [name for name in datasets["train"].column_names if name != "label"]
        if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
            sentence1_key, sentence2_key = "sentence1", "sentence2"
        else:
            if len(non_label_column_names) >= 2:
                sentence1_key, sentence2_key = non_label_column_names[:2]
            else:
                sentence1_key, sentence2_key = non_label_column_names[0], None

Should I change the order of the columns?
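Tracing that default logic by hand for a CSV with columns id, label, sentence1 suggests why the column order matters. This is a stand-alone sketch of the branching above, not the actual script:

```python
# Columns of the CSV shown earlier: id, label, sentence1.
column_names = ["id", "label", "sentence1"]
non_label_column_names = [name for name in column_names if name != "label"]

# Same branching as the run_glue.py snippet above.
if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
    sentence1_key, sentence2_key = "sentence1", "sentence2"
elif len(non_label_column_names) >= 2:
    sentence1_key, sentence2_key = non_label_column_names[:2]
else:
    sentence1_key, sentence2_key = non_label_column_names[0], None

# With no "sentence2" column, the fallback takes the first two non-label
# columns in order, so "id" ends up as sentence1_key.
print(sentence1_key, sentence2_key)  # id sentence1
```

Under this ordering, the id column would be fed to the tokenizer as the first sentence, which is presumably why reordering the columns (or special-casing id/idx) changes the result.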

I made the following changes to match sst2.

    # Preprocessing the datasets
    if data_args.task_name is not None:
        sentence1_key, sentence2_key = task_to_keys[data_args.task_name]
    else:
        # Again, we try to have some nice defaults but don't hesitate to tweak to your use case.
        non_label_column_names = [name for name in datasets["train"].column_names if name != "label"]
        if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
            sentence1_key, sentence2_key = "sentence1", "sentence2"
        else:
            if len(non_label_column_names) >= 2:
                sentence1_key, sentence2_key = non_label_column_names[:2]
                if sentence2_key == "id" or sentence2_key == "idx":
                    sentence2_key = None
            else:
                sentence1_key, sentence2_key = non_label_column_names[0], None

    print(f"sentence1_key {sentence1_key}")
    print(f"sentence2_key {sentence2_key}")

The print statements output the following:

sentence1_key sentence
sentence2_key None

I think it is the same state as the “sst2” task keys.

However, the same information remains.

[INFO|trainer.py:387] 2021-01-07 13:42:04,003 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence.
[INFO|trainer.py:387] 2021-01-07 13:42:04,003 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence.

The training starts, but I don’t know whether it is correctly using the data.
Does anyone have an idea?

Excuse me for my frequent posting.

The Trainer is ignoring those columns after the preprocessing because the model can’t read raw text. It needs the input_ids, attention_mask etc. that are generated by the preprocessing. So this is not an error (it’s an info log, not a warning) but completely expected.
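As a rough illustration (not the actual Trainer code), the column-dropping step can be sketched like this: the Trainer inspects the model's forward signature and ignores any dataset column whose name doesn't match an argument:

```python
import inspect

# Simplified stand-in for BertForSequenceClassification.forward; the real
# signature has more arguments, but these are the relevant ones here.
def forward(input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
    pass

# The Trainer also keeps the label columns regardless of the signature.
signature_columns = list(inspect.signature(forward).parameters) + ["label", "label_ids"]

# Columns after preprocessing: the raw columns plus the tokenizer's outputs.
dataset_columns = ["id", "sentence1", "label", "input_ids", "attention_mask", "token_type_ids"]
ignored = [c for c in dataset_columns if c not in signature_columns]
print(ignored)  # ['id', 'sentence1'] -- the raw columns the model cannot consume
```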


Thank you for your answer.

I’m relieved to hear that my usage didn’t cause an error.
I now understand that the original columns remain after preprocessing; the preprocessed columns are then used and the original columns are ignored, as expected.

It seems the preprocessing of the datasets is done in the function below, and I would like to read the documentation more thoroughly to better understand the behavior of the tokenizer (and the map function).

Thank you!

    def preprocess_function(examples):
        # Tokenize the texts
        args = (
            (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
        )
        result = tokenizer(*args, padding=padding, max_length=max_length, truncation=True)

        # Map labels to IDs (not necessary for GLUE tasks)
        if label_to_id is not None and "label" in examples:
            result["label"] = [label_to_id[l] for l in examples["label"]]
        return result

    datasets = datasets.map(preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache)
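A toy, tokenizer-free sketch of what `map(batched=True)` does may help here (made-up example values, and a fake tokenizer instead of the real one): the function receives each column as a list, and the columns it returns are added alongside the originals, which is why sentence and id still exist (and get ignored) afterwards.

```python
# A batch as map(batched=True) would pass it: each column is a list.
examples = {"id": ["a", "b"], "sentence": ["hello world", "good day"], "label": [3.0, 1.5]}

def preprocess_function(examples):
    # Stand-in for the tokenizer: one fake integer id per whitespace token.
    input_ids = [[len(word) for word in text.split()] for text in examples["sentence"]]
    return {"input_ids": input_ids}

# map() merges the returned columns with the existing ones.
result = {**examples, **preprocess_function(examples)}
print(sorted(result.keys()))  # ['id', 'input_ids', 'label', 'sentence']
```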

@yusukemori @sgugger have you solved the problem? I’m getting this message too, and my training loss is not being detected.

Hi @jhonsonlee ,
Yes, the problem has been solved. There was no error in the first place; I was just worried by the info log, but the script was working fine.

Could you please check if sentence1_key and sentence2_key are set as you expected?

Also, the code I used is an old version, so if you tell me which version you are using, I may be able to look into it in more detail.