`run_glue.py` with my own dataset of one-sentence input

Hello,

This post is related to "`run_glue.py` fails when using my own dataset of regression task" (Issue #9393 · huggingface/transformers · GitHub) and "[examples/text-classification] `do_predict` for the test set of local datasets" (Issue #9442 · huggingface/transformers · GitHub).

While I was writing the text to open an issue, I realized that it seemed to be a simple mistake on my part.
If anyone can shed light on the details, I would appreciate your comments.

Information

Model I am using (Bert, XLNet …): Bert

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

Almost the same as `run_glue.py`, but with some modifications to the evaluation metrics and the use of test sets.

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

It seems that an error occurs when I use `run_glue.py` with my own dataset for a regression task.

CUDA_VISIBLE_DEVICES=0 python <my_modified_run_glue.py> \
  --model_name_or_path bert-base-cased \
  --train_file data/****.csv \
  --validation_file data/****.csv \
  --test_file data/****.csv \
  --do_train \
  --do_eval \
  --do_predict \
  --max_seq_length 64 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 10.0 \
  --load_best_model_at_end \
  --evaluation_strategy epoch \
  --metric_for_best_model eval_pearson \
  --output_dir **** \
  --overwrite_output_dir

(`--test_file` and `--do_predict` were added in relation to issue #9442.)

An example of the train/valid CSV file is as below:

id,label,sentence1
__id_as_string__,3.0,__string__
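As a quick sanity check of how such a file is read, here is a minimal stand-in using only the standard library (the values are placeholders I made up; the real loading goes through the `datasets` library's CSV loader, which likewise takes its column names from the header row):

```python
import csv
import io

# A miniature stand-in for the train/valid CSV described above.
sample_csv = "id,label,sentence1\nexample_001,3.0,This is a sentence.\n"

reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

# The header row becomes the dataset's column names.
print(reader.fieldnames)    # ['id', 'label', 'sentence1']
print(rows[0]["label"])     # '3.0' (a string; cast to float for regression)
```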

Then, the trainer gives me the information below.

[INFO|trainer.py:387] 2021-01-07 12:52:02,202 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence1.
[INFO|trainer.py:387] 2021-01-07 12:52:02,204 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence1.

Expected behavior

It is natural that the `id` column is ignored, but I didn't know why `sentence1` was ignored.

I checked the `task_to_keys` in the original script again:

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Should I use "sentence" instead of "sentence1" if there is only one sentence in the input (in other words, `sentence2` is `None`)?

Thank you in advance.

I've changed `sentence1` to `sentence`, but almost the same info appears:

[INFO|trainer.py:387] 2021-01-07 13:22:18,233 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, id.
[INFO|trainer.py:387] 2021-01-07 13:22:18,233 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, id.

Is it related to the code snippet below?

    # Preprocessing the datasets
    if data_args.task_name is not None:
        sentence1_key, sentence2_key = task_to_keys[data_args.task_name]
    else:
        # Again, we try to have some nice defaults but don't hesitate to tweak to your use case.
        non_label_column_names = [name for name in datasets["train"].column_names if name != "label"]
        if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
            sentence1_key, sentence2_key = "sentence1", "sentence2"
        else:
            if len(non_label_column_names) >= 2:
                sentence1_key, sentence2_key = non_label_column_names[:2]
            else:
                sentence1_key, sentence2_key = non_label_column_names[0], None
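To see which keys that fallback actually picks, the logic can be extracted into a small standalone function (a hypothetical helper, not part of `run_glue.py`). Notably, for a header like `id,label,sentence1` it pairs `id` with `sentence1` as if they were a sentence pair, which may not be what one wants:

```python
def guess_sentence_keys(column_names):
    """Reproduces run_glue.py's fallback for datasets without a GLUE task name."""
    non_label = [name for name in column_names if name != "label"]
    if "sentence1" in non_label and "sentence2" in non_label:
        return "sentence1", "sentence2"
    if len(non_label) >= 2:
        return non_label[0], non_label[1]
    return non_label[0], None

# With the CSV header id,label,sentence1 the fallback pairs "id" with "sentence1":
print(guess_sentence_keys(["id", "label", "sentence1"]))  # ('id', 'sentence1')
```

So the order (and presence) of the extra columns does matter to the default key detection.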

Should I change the order of the columns?

I made the following changes to match sst2.

    # Preprocessing the datasets
    if data_args.task_name is not None:
        sentence1_key, sentence2_key = task_to_keys[data_args.task_name]
    else:
        # Again, we try to have some nice defaults but don't hesitate to tweak to your use case.
        non_label_column_names = [name for name in datasets["train"].column_names if name != "label"]
        if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
            sentence1_key, sentence2_key = "sentence1", "sentence2"
        else:
            if len(non_label_column_names) >= 2:
                sentence1_key, sentence2_key = non_label_column_names[:2]
                if sentence2_key == "id" or sentence2_key == "idx":
                    sentence2_key = None
            else:
                sentence1_key, sentence2_key = non_label_column_names[0], None

    print(f"sentence1_key {sentence1_key}")
    print(f"sentence2_key {sentence2_key}")

The print statements output the following:

sentence1_key sentence
sentence2_key None
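The modified fallback can be checked in isolation the same way (again as a hypothetical standalone helper). Assuming the column order matches the log above (`sentence, id`), it reproduces the printed keys; note that with `id` first the patch would not help:

```python
def guess_sentence_keys_patched(column_names):
    """The modified fallback: drop `id`/`idx` if it lands in the second slot."""
    non_label = [name for name in column_names if name != "label"]
    if "sentence1" in non_label and "sentence2" in non_label:
        return "sentence1", "sentence2"
    if len(non_label) >= 2:
        key1, key2 = non_label[:2]
        if key2 in ("id", "idx"):
            key2 = None
        return key1, key2
    return non_label[0], None

# Column order as in the log ("sentence, id") gives a single-sentence setup:
print(guess_sentence_keys_patched(["sentence", "id", "label"]))  # ('sentence', None)
# But with "id" first, the id strings would still be tokenized as text:
print(guess_sentence_keys_patched(["id", "sentence", "label"]))  # ('id', 'sentence')
```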

I think it is the same state as the "sst2" task keys.

However, the same information remains.

[INFO|trainer.py:387] 2021-01-07 13:42:04,003 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence.
[INFO|trainer.py:387] 2021-01-07 13:42:04,003 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, sentence.

The training starts, but I don't know whether it is correctly using the data.
Does anyone have an idea?

Excuse me for my frequent posting.

The Trainer is ignoring those columns after the preprocessing because the model can't read raw text. It needs the `input_ids`, `attention_mask`, etc. that are generated by the preprocessing. So this is not an error (it's an info log, not a warning) but completely expected.
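A minimal sketch of what happens, in plain Python with no libraries (the `datasets.map` call with `batched=True` adds new columns while keeping the originals; the whitespace "tokenizer" here is a made-up stand-in):

```python
# A batch as datasets represents it: one list per column.
batch = {"id": ["a", "b"], "sentence1": ["hello world", "foo bar"]}

def fake_preprocess(examples):
    # Stand-in for the tokenizer: "encode" each word as its length.
    return {"input_ids": [[len(w) for w in s.split()] for s in examples["sentence1"]]}

# map-with-batched=True behavior: merge the new columns into the batch.
batch.update(fake_preprocess(batch))

# All three columns now coexist; the Trainer later drops the ones that
# the model's forward() doesn't accept (id, sentence1) and keeps input_ids.
print(sorted(batch))  # ['id', 'input_ids', 'sentence1']
```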


Thank you for your answer.

I'm relieved to hear that my usage didn't cause an error.
I now understand that the original columns remain after preprocessing; the preprocessed columns are then used, and the original columns are ignored as expected.

It seems the preprocessing of the datasets is done in the function below, and I would like to read the documentation more thoroughly to better understand the behavior of the tokenizer (and the `map` function).

Thank you!

    def preprocess_function(examples):
        # Tokenize the texts
        args = (
            (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
        )
        result = tokenizer(*args, padding=padding, max_length=max_length, truncation=True)

        # Map labels to IDs (not necessary for GLUE tasks)
        if label_to_id is not None and "label" in examples:
            result["label"] = [label_to_id[l] for l in examples["label"]]
        return result

    datasets = datasets.map(preprocess_function, batched=True, load_from_cache_file=not data_args.overwrite_cache)
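For reference, the `args` tuple built in `preprocess_function` is what switches the tokenizer between single-sentence and sentence-pair mode. A toy illustration (the counting helper is invented for demonstration; the real tokenizer accepts one or two text sequences as positional arguments):

```python
def count_text_args(examples, sentence1_key, sentence2_key):
    """Mirrors the tuple-building in preprocess_function: one text arg or two."""
    args = (
        (examples[sentence1_key],) if sentence2_key is None
        else (examples[sentence1_key], examples[sentence2_key])
    )
    return len(args)  # the tokenizer would receive this many text sequences

single = count_text_args({"sentence": ["a", "b"]}, "sentence", None)
pair = count_text_args(
    {"sentence1": ["a"], "sentence2": ["b"]}, "sentence1", "sentence2"
)
print(single, pair)  # 1 2
```

So with `sentence2_key = None`, the tokenizer sees a single list of sentences, which is the single-sentence (sst2-like) setup.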