Questions about training BERT with two-column data

System Info


  • transformers version: 4.29.2
  • Platform: Linux-4.18.0-477.51.1.el8_8.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.5
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker
Hi, I have a question about how to train a BERT model with two input columns. My input format looks like this:

text1 text2 (column headers)
string1 string2
string1 string2
string1 string2
…

The file is line by line, but each row contains two paired strings. I wonder how to use run_mlm.py to read the two strings and train on them separately. Currently, if I try:

tokenizer(
    examples[text_col1],
    examples[text_col2],
    padding=padding,
    truncation=True,
    max_length=max_seq_length,
    # We use this option because DataCollatorForLanguageModeling (see below)
    # is more efficient when it receives the `special_tokens_mask`.
    return_special_tokens_mask=True,
)

The tokenized sequence looks like [CLS] string1 [SEP] string2. However, I would expect [CLS] string1 [CLS] string2. Or, if the [CLS] and [SEP] tokens can simply be ignored, will that affect my training? I intend to compute the string embeddings from the tokens.
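To make the question concrete, here is a minimal sketch of what I mean, building the [CLS] string1 [CLS] string2 layout by hand instead of using the pair-input call above. tokenize_with_two_cls is just a hypothetical name; text_col1, text_col2, tokenizer, and max_seq_length are the same variables as in the snippet above, and a BERT-style tokenizer is assumed:

def tokenize_with_two_cls(examples):
    # Build input_ids manually so each string keeps its own [CLS].
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    input_ids, special_masks = [], []
    for s1, s2 in zip(examples[text_col1], examples[text_col2]):
        ids1 = tokenizer(s1, add_special_tokens=False)["input_ids"]
        ids2 = tokenizer(s2, add_special_tokens=False)["input_ids"]
        ids = ([cls_id] + ids1 + [cls_id] + ids2 + [sep_id])[:max_seq_length]
        input_ids.append(ids)
        # DataCollatorForLanguageModeling uses this mask to avoid masking
        # the special tokens we inserted manually.
        special_masks.append(
            tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
        )
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "special_tokens_mask": special_masks,
    }

Would this be a reasonable replacement for the tokenize_function in run_mlm.py, given that DataCollatorForLanguageModeling pads dynamically and respects special_tokens_mask?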

Furthermore, when I evaluate the prediction outputs, I found that preds only covers the length of string1 (the first column), and the content of string2 is dropped. Is this caused by truncation? The shape of labels is (1000, 255), but preds is (1000, 127), and both are shorter than 512. Why do I get the error below? (A simplified sketch of my compute_metrics follows the traceback.) Thanks.

Traceback (most recent call last):
  File "/gpfs/radev/project/ying_rex/tl688/DNABERT/examples/run_mlm_contras.py", line 777, in <module>
    main()
  File "/gpfs/radev/project/ying_rex/tl688/DNABERT/examples/run_mlm_contras.py", line 743, in main
    metrics = trainer.evaluate()
  File "/gpfs/radev/project/ying_rex/tl688/dnabert2/lib/python3.10/site-packages/transformers/trainer.py", line 3029, in evaluate
    output = eval_loop(
  File "/gpfs/radev/project/ying_rex/tl688/dnabert2/lib/python3.10/site-packages/transformers/trainer.py", line 3318, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/gpfs/radev/project/ying_rex/tl688/DNABERT/examples/run_mlm_contras.py", line 693, in compute_metrics
    preds = preds[mask]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 127000 but corresponding boolean dimension is 255000
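
For reference, the masking in my compute_metrics is roughly the following (a simplified sketch, not the exact code; preds here are already argmax-ed token ids, as in the official script's preprocess_logits_for_metrics). The comments show where the shapes from the traceback come in:

def compute_metrics(eval_pred):
    preds, labels = eval_pred.predictions, eval_pred.label_ids
    preds = preds.reshape(-1)    # (1000, 127) -> 127000
    labels = labels.reshape(-1)  # (1000, 255) -> 255000
    mask = labels != -100        # score only the masked MLM positions
    preds = preds[mask]          # fails: boolean mask has length 255000
    labels = labels[mask]
    return {"accuracy": float((preds == labels).mean())}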

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Please check the run_mlm.py file; a rough sketch of how I launch my modified copy follows.
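
The launch command is roughly the following, using the standard run_mlm.py flags (data.csv is a hypothetical two-column file in the format described above, and the model name is a placeholder):

python run_mlm_contras.py \
    --model_name_or_path bert-base-uncased \
    --train_file data.csv \
    --validation_file data.csv \
    --max_seq_length 512 \
    --do_train --do_eval \
    --output_dir ./mlm_out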

Expected behavior

It should not report an error.