Hi, I am using the Trainer class to perform masked language modeling with a pretrained BERT checkpoint (fine-tuning on my own domain). I'm using the official run_language_modeling.py script from https://github.com/huggingface/transformers/tree/master/examples/language-modeling. I merely changed it to use a different dataset implementation, since the default ones load the full data into memory. But __getitem__ works the same way, returning one sequence of token IDs at a time, so this should not matter.
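For context, the custom dataset is basically a lazy, line-by-line version of the default one. This is only a simplified sketch of the idea, not my exact code (the class name and details here are made up for illustration):

import linecache
from torch.utils.data import Dataset

class LazyLineByLineDataset(Dataset):
    """Reads one line of the text file at a time instead of loading everything into memory."""

    def __init__(self, tokenizer, file_path, block_size):
        self.tokenizer = tokenizer
        self.file_path = file_path
        self.block_size = block_size
        # Count lines once so __len__ is cheap.
        with open(file_path, encoding="utf-8") as f:
            self.num_lines = sum(1 for _ in f)

    def __len__(self):
        return self.num_lines

    def __getitem__(self, idx):
        # linecache is 1-indexed.
        line = linecache.getline(self.file_path, idx + 1).strip()
        # Same output format as the default LineByLineTextDataset:
        # a 1-D tensor of token ids, truncated to block_size.
        return self.tokenizer(
            line,
            add_special_tokens=True,
            truncation=True,
            max_length=self.block_size,
            return_tensors="pt",
        )["input_ids"].squeeze(0)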
Even when setting the MLM probability to something ridiculous like 0.999, this is what I observe: the loss becomes zero rather quickly (the full dataset would be 40k training steps).
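For what it's worth, my understanding is that --mlm_probability is just passed through to DataCollatorForLanguageModeling, so a value near 1.0 should mask practically every token and leave the model almost nothing to predict from. Roughly what I mean (the sequence here is made up, only the checkpoint name matches my command):

from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.9999
)

example = tokenizer("M K T A Y I A K Q R", return_tensors="pt")["input_ids"].squeeze(0)
batch = collator([example])
# With mlm_probability this high, nearly every input position is replaced by [MASK]
# (or a random token), and nearly every label position holds a real token id,
# so the masked positions have to be predicted with almost no visible context.
print(batch["input_ids"][0])
print(batch["labels"][0])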
This is the command I'm using, running on Google Cloud:
python3 xla_spawn.py --num_cores=8 train_mlm.py \
--output_dir=real_runs/6 \
--model_type=bert \
--model_name_or_path=Rostlab/prot_bert \
--do_train \
--train_data_file=/home/preprocessed_allorgs_alllengths.txt \
--mlm \
--line_by_line \
--block_size 512 \
--max_steps 30000 \
--out_of_core \
--logging_steps 20 \
--learning_rate 0.00001 \
--per_device_train_batch_size 20 \
--lazy \
--run_name high_mlm_prob \
--save_steps 2000 \
--warmup_steps 2666 \
--weight_decay 0.01 \
--mlm_probability=0.9999
(train_mlm.py is run_language_modeling.py with a custom dataset class, and --lazy is an arg for that dataset. The rest is the default script; I didn't change anything else about it.)
Am I missing something with regard to the setup? This just seems wrong in general, and not like an issue with the dataset.