Hi, I am using the Trainer class to perform masked language modeling with a pretrained BERT checkpoint (fine-tuning on my own domain). I'm using the official run_language_modeling.py script from https://github.com/huggingface/transformers/tree/master/examples/language-modeling. The only change I made is a different dataset implementation, since the default ones load the full data into memory. Its __getitem__ works the same way, returning one sequence of token ids at a time, so this should not matter.
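In case it helps, this is roughly what the dataset looks like (a simplified sketch, not my exact code; the class name and the byte-offset indexing are just placeholders for how the lazy loading works):

```python
import torch
from torch.utils.data import Dataset


class LazyLineByLineDataset(Dataset):
    """Reads one line per example instead of loading the whole file into memory."""

    def __init__(self, tokenizer, file_path, block_size):
        self.tokenizer = tokenizer
        self.file_path = file_path
        self.block_size = block_size
        # Index the byte offset of every line once, so __getitem__ can seek
        # straight to a line without keeping the file contents in memory.
        self.offsets = []
        with open(file_path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8").strip()
        # Same output format as LineByLineTextDataset: one tensor of token ids.
        ids = self.tokenizer.encode(line, max_length=self.block_size, truncation=True)
        return torch.tensor(ids, dtype=torch.long)
```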
Even when setting the MLM probability to something ridiculous like 0.999, this is what I observe: the loss becomes zero rather quickly (the full dataset would be 40k training steps). With nearly every token masked, the model has almost no context to condition on, so I would expect the loss to stay high rather than collapse to zero.
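As a sanity check that masking is actually applied at that rate, inspecting the collator output should show almost every position being selected for the loss (a minimal sketch; the example sequence is just illustrative, while the checkpoint and probability match my run):

```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.9999
)

# prot_bert expects space-separated amino acids; this sequence is illustrative.
example = torch.tensor(tokenizer.encode("M K T A Y I A K Q R"))
batch = collator([example])

# Positions with labels != -100 are the ones the MLM loss is computed on.
# With mlm_probability near 1, that should be almost every non-special token
# (only ~80% of them become [MASK] due to the 80/10/10 replacement rule).
num_targets = (batch["labels"] != -100).sum().item()
num_masked = (batch["input_ids"] == tokenizer.mask_token_id).sum().item()
print(f"loss positions: {num_targets}, [MASK] tokens: {num_masked}")
```

If that looks right, the collator side is fine and the problem must be elsewhere in the setup.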
This is the command I'm using, running on Google Cloud:
```bash
python3 xla_spawn.py --num_cores=8 train_mlm.py \
  --output_dir=real_runs/6 \
  --model_type=bert \
  --model_name_or_path=Rostlab/prot_bert \
  --do_train \
  --train_data_file=/home/preprocessed_allorgs_alllengths.txt \
  --mlm \
  --line_by_line \
  --block_size 512 \
  --max_steps 30000 \
  --out_of_core \
  --logging_steps 20 \
  --learning_rate 0.00001 \
  --per_device_train_batch_size 20 \
  --lazy \
  --run_name high_mlm_prob \
  --save_steps 2000 \
  --warmup_steps 2666 \
  --weight_decay 0.01 \
  --mlm_probability=0.9999
```
(train_mlm.py is run_language_modeling.py with a custom dataset class; --lazy is an arg for that dataset. The rest is default, I didn't change anything about it.)
Am I missing something with regard to the setup? This just seems wrong in general, and not like an issue with the dataset.