How to train from scratch with run_mlm.py, .txt file?

Hello! Essentially what I want to do is: point the code at a .txt file, and get a trained model out. How can I use run_mlm.py to do this? I’d be satisfied if someone could help me figure out how to even just recreate the EsperBERTo tutorial.

I’m getting bogged down in flags, trying to load tokenizers, errors, etc.

What I’ve done so far: I managed to run through the EsperBERTo tutorial (here), and now I’m trying to do the same thing with run_mlm.py.

I went to the examples at transformers/examples/pytorch/language-modeling at master · huggingface/transformers · GitHub first, and I’ve been attempting to adapt those.

I keep running into problems with the tokenizer.

  • Apparently run_mlm.py requires you to specify a tokenizer when you are training from scratch.
  • If you give it --model_type roberta and --tokenizer_name <path to the vocab.json and merges.txt>, it complains about not being able to find a config.json.
  • If you then put a config.json for the model (not the tokenizer) in that folder, the error goes away. But why is it looking in the tokenizer folder for the model config? And why does it need a model config at all, when I already told it the model type?

What I finally did was:

  • made a folder called EsperBERTo. In there I put the vocab.json and merges.txt from the Colab tutorial, and a config.json that I found for RobertaForMaskedLM here
  • ran the following command:
python run_mlm.py \
    --model_type roberta \
    --tokenizer_name /path/to/EsperBERTo/ \
    --train_file /path/to/oscar.eo.txt \
    --max_seq_length 512 \
    --do_train \
    --output_dir ./output/test-mlm

But now I’m getting “IndexError: index out of range in self”. Some Googling led me to believe this might have to do with the vocab size. I edited config.json to match the tokenizer (52000 vocab size), but no dice.

Here’s config.json

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 207,
  "model_type": "roberta",
  "num_attention_heads": 6,
  "num_hidden_layers": 3,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 52000
}
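
One way to rule the vocab-size theory in or out is to compare the two files directly. This is a minimal sketch, assuming the folder holds a config.json and a vocab.json as described above. The helper name vocab_size_matches is made up for illustration, and it does not account for tokens added outside vocab.json.

```python
import json
from pathlib import Path


def vocab_size_matches(model_dir):
    """Compare config.json's vocab_size with the number of entries in vocab.json.

    Hypothetical helper for illustration, not part of transformers.
    Returns (config_vocab_size, tokenizer_vocab_size, sizes_match).
    """
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    vocab = json.loads((model_dir / "vocab.json").read_text(encoding="utf-8"))
    config_size = config["vocab_size"]
    tokenizer_size = len(vocab)  # one entry per token in the BPE vocabulary
    return config_size, tokenizer_size, config_size == tokenizer_size
```

Calling vocab_size_matches("/path/to/EsperBERTo/") returns both sizes and whether they agree. If they do agree, the IndexError is probably coming from a different embedding table.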

At this point I’m thoroughly confused as to how to proceed, and I’m not even sure I’m going down the right path.

Can anyone give guidance on the proper process to go from .txt file to trained model in the command line? I’m not having much success transferring my knowledge (and my trained tokenizer) from Colab.

You seem to be on the correct path. Could you tell us more about the index error you encountered? What did the stack trace look like? Also, could you briefly try a model other than roberta (bert, for instance) and report whether the error disappears?

Sure, @sgugger, here’s the complete output.

As for “try briefly with another model”, what do I even need to change? Can I simply change --model_type, or do I also have to change config.json, the tokenizer, etc.?

$ python run_mlm.py     --model_type roberta     --tokenizer_name /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/     --train_file /home/cleong/projects/personal/colin-summer-2021/data/oscar.eo.txt     --max_seq_length 512     --do_train     --output_dir ./output/test-mlm
/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
06/07/2021 09:06:58 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0distributed training: False, 16-bits training: False
06/07/2021 09:06:58 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=./output/test-mlm, overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Jun07_09-06-58_act3admin-Precision-7730, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=./output/test-mlm, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, log_on_each_node=True, _n_gpu=0, mp_parameters=)
06/07/2021 09:06:58 - WARNING - datasets.builder -   Using custom data configuration default-77be700d26e27b24
06/07/2021 09:06:58 - WARNING - datasets.builder -   Reusing dataset text (/home/cleong/.cache/huggingface/datasets/text/default-77be700d26e27b24/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
06/07/2021 09:06:58 - WARNING - __main__ -   You are instantiating a new config instance from scratch.
[INFO|configuration_utils.py:515] 2021-06-07 09:06:58,903 >> loading configuration file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/config.json
[INFO|configuration_utils.py:553] 2021-06-07 09:06:58,904 >> Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 207,
  "model_type": "roberta",
  "num_attention_heads": 6,
  "num_hidden_layers": 3,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.7.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

[INFO|tokenization_utils_base.py:1651] 2021-06-07 09:06:58,904 >> Didn't find file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-06-07 09:06:58,905 >> Didn't find file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-06-07 09:06:58,905 >> Didn't find file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1651] 2021-06-07 09:06:58,905 >> Didn't find file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/tokenizer_config.json. We won't load it.
[INFO|tokenization_utils_base.py:1715] 2021-06-07 09:06:58,905 >> loading file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/vocab.json
[INFO|tokenization_utils_base.py:1715] 2021-06-07 09:06:58,905 >> loading file /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/merges.txt
[INFO|tokenization_utils_base.py:1715] 2021-06-07 09:06:58,905 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-07 09:06:58,905 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-07 09:06:58,906 >> loading file None
[INFO|tokenization_utils_base.py:1715] 2021-06-07 09:06:58,906 >> loading file None
06/07/2021 09:06:59 - INFO - __main__ -   Training new model from scratch
06/07/2021 09:07:01 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/cleong/.cache/huggingface/datasets/text/default-77be700d26e27b24/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-994c5abeed4d6e58.arrow
06/07/2021 09:07:01 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/cleong/.cache/huggingface/datasets/text/default-77be700d26e27b24/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5/cache-2cb7998233554805.arrow
[INFO|trainer.py:514] 2021-06-07 09:07:01,719 >> The following columns in the training set  don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask.
[INFO|trainer.py:1147] 2021-06-07 09:07:01,724 >> ***** Running training *****
[INFO|trainer.py:1148] 2021-06-07 09:07:01,724 >>   Num examples = 143129
[INFO|trainer.py:1149] 2021-06-07 09:07:01,724 >>   Num Epochs = 3
[INFO|trainer.py:1150] 2021-06-07 09:07:01,724 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1151] 2021-06-07 09:07:01,724 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1152] 2021-06-07 09:07:01,724 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1153] 2021-06-07 09:07:01,724 >>   Total optimization steps = 53676
  0%|                                                                                                                                                                             | 0/53676 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/cleong/projects/personal/colin-summer-2021/run_mlm.py", line 500, in <module>
    main()
  File "/home/cleong/projects/personal/colin-summer-2021/run_mlm.py", line 451, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/transformers/trainer.py", line 1263, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/transformers/trainer.py", line 1741, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/transformers/trainer.py", line 1773, in compute_loss
    outputs = model(**inputs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 1049, in forward
    outputs = self.roberta(
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 808, in forward
    embedding_output = self.embeddings(
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 122, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 156, in forward
    return F.embedding(
  File "/home/cleong/miniconda3/envs/languagemodel/lib/python3.9/site-packages/torch/nn/functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

All right, changing absolutely nothing except model_type, I no longer get “IndexError: index out of range in self”, and it seems to be training. Yet it still reads in and displays this from config.json:

[INFO|configuration_utils.py:553] 2021-06-07 09:17:42,045 >> Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 207,
  "model_type": "roberta",
  "num_attention_heads": 6,
  "num_hidden_layers": 3,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.7.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

Updated command, only changed model_type:

python run_mlm.py \
    --model_type bert \
    --tokenizer_name /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/ \
    --train_file /home/cleong/projects/personal/colin-summer-2021/data/oscar.eo.txt \
    --max_seq_length 512 \
    --do_train \
    --output_dir ./output/test-mlm

The config is part of the logs of this script, so it’s logical that it is displayed. So using another model solves your issue. There is a quirk in the positional embeddings of roberta: you might have to add 2 to max_position_embeddings for it to work without an indexing error.

Tried adding 2 to the max_position_embeddings in the config.json, now 209. Same error.

The reason I’m confused that it’s displaying config.json is that I’m not passing config.json in the arguments. I specify tokenizer_name, and it apparently looks in that folder for the tokenizer vocab etc., but why does it also look for the model config there?

Added some debug code to functional.py in the pytorch library:

    print("*******************")
    print(f"len(weight): {len(weight)}")
    print(f"weight: {weight}")
    print(f"input.size(): {input.size()}")
    print(f"input: {input}")
    print(f"padding_idx: {padding_idx}")
    print(f"scale_grad_by_freq: {scale_grad_by_freq}")
    print(f"sparse: {sparse}")

Here’s what I see for the values when it crashes:

*******************
len(weight): 209
weight: Parameter containing:
tensor([[-0.0065,  0.0223,  0.0171,  ..., -0.0050,  0.0034,  0.0075],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0193, -0.0405,  0.0071,  ...,  0.0088, -0.0047,  0.0087],
        ...,
        [ 0.0138,  0.0307,  0.0120,  ..., -0.0131, -0.0176, -0.0025],
        [ 0.0308,  0.0078,  0.0159,  ..., -0.0149, -0.0200, -0.0039],
        [-0.0060, -0.0121, -0.0120,  ..., -0.0102,  0.0054,  0.0141]],
       requires_grad=True)
input.size(): torch.Size([8, 512])
input: tensor([[  2,   3,   4,  ..., 511, 512, 513],
        [  2,   3,   4,  ..., 511, 512, 513],
        [  2,   3,   4,  ..., 511, 512, 513],
        ...,
        [  2,   3,   4,  ..., 511, 512, 513],
        [  2,   3,   4,  ..., 511, 512, 513],
        [  2,   3,   4,  ..., 511, 512, 513]])
padding_idx: 1
scale_grad_by_freq: False
sparse: False

Interestingly, the debug code triggers twice before crashing on the third invocation. Each time the values are different, e.g. the first time it calls embeddings() the length of weight is 52000.

It’s definitely using the config.json, that’s for sure.

OK, I found the mismatch. When I call run_mlm.py and pass it --max_seq_length 512, the config.json needs to have max_position_embeddings set to 512 + 2 = 514.
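
The +2 follows from how the position ids in the debug output are numbered: they start at padding_idx + 1 rather than 0, so 512 tokens get ids 2 through 513. Below is a minimal pure-Python sketch of that numbering, written to reproduce the values printed above rather than copied from the library. The names roberta_style_position_ids and required_position_rows are illustrative only.

```python
def roberta_style_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens starting at padding_idx + 1.

    Padding tokens keep padding_idx as their position id.
    Illustrative re-implementation, not the library's code.
    """
    position_ids = []
    seen = 0  # count of non-padding tokens so far
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


def required_position_rows(seq_length, padding_idx=1):
    """Smallest position-embedding table that can index a full-length sequence."""
    # The largest id is seq_length + padding_idx; the table needs one more row than that.
    return seq_length + padding_idx + 1


# 512 real tokens; any id other than padding_idx works for this demo.
ids = roberta_style_position_ids([5] * 512)
print(ids[0], ids[-1])              # 2 513
print(required_position_rows(512))  # 514
```

With a table of only 209 rows, as in the debug output, id 513 is far out of range, which is exactly the kind of lookup that raises "index out of range in self".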

NOW it trains


Unfortunately, it doesn’t seem to be as simple as I thought. It also doesn’t seem consistent. Doing some tests:

  • cli 512, config 514: I’m getting the error again.
  • cli 512, config 1024: crashes with the same error.
  • cli 256, config 514: the training loop runs.
  • cli 207, config 514: the training loop runs.
  • cli 500, config 514: the training loop runs.
  • cli 508, config 514: the training loop runs.
  • cli 512, config 514 again: same error again.
  • cli 508, config 514: runs fine again.

Tried rm -rf ~/.cache/huggingface/datasets and ran some more tests:

  • 512, 514: I see some progress bars, as it apparently reruns some dataset stuff… then I get the IndexError again.
  • 512, 516: error
  • 512, 700: error
  • 512, 1024: error
  • 512, 1028: error
  • 510, 514: two progress bars again, as it does some sort of dataset processing, then training loop runs fine.
  • 513, 514: progress bars and then error.
  • 511, 514: progress bars and error
  • 510, 514: trains fine
  • 510, 1024: trains fine
  • 510, 256: trains fine!!

I’m stumped. @sgugger got any ideas?

  • 207, 514 also trains fine.
  • but 510, 12 somehow also runs without error.

:man_shrugging:

I’ve been looking into how the tokenizer gets loaded. I’ve got a vocab.json and a merges.txt. I added a print statement to run_mlm.py right here: transformers/run_mlm.py at 49bee0aea44ef29c08d48f818f356275ef223da8 · huggingface/transformers · GitHub

And I see that AutoTokenizer sets the model_max_length to an incredibly large number: model_max_len=1000000000000000019884624838656

06/08/2021 09:07:05 - INFO - __main__ -   tokenizer: PreTrainedTokenizerFast(name_or_path='/home/cleong/projects/personal/colin-summer-2021/EsperBERTo/', vocab_size=52000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})

I overrode the tokenizer loading code thus:

tokenizer = AutoTokenizer.from_pretrained("./EsperBERTo", max_len=512)

Then I ran it again with cli 512, config 514, it no longer crashes in the same way:

06/08/2021 09:23:56 - INFO - __main__ -   tokenizer: PreTrainedTokenizerFast(name_or_path='./EsperBERTo', vocab_size=52000, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
06/08/2021 09:23:56 - INFO - __main__ -   Training new model from scratch
  0%|                                                                                                                                                                               | 0/975 [00:00<?, ?ba/s][WARNING|tokenization_utils_base.py:3170] 2021-06-08 09:23:59,091 >> Token indices sequence length is longer than the specified maximum sequence length for this model (610 > 512). Running this sequence through the model will result in indexing errors

Sigh, spoke too soon, it still crashes. Apparently it just needed to re-tokenize the dataset first, which is why I got a progress bar.

I do get a new warning, though:

[WARNING|tokenization_utils_base.py:3170] 2021-06-08 09:23:59,091 >> Token indices sequence length is longer than the specified maximum sequence length for this model (610 > 512). Running this sequence through the model will result in indexing errors

This is interesting, not sure where “610” comes from.

Some more things I noticed:

  • You need to remove the datasets cache before testing things, otherwise you might not see the results of your changes.
  • If I add the argument --line_by_line, it runs for a while and only crashes once it gets an input that’s too long.
  • I pulled two very long lines out of oscar.eo.txt into a separate file, and it crashed immediately.

It seems the truncation of the input isn’t happening right?

OK, I think I finally figured out what’s going on. There seem to be weird interactions between --model_type roberta, which sets the (position_embeddings) to 512, and --config_name, which actually reads the json, where in my case the value was 514.

So if you don’t give it --config_name, it doesn’t actually set the position_embeddings using the config.json, even though it prints the config.json out, and even though it uses the config.json to set some of the tokenizer stuff.
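
Taking that at face value, the cli/config tests above line up: if the model is built with 512 position rows whenever --config_name is absent, the longest sequence that fits is 510, whatever config.json says. Here is a small sketch of that arithmetic, assuming position ids start at padding_idx + 1 as in the debug output. max_safe_seq_length is a made-up helper name.

```python
def max_safe_seq_length(max_position_embeddings, padding_idx=1):
    """Longest sequence whose highest position id still fits in the table.

    Assumes ids run from padding_idx + 1 up to seq_length + padding_idx,
    as seen in the debug output earlier in the thread.
    """
    return max_position_embeddings - padding_idx - 1


# Assumed effective table size when --config_name is not passed.
effective_rows = 512
for cli_max_seq_length in (207, 256, 500, 508, 510, 511, 512, 513):
    fits = cli_max_seq_length <= max_safe_seq_length(effective_rows)
    print(cli_max_seq_length, "runs" if fits else "IndexError")

# With a 514-row table actually applied, a 512-token sequence fits.
print(max_safe_seq_length(514))  # 512
```

Under that assumption, 510 and below run and 511 and above fail, regardless of the number written in config.json.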

tl;dr: here is the magic combination of flags. This seems to run through the long lines of oscar.eo.txt without crashing.

CLI:

python run_mlm.py \
    --line_by_line \
    --config_name /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/ \
    --tokenizer_name /home/cleong/projects/personal/colin-summer-2021/EsperBERTo/ \
    --train_file /home/cleong/projects/personal/colin-summer-2021/data/oscar.eo.txt \
    --max_seq_length 512 \
    --do_train \
    --output_dir ./output/test-mlmm

config.json:

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 52000
}

Also, if you give it both --model_type and --config_name, it seems to use the config over the type, based on inspecting the source code in run_mlm.py.
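
As far as can be told from the script, the config source is picked in a fixed order, with --model_type used only as a last resort. The sketch below paraphrases that branch order as a standalone function. The function name and the returned strings are illustrative, and the real script loads config objects rather than returning labels, so check it against your own copy of run_mlm.py.

```python
def resolve_config_source(config_name=None, model_name_or_path=None, model_type=None):
    """Return a label for where the model config would come from.

    Illustrative paraphrase of the branch order, not the script's code.
    """
    if config_name:
        # An explicit --config_name wins: the JSON in that folder is loaded.
        return f"loaded from {config_name}"
    if model_name_or_path:
        # Otherwise the config shipped with the checkpoint is used.
        return f"loaded from {model_name_or_path}"
    if model_type:
        # Last resort: library defaults for the model type, built from scratch.
        return f"default {model_type} config"
    raise ValueError("need --config_name, --model_name_or_path, or --model_type")


print(resolve_config_source(model_type="roberta", config_name="./EsperBERTo"))
# loaded from ./EsperBERTo
print(resolve_config_source(model_type="roberta"))
# default roberta config
```

This would also fit the "You are instantiating a new config instance from scratch" warning in the earlier log. The config.json printed right after that warning was likely read while loading the tokenizer rather than while building the model.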