Hang in language modelling script

Hi all,
I’m running examples/language-modeling/run_language_modeling.py and it freezes at the first file lock: the lock is acquired and then nothing further happens.

Here’s the command and the log output:

python run_language_modeling.py \
       --output_dir=output \
       --model_name_or_path=camembert-base \
       --do_train \
       --train_data_files='/home/theo_nabla_com/data/mydata-corpus/chunk*' \
       --do_eval \
       --eval_data_file=/home/theo_nabla_com/data/mydata-corpus/valid \
       --mlm \
       --whole_word_mask
10/29/2020 08:09:13 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
10/29/2020 08:09:13 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Oct29_08-09-13_google3-theo', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='output', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None)
10/29/2020 08:09:13 - DEBUG - urllib3.connectionpool -   Starting new HTTPS connection (1): s3.amazonaws.com:443
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/camembert-base-config.json HTTP/1.1" 200 0
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   Starting new HTTPS connection (1): s3.amazonaws.com:443
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/camembert-base-config.json HTTP/1.1" 200 0
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   Starting new HTTPS connection (1): s3.amazonaws.com:443
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model HTTP/1.1" 200 0
/home/theo_nabla_com/code/transformers/src/transformers/modeling_auto.py:822: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  FutureWarning,
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   Starting new HTTPS connection (1): cdn.huggingface.co:443
10/29/2020 08:09:14 - DEBUG - urllib3.connectionpool -   https://cdn.huggingface.co:443 "HEAD /camembert-base-pytorch_model.bin HTTP/1.1" 200 0
Some weights of CamembertForMaskedLM were not initialized from the model checkpoint at camembert-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/theo_nabla_com/code/transformers/src/transformers/tokenization_utils_base.py:1421: FutureWarning: The `max_len` attribute has been deprecated and will be removed in a future version, use `model_max_length` instead.
  FutureWarning,
10/29/2020 08:09:19 - DEBUG - filelock -   Attempting to acquire lock 140320072690936 on /home/theo_nabla_com/data/mydata-corpus/cached_lm_CamembertTokenizer_510_chunkaj.lock
10/29/2020 08:09:19 - INFO - filelock -   Lock 140320072690936 acquired on /home/theo_nabla_com/data/mydata-corpus/cached_lm_CamembertTokenizer_510_chunkaj.lock

My training dir contains ~200 files of ~30 MB each, following the documentation’s advice to keep training files small for the tokenizer (though since I’m fine-tuning from CamemBERT, I wouldn’t expect any tokenizer “training” to be run?).
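As far as I can tell there is no tokenizer training involved: the cache file name in the log (cached_lm_CamembertTokenizer_510_chunkaj) suggests the script is building a TextDataset cache, i.e. tokenizing the whole chunk and grouping it into 510-token blocks while holding the lock. Here is a rough, untested sketch of how that step could be timed in isolation (my paths; I believe the script passes the model max length as block_size and TextDataset subtracts the special tokens, hence the 510 in the cache name):

# Sketch only - this reuses the same cache and lock files, so run it with the training script stopped.
import time
from transformers import CamembertTokenizer, TextDataset

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

start = time.time()
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="/home/theo_nabla_com/data/mydata-corpus/chunkaj",  # one ~30 MB chunk
    block_size=512,
)
print(f"built {len(dataset)} blocks in {time.time() - start:.0f}s")

If this takes many minutes on a 30 MB file, the “freeze” may just be the tokenization pass that runs before the cache is written.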
I can’t figure out why this freezes, so any pointers would be appreciated :slight_smile:
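If it helps with debugging, here is a quick check that could be run from a second shell while the run looks frozen, to see whether the lock itself is the blocker (a sketch, assuming the filelock package that transformers uses for these locks):

from filelock import FileLock, Timeout

lock_path = "/home/theo_nabla_com/data/mydata-corpus/cached_lm_CamembertTokenizer_510_chunkaj.lock"
try:
    # Probe the lock with a short timeout instead of blocking forever.
    with FileLock(lock_path, timeout=5):
        print("lock is free, so the hang is not the lock itself")
except Timeout:
    print("lock is held by another process (presumably the training run)")

If it reports the lock as held while the run appears frozen, the script itself is still holding it, which would point to a slow caching/tokenization step rather than a deadlock on the lock.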

Edit: I get the same behaviour with a single 30 MB training file.