DistilGPT2 pre-training configuration

Hi everyone!
As part of our research work, we are attempting to reproduce DistilGPT2. We downloaded OpenWebText, binarized it as indicated here, extracted the student's weights (sketched below the command), and used (almost) the same configuration as in research_projects/distillation:

python -m torch.distributed.launch \
    --nproc_per_node=$N_GPU_NODE \
    --nnodes=$N_NODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py \
        --fp16 \
        --force \
        --gpus $WORLD_SIZE \
        --student_type gpt2 \
        --student_config training_configs/distilgpt2.json \
        --student_pretrained_weights ./student/pytorch_model.bin \
        --teacher_type gpt2 \
        --teacher_name gpt2 \
        --alpha_ce 5.0 --alpha_cos 1.0 --alpha_clm 0.5  \
        --freeze_pos_embs \
        --dump_path my_dir \
        --data_file data/owt.pickle \
        --token_counts data/token_owt.pickle
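
For reference, the file passed to --student_pretrained_weights is just a state dict for the 6-layer student initialized from the 12-layer teacher. Below is a minimal sketch of that kind of extraction (the reference is scripts/extract.py in the distillation folder; the layer mapping here is illustrative, not necessarily the one used for the released distilgpt2):

# Minimal sketch of building a teacher-initialized student state dict.
# The layer mapping below is an assumption for illustration.
import os
import torch
from transformers import GPT2LMHeadModel

teacher_sd = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
student_sd = {}

# Keep token/position embeddings, the final layer norm, and the (tied) LM head.
for name in ["transformer.wte.weight", "transformer.wpe.weight",
             "transformer.ln_f.weight", "transformer.ln_f.bias",
             "lm_head.weight"]:
    student_sd[name] = teacher_sd[name]

# Copy 6 of the 12 teacher blocks into the student (every other block here).
layer_map = {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10}
for s_idx, t_idx in layer_map.items():
    t_prefix, s_prefix = f"transformer.h.{t_idx}.", f"transformer.h.{s_idx}."
    for name, tensor in teacher_sd.items():
        if name.startswith(t_prefix):
            student_sd[s_prefix + name[len(t_prefix):]] = tensor

os.makedirs("student", exist_ok=True)
torch.save(student_sd, "student/pytorch_model.bin")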

We kept the default values for the rest of the hyper-parameters. However, the model is not converging (perplexity over 80 on the WikiText-103 test set). Can anyone confirm whether the settings above are correct? Thanks a lot!
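
For anyone wanting to reproduce that number, the perplexity can be computed with a standard sliding-window evaluation along these lines (the 1024-token window, the 512 stride, and the checkpoint name are illustrative, not necessarily our exact setup):

# Illustrative sliding-window perplexity on the WikiText-103 test set.
# "distilgpt2" stands in for the trained student checkpoint directory.
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("distilgpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_length, stride = 1024, 512
nlls, n_scored, prev_end = [], 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # number of new tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # ignore tokens already scored earlier
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    n_scored += trg_len
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_scored).item())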

cc @VictorSanh

Sorry it took some time, but I found the exact configuration!

{
    "force": true,
    "dump_path": "serialization_dir/distilbert_gpt2_w_freezing",
    "data_file": "data/dump_openwebtext.gpt2.pickle",
    "student_type": "gpt2",
    "student_config": "training_configs/gpt2.json",
    "student_pretrained_weights": "serialization_dir/gpt2_0247911.pth",
    "teacher_type": "gpt2",
    "teacher_name": "gpt2",
    "temperature": 2.0,
    "alpha_ce": 5.0,
    "alpha_mlm": 0.0,
    "alpha_clm": 2.0,
    "alpha_mse": 0.0,
    "alpha_cos": 1.0,
    "mlm": false,
    "mlm_mask_prop": 0.15,
    "word_mask": 0.8,
    "word_keep": 0.1,
    "word_rand": 0.1,
    "mlm_smoothing": 0.7,
    "token_counts": null,
    "restrict_ce_to_mask": false,
    "freeze_pos_embs": true,
    "freeze_token_type_embds": false,
    "n_epoch": 4,
    "batch_size": 1,
    "tokens_per_batch": -1,
    "shuffle": true,
    "group_by_size": true,
    "gradient_accumulation_steps": 500,
    "warmup_prop": 0.05,
    "weight_decay": 0.0,
    "learning_rate": 0.00025,
    "adam_epsilon": 1e-06,
    "max_grad_norm": 5.0,
    "initializer_range": 0.02,
    "fp16": false,
    "fp16_opt_level": "O1",
    "n_gpu": 8,
    "local_rank": 0,
    "seed": 56,
    "log_interval": 2000,
    "checkpoint_interval": 60000,
    "world_size": 8,
    "n_gpu_per_node": 8,
    "global_rank": 0,
    "n_nodes": 1,
    "node_id": 0,
    "multi_gpu": true,
    "is_master": true,
    "multi_node": false
}
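
In case it helps to map these values onto the objective, here is a condensed sketch of how temperature and the alpha_* weights combine, paraphrased from distiller.py in the distillation folder (attention masking and the alpha_mlm/alpha_mse terms, which are zero in this config, are omitted):

import torch
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, s_hidden, t_hidden, labels,
                      temperature=2.0, alpha_ce=5.0, alpha_clm=2.0, alpha_cos=1.0):
    # Soft-target loss: KL divergence between the softened student and teacher
    # distributions, scaled by temperature**2.
    loss_ce = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard-target loss: standard causal LM cross-entropy on shifted labels.
    shift_logits = s_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_clm = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                               shift_labels.view(-1), ignore_index=-100)
    # Cosine loss: align student and teacher last hidden states.
    target = s_hidden.new_ones(s_hidden.size(0) * s_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(s_hidden.view(-1, s_hidden.size(-1)),
                                       t_hidden.view(-1, t_hidden.size(-1)),
                                       target)
    return alpha_ce * loss_ce + alpha_clm * loss_clm + alpha_cos * loss_cos

Note also that with batch_size 1, gradient_accumulation_steps 500, and world_size 8, each optimizer update effectively covers 1 × 500 × 8 = 4000 sequences.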

Thanks a lot @sgugger and @VictorSanh!