Regarding max steps, streaming in language modeling

Since --max_steps is required when streaming is enabled, I'd like to check my calculation.
My training data has 10B tokens, with a seq_len (block_size) of 1024 and a global batch size of 128.
For, say, 5 epochs, is my calculation for max_steps below correct?

Calculation for max_steps:

  1. (1024 * 128) tokens per step.
  2. max_steps = (10B / (1024 * 128)) * 5

Hi @Palash123, that looks right to me. Could you tell me exactly which example you're referring to, so I can make sure?

Hi @regisss, this is regarding an example from language-modeling.

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --output_dir /tmp/test-clm \
    --gaudi_config_name Habana/gpt2 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --streaming \
    --max_steps 1000 \
    --do_eval

Thanks! So yes, your calculation looks right 🙂
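
For reference, the arithmetic from the calculation above can be sketched in a few lines of Python. This is only an illustration: the helper name compute_max_steps is hypothetical (it is not part of run_clm.py), and it assumes every training step consumes exactly block_size * global_batch_size tokens, i.e. the data is packed into full blocks. The global batch size is taken here to mean per-device batch size times number of devices times gradient accumulation steps.

```python
def compute_max_steps(total_tokens, block_size, global_batch_size, num_epochs):
    """Hypothetical helper: estimate --max_steps for a streaming dataset.

    Assumes each step consumes exactly block_size * global_batch_size tokens.
    """
    tokens_per_step = block_size * global_batch_size
    total_tokens_to_see = total_tokens * num_epochs
    # Integer ceiling division, so the last partial step is still counted.
    return (total_tokens_to_see + tokens_per_step - 1) // tokens_per_step


# Numbers from the question: 10B tokens, block_size 1024, global batch 128, 5 epochs.
max_steps = compute_max_steps(10_000_000_000, 1024, 128, 5)
print(max_steps)  # 381470
```

With the numbers from the question this gives roughly 381k steps, which is the value to pass to --max_steps in place of 1000 in the command above.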