Any incompatibility of gradient accumulation with streaming data?

Hello,

I'm trying to train a model on streaming data with gradient accumulation, on 8 GPUs on a single node. However, I'm seeing some strange results that I have never encountered before:

When I increase gradient_accumulation_steps, the results (both training and eval loss) become much worse than with a smaller value (at the same number of training steps). This is contrary to my expectation that increasing gradient accumulation should improve the results because of the larger global batch size.

So I wonder whether there is some incompatibility between gradient accumulation and streaming datasets.

FYI, below are my code snippets for the streaming dataset and gradient accumulation (training LLaMA with LoRA via PEFT):

Dataset

from datasets import load_dataset

# 100,000 sharded training files plus a single validation file
train_file = [f"{idx}.jsonl" for idx in range(100000)]
eval_file = "valid.jsonl"

print("Loading dataset...")

# Stream the JSON files instead of loading everything into memory
dataset = load_dataset("json", data_files={"train": train_file, "eval": eval_file}, streaming=True)
dataset = dataset.with_format("torch")
train_dataset = dataset["train"]
# Approximate shuffling with a 500-example buffer (streaming shuffle)
train_dataset = train_dataset.shuffle(buffer_size=500)
eval_dataset = dataset["eval"]

# Tokenize on the fly as examples are streamed
train_dataset = train_dataset.map(tokenize_function, batched=True)
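
For what it's worth, here is a quick sanity check I run on the streamed pipeline (just a sketch: it pulls a few tokenized examples to confirm the fields and shapes look right):

# Peek at a few streamed, tokenized examples.
# IterableDataset.take(n) restricts the stream to its first n examples.
for example in train_dataset.take(3):
    # with_format("torch") should yield tensors, so print their shapes
    print({k: (v.shape if hasattr(v, "shape") else v) for k, v in example.items()})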

TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=args.save_dir,
    overwrite_output_dir=True,
    num_train_epochs=epochs,
    max_steps=max_steps,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    eval_accumulation_steps=8,
    warmup_steps=args.warmup,
    save_strategy="steps",
    save_steps=5000,
    evaluation_strategy="steps",
    eval_steps=5000,
    logging_steps=100,
    log_on_each_node=False,
    logging_dir=args.save_dir + "/logs",
    learning_rate=args.lr,
    gradient_accumulation_steps=8,
    fp16=True,
    do_train=True,
)
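
For completeness, the rest of the setup looks roughly like this (a sketch: model is the LoRA-wrapped LLaMA from PEFT and tokenizer is its tokenizer, both defined earlier in my script; the causal-LM data collator here is just an assumption for illustration):

from transformers import Trainer, DataCollatorForLanguageModeling

# Wire the streaming datasets and the arguments above into the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()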

I really appreciate any comments or suggestions. BTW, is there any way to log the global batch size? I am starting to doubt that the global batch size is actually what I think it is.
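
For reference, this is how I would compute the global batch size I expect from the TrainingArguments above (just a sketch; world_size is the TrainingArguments property that reports the number of training processes):

# Expected effective (global) batch size under the config above:
# per-device batch size * gradient accumulation steps * number of processes.
global_batch_size = (
    training_args.per_device_train_batch_size
    * training_args.gradient_accumulation_steps
    * training_args.world_size
)
print(f"Expected global batch size: {global_batch_size}")
# e.g. with per_device_train_batch_size=4, gradient_accumulation_steps=8,
# and 8 processes, this prints 256.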