Any incompatibility of gradient accumulation with streaming data?

Hello,

I'm trying to train a model on streaming data with gradient accumulation, on 8 GPUs on a single node. However, I'm seeing some strange results that I have never encountered before:

When I increase gradient_accumulation_steps, the results (both training and eval loss) become much worse than with a smaller value (at the same number of training steps). This is contrary to my expectation that increasing gradient accumulation should improve the results because of the larger global batch size.

So I wonder whether there is some incompatibility between gradient accumulation and streaming datasets.

FYI, below are my code snippets for the streaming dataset and gradient accumulation (training LLaMA with LoRA via PEFT):

Dataset

from datasets import load_dataset

# 100,000 sharded training files plus a single validation file
train_file = [f"{idx}.jsonl" for idx in range(100000)]
eval_file = "valid.jsonl"

print("Loading dataset...")

# Stream the JSON files instead of loading everything into memory
dataset = load_dataset("json", data_files={"train": train_file, "eval": eval_file}, streaming=True)
dataset = dataset.with_format("torch")
train_dataset = dataset["train"]
# Approximate shuffling with a 500-example buffer (streaming shuffle)
train_dataset = train_dataset.shuffle(buffer_size=500)
eval_dataset = dataset["eval"]

# Tokenize on the fly as examples are streamed
train_dataset = train_dataset.map(tokenize_function, batched=True)
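
For what it's worth, here is a quick sanity check I run on the streamed pipeline (just a sketch: it pulls a few tokenized examples to confirm the fields and shapes look right):

# Peek at a few streamed, tokenized examples.
# IterableDataset.take(n) restricts the stream to its first n examples.
for example in train_dataset.take(3):
    # with_format("torch") should yield tensors, so print their shapes
    print({k: (v.shape if hasattr(v, "shape") else v) for k, v in example.items()})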

TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=args.save_dir,
    overwrite_output_dir=True,
    num_train_epochs=epochs,
    max_steps=max_steps,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    eval_accumulation_steps=8,
    warmup_steps=args.warmup,
    save_strategy="steps",
    save_steps=5000,
    evaluation_strategy="steps",
    eval_steps=5000,
    logging_steps=100,
    log_on_each_node=False,
    logging_dir=args.save_dir + "/logs",
    learning_rate=args.lr,
    gradient_accumulation_steps=8,
    fp16=True,
    do_train=True,
)
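
For completeness, the rest of the setup looks roughly like this (a sketch: model is the LoRA-wrapped LLaMA from PEFT and tokenizer is its tokenizer, both defined earlier in my script; the causal-LM data collator here is just an assumption for illustration):

from transformers import Trainer, DataCollatorForLanguageModeling

# Wire the streaming datasets and the arguments above into the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()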

I really appreciate any comments or suggestions. BTW, is there any way to log the global batch size? I am starting to doubt that the global batch size is actually what I think it is.
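
For reference, this is how I would compute the global batch size I expect from the TrainingArguments above (just a sketch; world_size is the TrainingArguments property that reports the number of training processes):

# Expected effective (global) batch size under the config above:
# per-device batch size * gradient accumulation steps * number of processes.
global_batch_size = (
    training_args.per_device_train_batch_size
    * training_args.gradient_accumulation_steps
    * training_args.world_size
)
print(f"Expected global batch size: {global_batch_size}")
# e.g. with per_device_train_batch_size=4, gradient_accumulation_steps=8,
# and 8 processes, this prints 256.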