I am using the Trainer to train an ASR model, and both the dataset and the output dimension are huge. This causes several problems during training. I struggled with them for many days, so I am posting my solutions here in the hope that they help someone else.
- `compute_metrics` out-of-memory issue: during evaluation, the Trainer accumulates the logits of every batch in one array. When the output dimension is large, this quickly runs out of memory on a big dataset. The fix is to apply `torch.argmax` to the logits first, so only the predicted ids are stored instead of the full logits (first sketch below).
- when using the Trainer with a seq2seq model, if the model output contains `past_key_values`, the Trainer raises a length-mismatch error while concatenating the outputs of different batches, so `past_key_values` has to be dropped from the model output (second sketch below).
- `group_by_length` adds a very long delay when `trainer.train()` starts and uses a lot of memory, because it has to compute the length of every sample. Precomputing a length column avoids recomputing the lengths at startup (third sketch below).
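For the first issue, recent `transformers` versions expose a `preprocess_logits_for_metrics` hook on the Trainer that runs per batch, before the logits are gathered. A minimal sketch of the argmax trick with it; the wav2vec2 checkpoint is just an example, and the datasets/collator are omitted:

```python
import torch
from transformers import AutoModelForCTC, Trainer, TrainingArguments

def preprocess_logits_for_metrics(logits, labels):
    # Runs on each batch before the Trainer gathers predictions:
    # keep one predicted id per frame instead of a vocabulary-sized
    # float vector, which is what blows up the memory.
    if isinstance(logits, tuple):  # some models return extra tensors
        logits = logits[0]
    return torch.argmax(logits, dim=-1)

def compute_metrics(eval_pred):
    pred_ids = eval_pred.predictions   # already argmax-ed ids
    label_ids = eval_pred.label_ids
    # ... decode pred_ids / label_ids and compute WER/CER here ...
    return {}

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    # train_dataset / eval_dataset / data_collator omitted for brevity
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```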
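For the `past_key_values` problem, two workarounds I know of, both using standard Trainer APIs (the exact output key name can differ per model; `trainer`, `model`, and `test_dataset` are assumed to be defined as above):

```python
# Option A: tell the Trainer to drop the cache from the gathered outputs.
metrics = trainer.evaluate(ignore_keys=["past_key_values"])
preds = trainer.predict(test_dataset, ignore_keys=["past_key_values"])

# The same list can be passed to evaluations that run during training:
trainer.train(ignore_keys_for_eval=["past_key_values"])

# Option B: disable the cache so the model never returns it at all.
model.config.use_cache = False
```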
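For the `group_by_length` slowdown, if the training set is a `datasets.Dataset`, the Trainer will read a precomputed length column (named by `length_column_name`, default `"length"`) instead of measuring every example itself. A sketch of the idea, assuming the audio feature column is called `input_values`:

```python
from transformers import TrainingArguments

# Precompute a "length" column once, so the length-grouped sampler
# reads it directly instead of touching every example at startup.
def add_length(example):
    # "input_values" is assumed to be the audio feature column of a
    # CTC-style ASR dataset; adjust to your own column name.
    example["length"] = len(example["input_values"])
    return example

dataset = dataset.map(add_length)  # `dataset` assumed: a datasets.Dataset

training_args = TrainingArguments(
    output_dir="out",
    group_by_length=True,
    length_column_name="length",  # the default name the Trainer looks for
)
```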