I am trying to build a multimodal model that uses both speech and text to predict emotions in spontaneous conversations (I run my experiments on IEMOCAP). I am stuck with the wav2vec2 transformer because I do not know how to apply the attention mask manually. I searched for solutions, but everyone seems to use the standard training pipeline. The pipeline does not help me because I need to fuse wav2vec2's hidden states with the hidden states of the text transformer.
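For context, here is a minimal sketch of what I believe the manual mask handling has to do: since wav2vec2's convolutional feature extractor downsamples the raw waveform, a sample-level padding mask must be shrunk to the hidden-state frame rate before it can be used for fusion. The kernel/stride values below are the wav2vec2-base config defaults, and I am assuming right-padded inputs (valid samples first, then zeros); please correct me if this is wrong.

```python
# Sketch: downsample a sample-level attention mask to wav2vec2's frame rate.
# Kernel sizes / strides are the defaults of the wav2vec2-base feature
# extractor (conv_kernel / conv_stride in Wav2Vec2Config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def feat_extract_output_length(input_length: int) -> int:
    """Number of frames the feature extractor produces for a raw-waveform
    input of `input_length` samples (standard unpadded 1-D conv formula)."""
    length = input_length
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

def downsample_mask(sample_mask: list) -> list:
    """Turn a per-sample 0/1 padding mask into a per-frame 0/1 mask.
    Assumes right padding: all the 1s come before all the 0s."""
    total_frames = feat_extract_output_length(len(sample_mask))
    valid_frames = feat_extract_output_length(sum(sample_mask))
    return [1] * valid_frames + [0] * (total_frames - valid_frames)
```

With this I can line up a frame-level mask with the wav2vec2 hidden states before concatenating them with the text transformer's hidden states, but I am not sure this is the intended way to do it.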
Could somebody share sample code showing how to apply the attention mask manually?