I would like to do sequence classification on top of the encoder of the T5 model.
Which hidden state of the last layer should I use for the classification? Should I use the hidden state of the last timestep, or should I take a mean over all timesteps?
Are there any suggestions for this? I am not sure whether hidden[:, 0, :] makes sense (since there is no [CLS] token in T5), but I found that using hidden[:, 0, :] yields better results than torch.mean(hidden_states, dim=1). What is the best way to do this with T5Encoder?
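
To make the comparison concrete, here is a minimal sketch of the pooling options in question. It is only an illustration under stated assumptions: the tensors are random stand-ins for the encoder's last hidden state and the attention mask, the function names are hypothetical, and the last-token variant assumes right padding. One thing that may be worth checking is that a plain torch.mean over dim=1 also averages padding positions when the batch is padded, so a mask-aware mean is included for comparison.

```python
import torch


def first_token_pool(hidden):
    # Hidden state of the first position; T5 has no [CLS] token,
    # so this is simply the first input token's representation.
    return hidden[:, 0, :]


def masked_mean_pool(hidden, attention_mask):
    # Mean over real tokens only; padding positions are zeroed out
    # and excluded from the denominator.
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)                   # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)               # (batch, 1)
    return summed / counts


def last_token_pool(hidden, attention_mask):
    # Hidden state of the last non-padding position.
    # Assumption: sequences are right-padded.
    last_idx = (attention_mask.sum(dim=1).long() - 1).clamp(min=0)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]


# Random stand-ins (assumption): batch of 2, sequence length 4, hidden size 8.
torch.manual_seed(0)
hidden = torch.randn(2, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])  # second sequence has 2 pad tokens

first = first_token_pool(hidden)
mean = masked_mean_pool(hidden, attention_mask)
last = last_token_pool(hidden, attention_mask)
```

With a real model, hidden would be the encoder output's last_hidden_state and attention_mask would come from the tokenizer. Each pooled tensor has shape (batch, hidden_size) and could be fed to a linear classification head.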