How to use T5 for AutoEncoder tasks?

When performing tasks other than seq2seq generation, can I still use T5’s input IDs, or how do people typically process this information? When I build an AutoEncoder model to learn the relation between input_ids, the loss is very large. Does it make sense to:

  1. Use the model’s encoder via model.encoder(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'], return_dict=True)? If so, how do I decode later from my AutoEncoder’s learned embeddings? tokenizer.decode expects IDs, whereas my model will be producing embeddings.

  2. Use the tokenizer’s input IDs? If so, these values break my loss function because the IDs become very large. Is there a way around this? I cannot normalize them, as I ultimately need to decode the output to validate the result. Or in this case would I need an embedding layer as in this example?

  3. Not use T5 altogether and use a sentence transformer instead? I am unsure whether this will work, as my input data is source code rather than natural-language text, and it seems all sentence transformers are trained for pure NLP tasks.

  4. Some other method?

At a high level I am just not sure how to structure my data so that I can try using an AutoEncoder while still being able to decode the data later. Any tips?
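To make option 2 concrete, here is a minimal sketch of the embedding-layer variant, assuming PyTorch. The class name TokenAutoEncoder, the GRU layers, the layer sizes, and the stand-in vocabulary size are all hypothetical choices for illustration, not something taken from T5 or CodeT5. The idea is that the IDs are only used as indices into an embedding table, and the model outputs one score per vocabulary entry, so argmax gives back IDs that can be decoded.

```python
import torch
import torch.nn as nn


class TokenAutoEncoder(nn.Module):
    """Toy sequence autoencoder over token IDs (hypothetical, for illustration)."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # IDs -> dense vectors
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)  # vectors -> per-token logits

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        x = self.embed(input_ids)                  # (batch, seq, embed_dim)
        _, h = self.encoder(x)                     # h: (1, batch, hidden_dim)
        code = h[-1]                               # bottleneck: (batch, hidden_dim)
        # Feed the bottleneck vector to the decoder at every time step.
        dec_in = code.unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.decoder(dec_in)              # (batch, seq, hidden_dim)
        return self.to_vocab(out)                  # (batch, seq, vocab_size)


torch.manual_seed(0)
vocab_size = 1000  # stand-in; with a real tokenizer this would be tokenizer.vocab_size
model = TokenAutoEncoder(vocab_size)

# Random IDs stand in for tokenizer(text, return_tensors="pt").input_ids
input_ids = torch.randint(0, vocab_size, (4, 12))

logits = model(input_ids)
# Cross-entropy treats each ID as a class label, so the size of the ID does not matter.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), input_ids.reshape(-1)
)
loss.backward()

# argmax over the vocabulary axis turns logits back into integer IDs.
pred_ids = logits.argmax(dim=-1)
```

With a real tokenizer, pred_ids[0] could then be passed to tokenizer.decode(pred_ids[0], skip_special_tokens=True). For padded batches, the ignore_index argument of cross_entropy could be used to skip padding positions.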

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "int numcount=0; void divide(int num,int x) { int i; if(num==1) numcount++; for(i=x;i<=num;i++) { if(num%i==0) divide(num/i,i); } } int main() { int n,num,i; int first=2; int ans[100]; cin>>n; for(i=1;i<=n;i++) { cin>>num; divide(num,first); ans[i]=numcount; //cout<<count<<endl; numcount=0; } for(i=1;i<=n;i++) cout<<ans[i]<<endl; return 0; }"
input_ids = tokenizer(text, return_tensors="pt").input_ids  # Raw integer IDs; this does not seem like good training data


Current approach, at a high level:
* Train the AutoEncoder on the input IDs with an MSE or cross-entropy loss
* Generate new IDs for some test data
* Call tokenizer.decode() on the new IDs
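The choice between MSE and cross-entropy in the first step can be illustrated with a small pure-Python sketch. The helper names mse_on_ids and cross_entropy and the five-token toy vocabulary are made up for this example. It assumes that token IDs are arbitrary class labels, so numeric distance between two IDs carries no meaning: MSE on raw IDs grows with that distance, while cross-entropy depends only on the probability assigned to the correct token.

```python
import math


def mse_on_ids(pred_id, target_id):
    """Squared error between two raw token IDs (treats IDs as numbers)."""
    return (pred_id - target_id) ** 2


def cross_entropy(logits, target_id):
    """Negative log-probability of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_id]


# MSE: both predictions are simply "the wrong token", yet the errors differ wildly.
target = 100
near_error = mse_on_ids(101, target)    # off by one ID
far_error = mse_on_ids(30000, target)   # off by many IDs

# Cross-entropy over a toy 5-token vocabulary, correct token is ID 0.
# Putting the wrong guess on ID 1 or on ID 4 is penalized identically.
ce_wrong_near = cross_entropy([0.0, 5.0, 0.0, 0.0, 0.0], 0)
ce_wrong_far = cross_entropy([0.0, 0.0, 0.0, 0.0, 5.0], 0)
ce_correct = cross_entropy([5.0, 0.0, 0.0, 0.0, 0.0], 0)

print(near_error, far_error)
print(ce_wrong_near, ce_wrong_far, ce_correct)
```

Under this view, the very large loss seen with raw IDs would be a consequence of regressing on label numbers rather than a sign that the IDs themselves need normalizing.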