When performing tasks other than seq2seq generation, can I still use T5's input ids, or how do people typically process this information? It seems that if I build an AutoEncoder model to learn the relationship between the input_ids, the loss is very large. Does it make sense to:
- Use the model's encoder via model.encoder(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'], return_dict=True)? If so, how do I decode later with my AutoEncoder's learned embeddings? tokenizer.decode expects ids, whereas my model will be producing embeddings. (A rough sketch of what I mean is below.)
- Use the tokenizer's input ids directly? If so, these values break my loss function because the ids become very large. Is there a way around this? I cannot normalize them, since I ultimately need to decode to validate the loss. Or would I need an embedding layer in this case, as in this example?
- Not use T5 at all and use a sentence transformer instead? I am unsure whether this will work, since my input data is source code rather than natural-language text, and sentence transformers seem to be trained for pure NLP tasks.
- Some other method?
At a high level I am just not sure how to structure my data so that I can try using an AutoEncoder while still being able to decode the data later. Any tips?
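For concreteness, here is a rough sketch of what I was imagining for the first two options: take the encoder's hidden states (or the model's own input embeddings) as the AutoEncoder's training data, and map the reconstructed vectors back to token ids by nearest neighbor against the embedding matrix so that tokenizer.decode can still be used at the end. The autoencoder call and the nearest-embedding decoding step are just my assumptions, not something I have verified.

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
model.eval()

enc = tokenizer("int main() { return 0; }", return_tensors="pt")

with torch.no_grad():
    # Option 1: contextual representations from T5's encoder
    hidden = model.encoder(
        input_ids=enc['input_ids'],
        attention_mask=enc['attention_mask'],
        return_dict=True,
    ).last_hidden_state                                       # (1, seq_len, d_model)

    # Option 2: plain (non-contextual) embedding lookup instead of raw ids
    embedded = model.get_input_embeddings()(enc['input_ids'])  # (1, seq_len, d_model)

# ... train the AutoEncoder on `hidden` (or `embedded`) with MSE ...
# reconstructed = autoencoder(hidden)                         # hypothetical AutoEncoder output

def embeddings_to_ids(vectors):
    # Guess: map each reconstructed vector to the id of the closest row of the
    # embedding matrix, so that tokenizer.decode() can be applied afterwards.
    emb = model.get_input_embeddings().weight                 # (vocab_size, d_model)
    return torch.matmul(vectors, emb.t()).argmax(dim=-1)      # (1, seq_len)

# ids = embeddings_to_ids(reconstructed)
# print(tokenizer.decode(ids[0], skip_special_tokens=True))

Is reconstructing the encoder's hidden states and decoding them this way a reasonable setup, or is there a more standard approach?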
My current code, for reference:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
text = "int numcount=0; void divide(int num,int x) { int i; if(num==1) numcount++; for(i=x;i<=num;i++) { if(num%i==0) divide(num/i,i); } } int main() { int n,num,i; int first=2; int ans[100]; cin>>n; for(i=1;i<=n;i++) { cin>>num; divide(num,first); ans[i]=numcount; //cout<<count<<endl; numcount=0; } for(i=1;i<=n;i++) cout<<ans[i]<<endl; return 0; }"
input_ids = tokenizer(text, return_tensors="pt").input_ids  # This does not seem like good training data
...
"""
Current approach high level:
* Train AutoEncoder w/ MSE or CrossEntropy values w/ Input Ids
* Generate new Id's for some test data
* Tokenizer.decode() w/ new Id's
"""