T5 Model, T5 Encoder Model and T5 Model for Conditional Generation

I’m trying to understand the difference between the models in the topic title.

My understanding so far:
The T5 model is built as an encoder-decoder setup (similar to an autoencoder, I guess?). The T5 Encoder Model is just the encoder part of that setup.

Assuming I have a tokenized sentence of length N, applying the T5 encoder part results in a tensor of size N x d_model, i.e. N embedding vectors (see the sketch after these questions).
(1.) a. What happens if I apply the whole model? What’s the output dimensionality?
(1.) b. What would the output be and how could I use it differently from the encoder part? Is it meaningful?
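
For reference, this is roughly how I get those encoder embeddings (just a sketch, assuming the Hugging Face transformers T5EncoderModel and the t5-small checkpoint, where d_model = 512):

```python
from transformers import T5Tokenizer, T5EncoderModel

# Assumption: Hugging Face transformers with the t5-small checkpoint (d_model = 512)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

inputs = tokenizer("Hello, I am Bob", return_tensors="pt")
out = encoder(**inputs)
print(out.last_hidden_state.shape)  # torch.Size([1, N, 512]): N embedding vectors of size d_model
```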

Regarding the Conditional Generation Model: given an input sequence, it can generate an output sequence (e.g. a translation).
(2.) a. What is the general purpose of the conditional generation T5 Model, i.e. what does it offer that the normal T5 Model does not?
(2.) b. Where do they differ?

Moreover, assuming I have pairs of sentences in 2 languages:
(“Hello, I am Bob”, “Hola, soy bob”)
(3.) a. How should I train / fine-tune the T5 Model to provide me with an embedding space that reflects properties from both languages? Is that even possible?
(3.) b. Would I need to use the conditional generation T5 Model?
(3.) c. How would I prepare my data for this task? (See the sketch after these questions for what I currently have in mind.)
(3.) d. Assuming I have a T5 Model trained via MLM for English, can I adapt it for translation through fine-tuning somehow, or would I need to train a T5 Model from scratch?
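
For concreteness, this is the kind of data preparation I have in mind (just a sketch, assuming the Hugging Face tokenizer and the task-prefix convention from the T5 paper; the exact prefix string is my own guess):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

pairs = [("Hello, I am Bob", "Hola, soy bob")]

# Prepend a task prefix to the source side (text-to-text convention); the prefix wording is assumed
model_inputs = tokenizer(
    ["translate English to Spanish: " + en for en, _ in pairs],
    return_tensors="pt", padding=True,
)
labels = tokenizer(
    [es for _, es in pairs],
    return_tensors="pt", padding=True,
).input_ids

# model_inputs.input_ids and labels would then go to a conditional generation model for fine-tuning
```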

The T5 model has encoder and decoder parts. You give it the input text that the model needs to look at in order to generate the output. The encoder encodes your input text into a tensor of shape [number_of_tokens, 512] (512 is d_model for t5-small), and this encoded information is passed to the decoder. The decoder takes your encoded text plus the previously generated tokens (decoder_inputs) and produces hidden states of shape [number_of_decoder_tokens, 512]; with the language-modeling head on top, these become logits of shape [number_of_decoder_tokens, 32128], where 32128 is the vocabulary size of the standard T5 checkpoints.
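
A rough sketch of that flow with the plain T5Model (assuming the Hugging Face transformers implementation and the t5-small checkpoint):

```python
import torch
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5Model.from_pretrained("t5-small")

enc = tokenizer("translate English to Spanish: Hello, I am Bob", return_tensors="pt")
# The plain T5Model has no lm_head, so decoder inputs have to be supplied manually;
# here the decoder is started with its start token (the pad token for T5).
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

out = model(input_ids=enc.input_ids, decoder_input_ids=decoder_input_ids)
print(out.last_hidden_state.shape)          # torch.Size([1, 1, 512]): decoder hidden states (d_model)
print(out.encoder_last_hidden_state.shape)  # torch.Size([1, N, 512]): encoder hidden states
```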

The point here is that if you load the model without Conditional Generation, it comes without the lm_head, which is the layer that projects the decoder's hidden state (dimension d_model = 512) to the vocabulary size (32128).
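
A minimal sketch of the same call with T5ForConditionalGeneration, where the lm_head gives you vocabulary-sized logits and lets you call generate() (again assuming the t5-small checkpoint):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer("translate English to German: Hello, I am Bob", return_tensors="pt")
labels = tokenizer("Hallo, ich bin Bob", return_tensors="pt").input_ids

# With the lm_head, decoder hidden states (d_model) are projected to vocabulary logits
out = model(input_ids=enc.input_ids, labels=labels)
print(out.logits.shape)  # torch.Size([1, num_label_tokens, 32128])

# generate() is what you would use to actually produce output text
ids = model.generate(enc.input_ids, max_new_tokens=20)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```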
I have been working on the T5 model for a month; if you have other questions you can write them here, or email me at enesmahmut3774@gmail.com.
