I’m working on the corrupting spans model on aminoacids on https://huggingface.co/Rostlab/prot_t5_xl_uniref50.
Consider this example:
Given the sentence with aminoacids I would like to predict if there would be an empty space or X in the sentence.
Original sequence: ‘A A A A A A X B B B B B C’
Input: ‘A A A A A A <extra_space_id_0> B B B B B <extra_space_id_1> C’
Target: ‘<extra_space_id_0> X <extra_space_id_1> NOTHING <extra_space_id_2>’
In target I should have only two possible outcomes: X or nothing (yes like binary classification task on sentence).
So how should I tokenize this “empty space”? What should be the correct target to feed in the model?
Are there any other better algorithms how to perform this task?
Thank you very much.