T5 spans: how to predict whether a span is EMPTY or not

Hello!

I’m working on a span-corruption model for amino acid sequences, based on https://huggingface.co/Rostlab/prot_t5_xl_uniref50.

Consider this example:

Given a sequence of amino acids, I would like to predict whether a marked position in the sequence contains an X or nothing (an empty space).

Original sequence: ‘A A A A A A X B B B B B C’
Input: ‘A A A A A A <extra_id_0> B B B B B <extra_id_1> C’
Target: ‘<extra_id_0> X <extra_id_1> NOTHING <extra_id_2>’
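
To make the example concrete, here is a minimal sketch of how such input/target pairs could be built. It assumes the sentinels are T5's standard `<extra_id_N>` tokens and represents "nothing" as an empty span, so that two sentinels simply follow each other in the target. That representation is an assumption for illustration, not something taken from the T5 pretraining format. The helper `build_example` and its parameters are hypothetical names.

```python
def build_example(sequence, query_positions, fill="X"):
    """Build a (model_input, target) pair for the X-or-nothing task.

    sequence:        space-separated residues, e.g. "A A X B"
    query_positions: indices into the original sequence to ask about.
                     If the residue there is `fill`, it is replaced by a
                     sentinel; otherwise a sentinel is inserted before it.
    """
    tokens = sequence.split()
    # Map each queried position to its sentinel number, in left-to-right order.
    queries = {pos: k for k, pos in enumerate(sorted(query_positions))}

    input_parts, target_parts = [], []
    for i, tok in enumerate(tokens):
        if i in queries:
            sentinel = f"<extra_id_{queries[i]}>"
            input_parts.append(sentinel)
            target_parts.append(sentinel)
            if tok == fill:
                # The span contains X: X goes to the target, not the input.
                target_parts.append(fill)
                continue
            # Otherwise the span is empty: nothing is added to the target.
        input_parts.append(tok)

    # Closing sentinel, as in T5-style span-corruption targets.
    target_parts.append(f"<extra_id_{len(queries)}>")
    return " ".join(input_parts), " ".join(target_parts)


model_input, target = build_example("A A A A A A X B B B B B C", [6, 12])
print(model_input)  # A A A A A A <extra_id_0> B B B B B <extra_id_1> C
print(target)       # <extra_id_0> X <extra_id_1> <extra_id_2>
```

With this encoding there is no special token for the empty case; the model would have to learn to emit the next sentinel immediately.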

In the target there should be only two possible outcomes for each masked span: X or nothing (much like a binary classification task on the sentence).

  1. How should I tokenize this “empty space”? What is the correct target to feed to the model?

  2. Is there a better approach to this task?
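
As one possible alternative framing for question 2, the same data could be expressed as per-residue binary labels instead of generated spans: remove every X from the sequence and label each remaining residue 1 if an X followed it in the original, else 0. Such labels could then feed a token-classification head, for example on top of the encoder alone. The sketch below only shows the label construction; the helper `gap_labels` is a hypothetical name, and whether this works better than span generation is untested.

```python
def gap_labels(sequence, fill="X"):
    """Strip `fill` tokens and return (residues, labels).

    labels[i] is 1 if residues[i] was directly followed by `fill`
    in the original sequence, and 0 otherwise.
    """
    residues, labels = [], []
    for tok in sequence.split():
        if tok == fill:
            if labels:
                labels[-1] = 1  # mark the residue just before the X
            # An X at the very start has no preceding residue and is
            # ignored in this sketch.
        else:
            residues.append(tok)
            labels.append(0)
    return residues, labels


residues, labels = gap_labels("A A A A A A X B B B B B C")
print(" ".join(residues))  # A A A A A A B B B B B C
print(labels)              # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```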

Thank you very much.