Converting input text sequences for relation extraction/classification

I’m fine-tuning some pre-trained language models (mostly BERT-like models) for a relation classification task. The input to my task is a relation in context, where the relation's two arguments appear as text spans in that context. For example:

Context: "I <s1>closed the windows</s1> because <s2>room was cold</s2>."

Relation: causal between s1 and s2.

Now I want to convert this context, together with the relation spans, into an input sequence for BERT. This is how I currently do it (assume the words are then tokenized by BERT's tokenizer):

[CLS] I [unused1] closed the windows [unused2] because [unused3] room was cold [unused4] . [SEP]

Here the [unused]* tokens come from the BERT vocabulary. Is this the best way to feed my input sequences to a model like BERT? My main goal is for the model to recognize the boundaries of the spans in context, and ideally to know where each text span starts and ends.
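For concreteness, here is a minimal sketch of my preprocessing using the HuggingFace transformers tokenizer (the model name and the choice of which [unused] tokens to use are just my own assumptions, not anything prescribed by the library):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# [unused1]..[unused4] already exist in BERT's vocabulary; registering them
# as special tokens just keeps the tokenizer from splitting them on the
# brackets. Since they are already in the vocab, the embedding matrix does
# not need to be resized.
markers = ["[unused1]", "[unused2]", "[unused3]", "[unused4]"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})

# My raw example, with the argument spans tagged as <s1>/<s2>.
context = "I <s1>closed the windows</s1> because <s2>room was cold</s2>."

# Swap my span tags for the marker tokens before tokenizing.
marked = (context
          .replace("<s1>", "[unused1] ").replace("</s1>", " [unused2]")
          .replace("<s2>", "[unused3] ").replace("</s2>", " [unused4]"))

encoding = tokenizer(marked, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
# expected: ['[CLS]', 'i', '[unused1]', 'closed', 'the', 'windows', '[unused2]',
#            'because', '[unused3]', 'room', 'was', 'cold', '[unused4]', '.', '[SEP]']
```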

Do I also need to make any changes to my model design so it can make use of those [unused]* tokens, or do you think it is necessary to add some kind of positional embedding for the spans?
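For reference, this is a sketch of the vanilla setup I would otherwise use (model name and the number of relation labels are placeholders): plain sequence classification over the marked sequence, where the prediction comes only from the [CLS] representation and the marker tokens are treated like any other vocabulary item with ordinary absolute position embeddings.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[unused1]", "[unused2]", "[unused3]", "[unused4]"]}
)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

marked = ("I [unused1] closed the windows [unused2] because "
          "[unused3] room was cold [unused4] .")
encoding = tokenizer(marked, return_tensors="pt")

# The classification head only sees the pooled [CLS] vector; nothing
# span-specific is added for the [unused] markers.
with torch.no_grad():
    logits = model(**encoding).logits
print(logits.shape)  # torch.Size([1, 2])
```

My question is essentially whether this is enough, or whether the model should explicitly use the representations at the marker/span positions.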