I’ve seen the docs on how to define the encoding format with TemplateProcessing. For example, for BERT it would be:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 0)],
)
But what about encoder-decoder models? What I want to achieve is:
- The encoder gets a sentence with CLS and SEP:
<cls> sen1 <sep>
- The decoder gets a sentence with BOS and EOS:
<s> sen2 </s>
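To make the target concrete, here is a minimal sketch in plain Python of the two sequences I am after. It does not call the tokenizers library at all: `wrap` is a hypothetical helper, and the word lists are just stand-ins for whatever the tokenizer would produce for sen1 and sen2.

```python
# Illustrative only: plain Python, no tokenizers calls.
# `wrap` is a hypothetical helper, not a library function.
def wrap(tokens, start, end):
    """Return the tokens surrounded by the given start and end special tokens."""
    return [start, *tokens, end]


sen1_tokens = ["hello", "world"]    # stand-in for the tokenized source sentence
sen2_tokens = ["bonjour", "monde"]  # stand-in for the tokenized target sentence

# Encoder side: <cls> sen1 <sep>
encoder_input = wrap(sen1_tokens, "<cls>", "<sep>")

# Decoder side: <s> sen2 </s>
decoder_target = wrap(sen2_tokens, "<s>", "</s>")

print(encoder_input)
print(decoder_target)
```

In other words, the source and the target need different special tokens, while the BERT template above only describes one scheme.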
I call my tokenizer like this: x = tokenizer([sen1], text_target=[sen2])
So my question is: how can I define TemplateProcessing to achieve this kind of format?