I want to fine-tune CTRL with custom control codes. The control codes are specific tokens (not the broad domains from the paper, like Wikipedia, Finance, etc.). Since `transformers` officially supports CTRL, it would be great if some information could be provided about fine-tuning it in PyTorch.
- Control codes have to be prepended to the sequence to get high priority. So do we just prepend them? After prepending, do we add a token as a separating marker? The `CTRLTokenizer` does not seem to use tokens similar to `[CLS]`, `[SEP]`, or `</s>`. Even after adding them through the `add_special_tokens` method, the tokenizer does not use them. From this line it seems that the control code, `sentence1`, and `sentence2` are not separated by any special tokens. Is this correct? (A sketch of what I am currently doing is included after this list.)
- Apart from setting the `labels` to `-100` for the context and setting `token_type_ids` to 1s for `input_ids[:context_end_position]`, do we have to do anything else? (See the second sketch after the list.)
- I'm seeing the warning `BPE merge indices are not consecutive. Please ensure that your tokenizer is not corrupted.` Why is this warning being raised?
- What is the use of `tokenizer.control_codes` exactly? The values are just the `input_ids` of the keys.
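For reference, here is a minimal sketch of how I am prepending a custom control code right now (`my_code` and the example sentence are just placeholders): I simply concatenate the code and the text with a space and add no separator token in between.

```python
from transformers import CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained("ctrl")

# "my_code" is a placeholder for one of my custom control codes.
control_code = "my_code"
text = "some training sentence for this control code"

# Just prepend the control code, with no separating marker.
encoding = tokenizer(control_code + " " + text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
# No [CLS]/[SEP]/</s>-style token appears between the code and the sentence.
```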
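And this is roughly how I am masking the context when building the labels (the helper name and `context_end_position` are just illustrative):

```python
import torch

def mask_context(input_ids: torch.Tensor, context_end_position: int):
    """Set labels to -100 for the context and token_type_ids to 1 for the same span."""
    labels = input_ids.clone()
    labels[:, :context_end_position] = -100          # ignored by the LM loss
    token_type_ids = torch.zeros_like(input_ids)
    token_type_ids[:, :context_end_position] = 1     # mark the context positions
    return labels, token_type_ids
```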
How can we generate more diverse text? Greedy decoding is deterministic, beam search is repetitive, and top-k/top-p sampling can produce random generations. Any tips on parameters? Greedy decoding with a repetition penalty produces decent results, but sometimes nothing is generated at all (empty output).
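For context, these are roughly the generation settings I am experimenting with (the prompt and the parameter values are just what I have tried so far, not recommendations):

```python
from transformers import CTRLLMHeadModel, CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")

# "my_code" is a placeholder custom control code prepended to the prompt.
input_ids = tokenizer("my_code some prompt text", return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    do_sample=True,            # sampling instead of greedy / beam search
    max_length=100,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    repetition_penalty=1.2,    # the CTRL paper suggests ~1.2
    num_return_sequences=3,
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```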