I want to fine-tune CTRL with custom control codes. The control codes are specific tokens (not the broad domains used in the paper, such as Wikipedia, finance, etc.). Since transformers officially supports CTRL, it would be great if some information could be provided about fine-tuning it in PyTorch.
- Control codes have to be prepended to the sequence to get high priority. So do we just prepend them? After prepending, do we add a token as a separating marker? `CTRLTokenizer` does not seem to use tokens similar to the `</s>` token. Even after adding them through the `add_special_tokens` method, the tokenizer does not use them. From this line it seems that `sentence_1` and `sentence_2` are not separated by any special token. Is this correct?
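A minimal sketch of the prepending step being asked about, using plain Python lists of token ids rather than the real tokenizer (the ids and the optional separator are illustrative assumptions, not actual CTRL vocabulary entries):

```python
# Sketch: prepending a custom control code before the sequence of token ids.
# Whether a separator token is needed is exactly the open question above;
# the ids here are toy values, not real CTRL vocabulary ids.
def build_input_ids(control_code_id, sequence_ids, sep_id=None):
    """Prepend the control code (and optionally a separator) to the token ids."""
    ids = [control_code_id]
    if sep_id is not None:  # only if you decide a separating marker is needed
        ids.append(sep_id)
    return ids + sequence_ids

# toy example: control code id 5, sequence token ids [10, 11, 12]
print(build_input_ids(5, [10, 11, 12]))            # [5, 10, 11, 12]
print(build_input_ids(5, [10, 11, 12], sep_id=9))  # [5, 9, 10, 11, 12]
```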
- Apart from setting the labels to `-100` for the context and setting `token_type_ids` to 1s for `input_ids[:context_end_position]`, do we have to do anything else?
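To make the label-masking step concrete, here is a pure-Python sketch: positions set to `-100` are skipped by PyTorch's cross-entropy loss (`ignore_index=-100`, the default used by the transformers language-modeling heads). The `context_end_position` name is taken from the question above:

```python
# Sketch: mask the context portion of the labels with -100 so the loss
# ignores it; only the continuation tokens contribute to the loss.
def build_labels(input_ids, context_end_position):
    return [-100] * context_end_position + input_ids[context_end_position:]

input_ids = [5, 10, 11, 12, 13]
labels = build_labels(input_ids, context_end_position=2)
print(labels)  # [-100, -100, 11, 12, 13]
```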
- I’m seeing the warning `BPE merge indices are not consecutive. Please ensure that your tokenizer is not corrupted.` Why is this warning being raised?
- What is the use of `tokenizer.control_codes` exactly? The values are just
- How can we better generate diverse text? Greedy decoding is deterministic. Beam search is repetitive. Top-k/top-p sampling can produce random generations. Any tips on parameter settings? Greedy decoding with a repetition penalty produces decent results, but sometimes nothing is generated (empty output).
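For reference, the knobs involved (`do_sample=True`, `top_k`, `top_p`, `temperature`, `repetition_penalty`) are all arguments to `model.generate` in transformers. Below is a pure-Python sketch of what one decoding step does when these are combined; the parameter values are illustrative starting points, not tuned recommendations:

```python
import math
import random

def sample_next(logits, temperature=0.9, top_k=50, repetition_penalty=1.2, prev_ids=()):
    """Sketch of one decoding step combining a repetition penalty,
    top-k filtering, and temperature-scaled sampling."""
    scores = list(logits)
    # penalize tokens already generated (the CTRL-style repetition penalty)
    for i in prev_ids:
        scores[i] = scores[i] / repetition_penalty if scores[i] > 0 else scores[i] * repetition_penalty
    # keep only the top_k highest-scoring tokens
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # temperature-scaled softmax over the surviving tokens
    exps = [math.exp(scores[i] / temperature) for i in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    # draw one token from the resulting distribution
    r, acc = random.random(), 0.0
    for tok, p in zip(top, probs):
        acc += p
        if r <= acc:
            return tok
    return top[-1]
```

With `top_k=1` this degenerates to greedy decoding; raising `top_k` (or lowering it toward 1) trades diversity against randomness, which is the tension described above.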