I want to fine-tune CTRL with custom control codes. The control codes are specific tokens (not the broad domains from the paper, such as Wikipedia, finance, etc.). Since transformers officially supports CTRL, it would be great if some information could be provided about fine-tuning it in PyTorch.
- Control codes have to be prepended to the sequence to get high priority. So do we just prepend them? After prepending, do we add a token as a separating marker? The `CTRLTokenizer` does not seem to use tokens similar to `[CLS]`, `[SEP]`, or `</s>`. Even after adding them through the `add_special_tokens` method, the tokenizer does not use them. From this line it seems that the control codes, `sentence1`, and `sentence2` are not separated by any special tokens. Is this correct?
- Apart from setting the `labels` to `-100` for the context and setting `token_type_ids` to 1s for `input_ids[:context_end_position]`, do we have to do anything else?
- I'm seeing the warning `BPE merge indices are not consecutive. Please ensure that your tokenizer is not corrupted`. Why is this warning being raised?
- What is the use of `tokenizer.control_codes` exactly? The values are just the `input_ids` of the keys.
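To make the second point concrete, here is a minimal sketch of how I'm currently building `input_ids`, `labels`, and `token_type_ids`. The token ids and the `context_end_position` split are made up for illustration; the point is that the control code is simply prepended with no separator, and everything before the continuation is masked out of the loss:

```python
# Hypothetical ids: suppose the tokenizer mapped our custom control
# code to 5000, the context to [10, 11, 12], and the target
# continuation to [20, 21].
control_code_id = 5000          # made-up id of the custom control code
context_ids = [10, 11, 12]      # made-up context token ids
target_ids = [20, 21]           # made-up continuation token ids

# The control code is prepended directly -- no [SEP]-style marker.
input_ids = [control_code_id] + context_ids + target_ids

# Mask the control code and context with -100 so the cross-entropy
# loss is computed only on the continuation tokens.
context_end_position = 1 + len(context_ids)
labels = [-100] * context_end_position + target_ids

# token_type_ids: 1 for the context portion, 0 for the rest.
token_type_ids = [1] * context_end_position + [0] * len(target_ids)

assert len(input_ids) == len(labels) == len(token_type_ids)
```

If this is the right shape for the batch, the remaining question is only whether any separator token is expected between the pieces.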
How can we generate more diverse text? Greedy decoding is deterministic. Beam search is repetitive. Top-k/top-p sampling can produce overly random generations. Any tips on parameters? Greedy decoding with a repetition penalty produces decent results, but sometimes it generates nothing (empty output).
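For context on how I understand these knobs to interact, here is a toy, plain-Python sketch of one decoding step. This is not the transformers implementation, just the same ideas: a CTRL-style repetition penalty on the logits, then top-k and top-p (nucleus) filtering, then sampling from the renormalised distribution:

```python
import math
import random

def sample_next_token(logits, top_k=50, top_p=0.95,
                      repetition_penalty=1.2, generated=(), seed=None):
    """Toy next-token sampler: repetition penalty + top-k + top-p."""
    logits = list(logits)
    # CTRL-style repetition penalty: shrink logits of already-seen
    # tokens (divide positive logits, multiply negative ones).
    for t in set(generated):
        if logits[t] > 0:
            logits[t] = logits[t] / repetition_penalty
        else:
            logits[t] = logits[t] * repetition_penalty
    # Softmax over the penalised logits.
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i],
                   reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass >= top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalise over the kept tokens and sample one of them.
    mass = sum(probs[i] for i in kept)
    r = random.Random(seed).random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

In practice I pass the equivalent `do_sample=True`, `top_k`, `top_p`, and `repetition_penalty` arguments to `model.generate()`; my question is which value ranges people have found to balance diversity against the empty-output failure mode.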