Finetuning CTRL

I want to fine-tune CTRL with custom control codes. My control codes are specific tokens, not the broad domains used in the paper (Wikipedia, finance, etc.). Since transformers officially supports CTRL, it would be great if you could provide some information about fine-tuning it in PyTorch.
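To make the question concrete, here is the loss plumbing I currently have, with a toy embedding + linear head standing in for CTRLLMHeadModel (all ids and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-in for a causal LM head model, just to show the label/loss plumbing.
vocab_size, hidden = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

input_ids = torch.tensor([[5, 7, 9, 11, 13]])  # control code id + sequence ids
labels = input_ids.clone()
labels[:, :2] = -100                           # mask control code + context

logits = model(input_ids)
# Shift so that the logits at position t predict the token at t+1, as in a causal LM.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = nn.functional.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
    ignore_index=-100,  # masked positions contribute nothing to the loss
)
loss.backward()
opt.step()
```

Is this the right shape of training loop, or does CTRL need anything beyond the standard shifted cross-entropy?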

  • Control codes have to be prepended to the sequence to get high priority. Do we just prepend them? After prepending, do we add a token as a separating marker? CTRLTokenizer does not seem to use tokens similar to [CLS], [SEP], or </s>. Even after adding them through the add_special_tokens method, the tokenizer does not use them. From this line it seems that the control code, sentence1, and sentence2 are not separated by any special tokens. Is this correct?
  • Apart from setting the labels to -100 for the context and setting token_type_ids to 1 for input_ids[:context_end_position], do we have to do anything else?
  • I’m seeing the warning “BPE merge indices are not consecutive. Please ensure that your tokenizer is not corrupted.” Why is this warning raised?
  • What is the use of tokenizer.control_codes exactly? The values are just the input_ids of the keys.
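In case it clarifies the questions above: this is how I am currently assembling examples, with no separator tokens and the masking described in the second bullet (all token ids below are hypothetical):

```python
import torch

def build_example(control_code_ids, context_ids, target_ids):
    """Prepend the control code ids, then mask everything before the target.

    Labels are -100 for the control code + context so they are ignored by the
    loss; token_type_ids are 1 up to the context end, 0 for the target.
    """
    input_ids = control_code_ids + context_ids + target_ids
    context_end = len(control_code_ids) + len(context_ids)
    labels = [-100] * context_end + target_ids
    token_type_ids = [1] * context_end + [0] * len(target_ids)
    return (
        torch.tensor([input_ids]),
        torch.tensor([labels]),
        torch.tensor([token_type_ids]),
    )
```

Is this layout (control code directly followed by the sentences, no special tokens in between) what the model expects?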

How can we better generate diverse text? Greedy decoding is deterministic, beam search is repetitive, and top-k/top-p sampling can produce random generations. Any tips on parameter settings? Greedy decoding with a repetition penalty produces decent results, but sometimes nothing is generated (empty output).
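For reference, this is my understanding of how top-k/top-p filtering works, as a minimal sketch in plain PyTorch (not the transformers implementation):

```python
import torch

def top_k_top_p_filter(logits, top_k=0, top_p=0.0):
    """Set logits outside the top-k / nucleus (top-p) set to -inf before sampling."""
    logits = logits.clone()
    if top_k > 0:
        # Keep only the k highest-scoring tokens.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits[logits < kth] = float("-inf")
    if top_p > 0.0:
        # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        # Remove a token if the mass *before* it already exceeds top_p
        # (this always keeps at least the most likely token).
        remove = cumulative - probs > top_p
        sorted_logits[remove] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return logits
```

In transformers, I believe the corresponding knobs are model.generate(do_sample=True, top_k=..., top_p=..., repetition_penalty=..., min_length=...); would setting min_length be the right way to avoid the empty outputs I see with greedy decoding plus repetition penalty?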