I want to fine-tune CTRL with custom control codes. The control codes are specific tokens (not the broad domains from the paper, such as Wikipedia, finance, etc.). Since transformers officially supports CTRL, it would be great if some information could be provided about fine-tuning it in PyTorch.
- Control codes have to be prepended to the sequence to get high priority. So do we just prepend them? After prepending, do we add a token as a separating marker? The `CTRLTokenizer` does not seem to use tokens similar to `[CLS]`, `[SEP]`, or `</s>`. Even after adding them through the `add_special_tokens` method, the tokenizer does not use them. From this line it seems that the control codes, `sentence1`, and `sentence2` are not separated by any special tokens. Is this correct?
- Apart from setting the `labels` to `-100` for the context and setting `token_type_ids` to 1s for `input_ids[:context_end_position]`, do we have to do anything else?
- I'm seeing the warning `BPE merge indices are not consecutive. Please ensure that your tokenizer is not corrupted`. Why is this warning being raised?
- What is the use of `tokenizer.control_codes` exactly? The values are just the `input_ids` of the keys.
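To make the second point concrete, here is a minimal sketch of how I'm currently building `input_ids`, `labels`, and `token_type_ids`. The token ids and the `context_end_position` split are made up for illustration; the point is that the control code is simply prepended with no separator, and everything before the continuation is masked out of the loss:

```python
# Hypothetical ids: suppose the tokenizer mapped our custom control
# code to 5000, the context to [10, 11, 12], and the target
# continuation to [20, 21].
control_code_id = 5000          # made-up id of the custom control code
context_ids = [10, 11, 12]      # made-up context token ids
target_ids = [20, 21]           # made-up continuation token ids

# The control code is prepended directly -- no [SEP]-style marker.
input_ids = [control_code_id] + context_ids + target_ids

# Mask the control code and context with -100 so the cross-entropy
# loss is computed only on the continuation tokens.
context_end_position = 1 + len(context_ids)
labels = [-100] * context_end_position + target_ids

# token_type_ids: 1 for the context portion, 0 for the rest.
token_type_ids = [1] * context_end_position + [0] * len(target_ids)

assert len(input_ids) == len(labels) == len(token_type_ids)
```

If this is the right shape for the batch, the remaining question is only whether any separator token is expected between the pieces.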
How can we generate more diverse text? Greedy decoding is deterministic. Beam search is repetitive. Top-k/top-p sampling can produce overly random generations. Any tips on parameters? Greedy decoding with a repetition penalty produces decent results, but sometimes it generates nothing (empty output).
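For context on how I understand these knobs to interact, here is a toy, plain-Python sketch of one decoding step. This is not the transformers implementation, just the same ideas: a CTRL-style repetition penalty on the logits, then top-k and top-p (nucleus) filtering, then sampling from the renormalised distribution:

```python
import math
import random

def sample_next_token(logits, top_k=50, top_p=0.95,
                      repetition_penalty=1.2, generated=(), seed=None):
    """Toy next-token sampler: repetition penalty + top-k + top-p."""
    logits = list(logits)
    # CTRL-style repetition penalty: shrink logits of already-seen
    # tokens (divide positive logits, multiply negative ones).
    for t in set(generated):
        if logits[t] > 0:
            logits[t] = logits[t] / repetition_penalty
        else:
            logits[t] = logits[t] * repetition_penalty
    # Softmax over the penalised logits.
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i],
                   reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass >= top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalise over the kept tokens and sample one of them.
    mass = sum(probs[i] for i in kept)
    r = random.Random(seed).random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

In practice I pass the equivalent `do_sample=True`, `top_k`, `top_p`, and `repetition_penalty` arguments to `model.generate()`; my question is which value ranges people have found to balance diversity against the empty-output failure mode.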