My goal, and what I've had minor success doing, is further fine-tuning prithivida/parrot_paraphraser_on_T5 on my individual corpus. The goal is to generate paraphrases in my unique style and sentence structure. I'm using LoRA adapters to train the model on a dataset of 215 examples (I know it's small) in this format:
{"input": "paraphrase: We have three today", "target": "Today there are only three."}
The target is my sentence, and the input is simply my sentence run through the base paraphraser.
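For reference, a minimal sketch of how such records could be assembled; `build_records` is a hypothetical helper name, and the base-paraphraser outputs are assumed to have been generated already:

```python
import json

def build_records(pairs):
    """pairs: iterable of (base_paraphrase, original_sentence) tuples.

    Returns training records where the input is the base paraphraser's
    rewrite (with the T5 task prefix) and the target is the original
    sentence in the author's own style.
    """
    return [
        {"input": "paraphrase: " + paraphrase, "target": original}
        for paraphrase, original in pairs
    ]

records = build_records([("We have three today", "Today there are only three.")])
print(json.dumps(records[0]))
```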
The base paraphraser generates paraphrases that pass AI detectors and are quite good. After fine-tuning, the results are somewhat good and the level of style transfer varies, but detectability has shot through the roof. Here is my inference setup:
import torch
from transformers import StoppingCriteriaList

def generate_paraphrases(input_text, model, tokenizer, max_length=64, top_k=100,
                         top_p=0.95, temperature=1.5, num_return_sequences=8,
                         num_beams=4, num_beam_groups=4, num_logit_biases=50):
    # Encode the input text prefixed with 'paraphrase:'
    input_ids = tokenizer.encode("paraphrase: " + input_text, return_tensors="pt").to(model.device)

    # Stop generation once sentence-ending punctuation is produced
    punctuation_tokens = tokenizer.convert_tokens_to_ids(['.', '!', '?'])
    stopping_criteria = StoppingCriteriaList([PunctuationStoppingCriteria(tokenizer, punctuation_tokens)])

    # Build a per-token bias vector from the global word_bias dict
    logit_bias = torch.zeros(tokenizer.vocab_size).to(model.device)
    for word, bias in list(word_bias.items())[:num_logit_biases]:
        token_id = tokenizer.convert_tokens_to_ids(word)
        if token_id < logit_bias.size(0):
            logit_bias[token_id] = bias * 10  # scaling the bias
    logits_processor = BiasLogitsProcessor(logit_bias, tokenizer.vocab_size)

    # Stage 1: diverse beam search to get varied candidate skeletons
    beam_output = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,
        num_return_sequences=num_beams,
        early_stopping=True,
        no_repeat_ngram_size=2,
        stopping_criteria=stopping_criteria,
        diversity_penalty=2.0,  # increased diversity penalty
        num_beam_groups=num_beam_groups,
        logits_processor=[logits_processor],
    )

    # Stage 2: re-paraphrase each beam with sampling
    paraphrases = []
    for beam in beam_output:
        sampled_outputs = model.generate(
            beam.unsqueeze(0).to(model.device),
            max_length=max_length,
            do_sample=True,
            top_k=top_k,
            top_p=top_p,
            temperature=temperature,
            num_return_sequences=num_return_sequences // num_beams,
            stopping_criteria=stopping_criteria,
            logits_processor=[logits_processor],
        )
        paraphrases.extend(tokenizer.decode(ids, skip_special_tokens=True) for ids in sampled_outputs)

    paraphrases = list(set(paraphrases))  # de-duplicate
    return paraphrases, logit_bias, num_beams
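For completeness, the two custom classes the function references are not shown, so here is a hedged sketch of what they might do; these are assumptions about their behavior, not the actual implementations. In real use they would subclass `transformers.StoppingCriteria` and `transformers.LogitsProcessor`, but `generate()` only invokes them through the call signatures shown:

```python
import torch

class PunctuationStoppingCriteria:
    """Assumed behavior: stop once every sequence ends with sentence-ending punctuation."""

    def __init__(self, tokenizer, punctuation_token_ids):
        self.tokenizer = tokenizer
        self.punctuation_token_ids = set(punctuation_token_ids)

    def __call__(self, input_ids, scores, **kwargs):
        # True halts generation for the whole batch
        return all(int(seq[-1]) in self.punctuation_token_ids for seq in input_ids)

class BiasLogitsProcessor:
    """Assumed behavior: add a fixed per-token bias vector to the logits at every step."""

    def __init__(self, logit_bias, vocab_size):
        self.logit_bias = logit_bias[:vocab_size]

    def __call__(self, input_ids, scores):
        return scores + self.logit_bias.to(scores.device)
```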
Any feedback or things I should implement to enhance this and get a good paraphrase that captures my style and stays undetectable? Or am I possibly going overkill here, and could I achieve a similar effect without fine-tuning at all?