My goal, and what I've had minor success doing, is further fine-tuning prithivida/parrot_paraphraser_on_T5 on my individual corpus. The goal is to generate paraphrases in my unique style and sentence structure. I'm using LoRA adapters to train the model on a dataset of 215 examples (I know it's small) in this format:
{"input": "paraphrase: We have three today", "target": "Today there are only three."}
The target is my sentence, and the input is simply my sentence run through the base paraphraser.
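For reference, a minimal sketch of how such records could be assembled; `build_records` is a hypothetical helper name, and the base-paraphraser outputs are assumed to have been generated already:

```python
import json

def build_records(pairs):
    """pairs: iterable of (base_paraphrase, original_sentence) tuples.

    Returns training records where the input is the base paraphraser's
    rewrite (with the T5 task prefix) and the target is the original
    sentence in the author's own style.
    """
    return [
        {"input": "paraphrase: " + paraphrase, "target": original}
        for paraphrase, original in pairs
    ]

records = build_records([("We have three today", "Today there are only three.")])
print(json.dumps(records[0]))
```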
The base paraphraser generates paraphrases that pass AI detectors and are quite good. After fine-tuning, the results are somewhat good and the level of style transfer varies, but detectability has shot through the roof. Here is my inference setup:
import torch
from transformers import StoppingCriteriaList

def generate_paraphrases(input_text, model, tokenizer, max_length=64, top_k=100,
                         top_p=0.95, temperature=1.5, num_return_sequences=8,
                         num_beams=4, num_beam_groups=4, num_logit_biases=50):
    # Encode the input text prefixed with 'paraphrase:'
    input_ids = tokenizer.encode("paraphrase: " + input_text, return_tensors="pt").to(model.device)

    # Stop generation once sentence-ending punctuation is produced
    punctuation_tokens = tokenizer.convert_tokens_to_ids(['.', '!', '?'])
    stopping_criteria = StoppingCriteriaList([PunctuationStoppingCriteria(tokenizer, punctuation_tokens)])

    # Build a per-token bias vector from the global word_bias dict
    logit_bias = torch.zeros(tokenizer.vocab_size).to(model.device)
    for word, bias in list(word_bias.items())[:num_logit_biases]:
        token_id = tokenizer.convert_tokens_to_ids(word)
        if token_id < logit_bias.size(0):
            logit_bias[token_id] = bias * 10  # scaling the bias
    logits_processor = BiasLogitsProcessor(logit_bias, tokenizer.vocab_size)

    # Stage 1: diverse beam search to get varied candidate skeletons
    beam_output = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,
        num_return_sequences=num_beams,
        early_stopping=True,
        no_repeat_ngram_size=2,
        stopping_criteria=stopping_criteria,
        diversity_penalty=2.0,  # increased diversity penalty
        num_beam_groups=num_beam_groups,
        logits_processor=[logits_processor],
    )

    # Stage 2: re-paraphrase each beam with sampling
    paraphrases = []
    for beam in beam_output:
        sampled_outputs = model.generate(
            beam.unsqueeze(0).to(model.device),
            max_length=max_length,
            do_sample=True,
            top_k=top_k,
            top_p=top_p,
            temperature=temperature,
            num_return_sequences=num_return_sequences // num_beams,
            stopping_criteria=stopping_criteria,
            logits_processor=[logits_processor],
        )
        paraphrases.extend(tokenizer.decode(ids, skip_special_tokens=True) for ids in sampled_outputs)

    paraphrases = list(set(paraphrases))  # de-duplicate
    return paraphrases, logit_bias, num_beams
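For completeness, the two custom classes the function references are not shown, so here is a hedged sketch of what they might do; these are assumptions about their behavior, not the actual implementations. In real use they would subclass `transformers.StoppingCriteria` and `transformers.LogitsProcessor`, but `generate()` only invokes them through the call signatures shown:

```python
import torch

class PunctuationStoppingCriteria:
    """Assumed behavior: stop once every sequence ends with sentence-ending punctuation."""

    def __init__(self, tokenizer, punctuation_token_ids):
        self.tokenizer = tokenizer
        self.punctuation_token_ids = set(punctuation_token_ids)

    def __call__(self, input_ids, scores, **kwargs):
        # True halts generation for the whole batch
        return all(int(seq[-1]) in self.punctuation_token_ids for seq in input_ids)

class BiasLogitsProcessor:
    """Assumed behavior: add a fixed per-token bias vector to the logits at every step."""

    def __init__(self, logit_bias, vocab_size):
        self.logit_bias = logit_bias[:vocab_size]

    def __call__(self, input_ids, scores):
        return scores + self.logit_bias.to(scores.device)
```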
Any feedback or things I should implement to enhance this and get a good paraphrase that captures my style and stays undetectable? Or am I possibly going overkill here, and could I achieve a similar effect without fine-tuning at all?