I had been working with Flan-T5 for weeks and everything was fine. Now I am switching to Llama 2, which brings many unfamiliar changes; I know one is a seq2seq model and the other is decoder-only. I have 3 general questions, and thanks in advance for any advice:
- I saw different ways to prepare the data. One way is to add some special tokens like:
instruction = f"[INST] {sample[‘Instruction’]}: Question: {sample[‘Question’]} [/INST]"
response = f"Answer: {cleaned_response}"
sample[“text”] = instruction + response + tokenizer.eos_token
Here the text is not tokenized at all; it is passed as a text field to SFTTrainer. The other way is to do the preprocessing yourself, producing 'input_ids' and 'attention_mask', and then use Trainer instead of SFTTrainer. Are these two approaches both valid, and how does one differ from the other? (A rough sketch of what I mean by the second approach is right below.)
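For reference, this is a minimal sketch of the second approach as I understand it. The model name, max_length, the 'Answer' field, and the dataset variable are placeholders/assumptions on my part; the point is only the manual tokenization plus Trainer with a causal-LM collator:

```python
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, Trainer,
    TrainingArguments, DataCollatorForLanguageModeling,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

def preprocess(sample):
    # Build the same prompt + response string, then tokenize it ourselves
    text = (
        f"[INST] {sample['Instruction']}: Question: {sample['Question']} [/INST]"
        f"Answer: {sample['Answer']}{tokenizer.eos_token}"  # 'Answer' field is an assumption
    )
    return tokenizer(text, truncation=True, max_length=512)

# `dataset` is assumed to be a datasets.Dataset with those fields
tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

# The collator pads each batch and copies input_ids into labels (mlm=False => causal LM)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2),
    train_dataset=tokenized_dataset,
    data_collator=collator,
)
trainer.train()
```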
- It seemed okay to train with the default loss, but when I tried to use compute_metrics (e.g., a customized metric like ROUGE), it always got errors. When I printed the predictions and labels from `predictions, labels = eval_pred`, I got decimals and negative numbers (see below), which seem wrong (the same thing in Flan-T5 gave integers). A minimal version of the compute_metrics I am trying is sketched after the printout.
[[[ -5.887479 7.898579 3.145651 … -2.376532 -2.9236283
-4.825788 ]
[ -5.887479 7.898579 3.145651 … -2.376532 -2.9236283
-4.825788 ]
[ -6.1026587 6.603288 2.057882 … -6.1850896 -5.1317277
-2.7486215 ]
…
[ -3.0410564 -0.925742 1.1579411 … -2.830701 -3.4802294
-2.2513897 ]
[ -3.037336 -0.927284 1.1546779 … -2.8302677 -3.476878
-2.2484875 ]
[ -2.997214 -0.9113009 1.1302392 … -2.8121407 -3.4593928
  -2.2332683 ]]
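For context, this is roughly the kind of compute_metrics I am trying to get working (ROUGE via the evaluate library; `tokenizer` is the Llama 2 tokenizer, and the argmax over the last axis is my current guess for turning these float values into token ids, since they look like logits):

```python
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # For a causal LM the predictions appear to be logits of shape
    # (batch, seq_len, vocab_size), so take the argmax over the vocabulary
    # to get token ids before decoding
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)
    # Replace the -100 used for ignored label positions before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)
```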
- In Flan-T5, we clearly defined the question and the answer, so input_ids were the tokenized question and labels were the tokenized answer. For Llama 2, the examples I saw put everything together in one string. Does that mean that during fine-tuning the model learns by itself (self-supervised learning) and input_ids and labels are the same? And when the model is evaluated on a test dataset, are input_ids and labels different? (A sketch of what I think is happening is below.)
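My current understanding of what the causal-LM collator does with the single concatenated string is something like this (a minimal sketch, not the actual library code; the token ids in the fake batch are arbitrary):

```python
import torch

def make_causal_lm_labels(batch, pad_token_id):
    # For causal-LM fine-tuning the labels start as a copy of input_ids;
    # the model shifts them internally so each position predicts the next token.
    labels = batch["input_ids"].clone()
    # Padding positions are set to -100 so they are ignored by the loss
    labels[labels == pad_token_id] = -100
    batch["labels"] = labels
    return batch

# Example with a tiny fake batch (ids are made up)
batch = {"input_ids": torch.tensor([[1, 529, 25580, 29962, 2]])}
batch = make_causal_lm_labels(batch, pad_token_id=0)
print(batch["labels"])  # same ids as input_ids here, since nothing is padding
```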
Thanks again; apologies for the many points of confusion!