Hi everyone, I'm trying to run finetune.py to distill a Pegasus student (16-2) on XSum.
I think the problem is in the encoding of the source data:
File "D:\Repos\transformers\examples\seq2seq\utils.py", line 147, in get_char_lens
    return [len(x) for x in Path(data_file).open().readlines()]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4245: character maps to <undefined>
This is the command I launch:
python finetune.py --learning_rate=1e-4 --do_train --do_predict --n_val 1000 --val_check_interval 0.25 --max_source_length 512 --max_target_length 56 --freeze_embeds --label_smoothing 0.1 --adafactor --task summarization_xsum --data_dir xsum --train_batch_size=1 --eval_batch_size=1 --output_dir distilpeg_xsum_sft_16_2 --num_train_epochs 6 --model_name_or_path distilpeg_xsum_16_2 --gpus 1
Maybe there are unsupported characters in the data?
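One possible explanation, offered as a guess rather than a confirmed diagnosis: the 'charmap' codec in the traceback suggests that open() is falling back to the Windows locale encoding (typically cp1252) instead of UTF-8. Byte 0x9d is undefined in cp1252, but it is the last byte of the UTF-8 encoding of a right curly quote (U+201D), which is plausible in news text. The sketch below reproduces that situation on a temporary file and shows a variant of the failing line with an explicit encoding. The name get_char_lens_utf8 and the file name train.source are hypothetical, chosen only for illustration, and the sketch assumes the XSum files are UTF-8.

```python
import tempfile
from pathlib import Path


def get_char_lens_utf8(data_file):
    # Hypothetical variant of the failing line in utils.py:
    # same logic, but with the encoding passed explicitly (assumes UTF-8 data).
    with Path(data_file).open(encoding="utf-8") as f:
        return [len(x) for x in f.readlines()]


# One line of text containing curly quotes (U+201C and U+201D).
sample = "He said \u201chello\u201d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.source"  # assumed file name, for illustration only
    path.write_bytes(sample.encode("utf-8"))

    # Decoding the UTF-8 bytes as cp1252 hits byte 0x9d, which cp1252 does not define.
    try:
        path.read_text(encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # With the encoding given explicitly, the same file reads without error.
    lens = get_char_lens_utf8(path)

print(cp1252_failed, lens)
```

If that is the cause, an alternative that avoids editing the script might be to enable Python's UTF-8 mode (the PYTHONUTF8=1 environment variable, available from Python 3.7) before launching finetune.py, though I have not verified this against this exact setup.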
Thanks in advance!