Questions on distilling [from] T5

A couple of questions/thoughts related to the distillation of the text-generation models:

  1. I see that Distill-T5 is not functional.
    https://github.com/huggingface/transformers/blob/master/examples/seq2seq/distillation.py#L299-L301
    Wondering if there is any concrete technical barrier that prevents us from extending the same distillation that worked for BART to T5.

  2. I wonder if it is more feasible to distill from T5 to BART. Any thoughts?

Tagging @sshleifer since I think you might know more on this topic.

For which task?

  1. t5 distillation is very feasible; I just got excited about bart/pegasus since it performed the best in my summarization experiments. There is no feasibility issue.

  2. It is much less feasible to distill from t5 -> bart than to distill from a large finetuned t5 checkpoint to a smaller one.

For which task?

I don’t have any particular task in mind, yet. Just exploring for now.

There is no feasibility issue.

I see … thanks for clarifying it.

I just got excited about bart/pegasus since it performed the best in my summarization experiments

Are you suggesting that you got better results with BART, compared with T5?

Re. distilling on TPUs: I guess one limitation here is that T5-11B (the teacher) would not fit on many common GPUs, right? I wonder if it is possible to pre-extract the teacher logits (say, on a TPU) and just load them in the distiller code. Do you have any thoughts on this issue, @sshleifer?

I wonder if it is possible to pre-extract the teacher logits (say, on a TPU) and just load them in the distiller code.

I think this can be achieved with the datasets library: we can cache the logits along with the examples and, while loading an example, load its corresponding logits as well. The dataset could then return a dict that looks something like this:

{"input_ids": [], "decoder_input_ids": [], "precomputed_logits": []}
  • Yes, the bart variants finetuned on cnn and xsum perform better than the t5 variants (that aren’t finetuned).
  • They are slightly better than finetuned t5 variants.
  • I don’t see any reason to use t5-11b, is it better on some task than t5-large?
  • Note that if you want to use the current distillation code you have to fit teacher, student, and batch_size=1 on a single GPU which is unfortunate.

@valhalla can also just generate teacher pseudolabels (less memory than logits, no new code); I have run some recent experiments on that with good results and will likely check in code for it soon.

How will that work?

  1. Generate pseudo labels from the teacher
  2. Use them as training examples for the student?

Exactly.
Complicated version of that:
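
For the simple version (steps 1 and 2 above), a rough sketch could look like the following; the teacher name, the train.source/train.target file layout of the seq2seq examples, and the generation settings are all assumptions, not the checked-in code:

import os
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "t5-small"  # illustrative; in practice a large finetuned teacher
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name).eval()

# Step 1: generate a pseudo label (teacher summary) for every source document.
with open("cnn_dm/train.source") as f:
    sources = [line.strip() for line in f]

os.makedirs("cnn_dm_pseudo", exist_ok=True)
with torch.no_grad(), open("cnn_dm_pseudo/train.target", "w") as out:
    for i in range(0, len(sources), 8):  # small batches to keep memory bounded
        batch = tok(sources[i:i + 8], truncation=True, max_length=512,
                    padding=True, return_tensors="pt")
        generated = teacher.generate(**batch, num_beams=2, max_length=142)
        for text in tok.batch_decode(generated, skip_special_tokens=True):
            out.write(text.replace("\n", " ") + "\n")

# Step 2: copy train.source (and val/test files) into cnn_dm_pseudo and finetune the
# student on that directory as if the pseudo labels were gold references.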

Hello, I am trying to run your distillation code with T5. As a POC, I am just trying to distill from t5-small to t5-small before doing actual work. My script looks like the following:

--teacher t5-small --data_dir $CNN_DIR \
--student_decoder_layers 6 --student_encoder_layers 6 \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--fp16 \
--model_name_or_path t5-small \
--val_check_interval 0.1 \
--output_dir distilt5 \
"$@"

and get the following error:

Traceback (most recent call last):
  File "/home/sumithrab/transformers/src/transformers/configuration_utils.py", line 349, in get_config_dict
    resolved_config_file = cached_path(
  File "/home/sumithrab/transformers/src/transformers/file_utils.py", line 832, in cached_path
    raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file t5-small/config.json not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "distillation.py", line 297, in <module>
    distill_main(args)
  File "distillation.py", line 287, in distill_main
    model = create_module(args)
  File "distillation.py", line 254, in create_module
    model = module_cls(args)
  File "distillation.py", line 42, in __init__
    teacher = AutoModelForSeq2SeqLM.from_pretrained(hparams.teacher).eval()
  File "/home/sumithrab/transformers/src/transformers/modeling_auto.py", line 1094, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/sumithrab/transformers/src/transformers/configuration_auto.py", line 318, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/sumithrab/transformers/src/transformers/configuration_utils.py", line 368, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 't5-small'. Make sure that:

- 't5-small' is a correct model identifier listed on 'https://huggingface.co/models'

- or 't5-small' is the correct path to a directory containing a config.json file

Any clue as to what I am missing?
Am I supposed to first download the (pretrained) t5-small model locally? If so, from where and to what path, and how do I specify the model in this script?

You shouldn’t need to download it locally. What commit of transformers are you at?

AutoModelForSeq2SeqLM.from_pretrained('t5-small')

works for me. Make sure you don’t have a dir called t5-small.

I got T5 working in this PR

Here are some very simple commands that work for me on master.

Yes Teacher/Traditional Distillation

python distillation.py --teacher t5-small --data_dir cnn_dm \
--student_decoder_layers 3 --student_encoder_layers 6 --tokenizer_name t5-small \
--learning_rate=3e-4 --freeze_encoder --no_teacher --freeze_embeds \
--do_train --train_batch_size 32 \
--do_predict --n_train 100 \
--model_name_or_path t5-small --eval_beams 2 --eval_max_gen_length 142 \
--val_check_interval 0.25 --n_val 1000 \
--output_dir distilt5 --gpus 1 --logger_name wandb

No teacher

python make_student.py t5-small t5_small_6_3 6 3

python finetune.py --model_name_or_path t5_small_6_3 --data_dir cnn_dm \
--learning_rate=3e-4 --freeze_encoder --freeze_embeds \
--do_train --train_batch_size 32 \
--do_predict --n_train 100 \
--model_name_or_path t5_small_6_3 --eval_beams 2 --eval_max_gen_length 142 \
--val_check_interval 0.25 --n_val 1000 \
--output_dir distilt5 --gpus 1 --logger_name wandb

Remove --n_train 100 to do slightly better.

Thanks @sshleifer: I pulled the latest and retried, but I still get the same error! :frowning:

I tried your command line (with teacher):

--student_decoder_layers 3 --student_encoder_layers 6 --tokenizer_name t5-small \
--learning_rate=3e-4 --freeze_encoder --no_teacher --freeze_embeds \
--do_train --train_batch_size 32 \
--do_predict --n_train 100 \
--model_name_or_path t5-small --eval_beams 2 --eval_max_gen_length 142 \
--val_check_interval 0.25 --n_val 1000 \
--output_dir distilt5 --gpus 1 --logger_name wandb

I get:

Traceback (most recent call last):
  File "/home/sumithrab/transformers/src/transformers/configuration_utils.py", line 349, in get_config_dict
    resolved_config_file = cached_path(
  File "/home/sumithrab/transformers/src/transformers/file_utils.py", line 832, in cached_path
    raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file t5-small/config.json not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "distillation.py", line 306, in <module>
    distill_main(args)
  File "distillation.py", line 296, in distill_main
    model = create_module(args)
  File "distillation.py", line 263, in create_module
    model = module_cls(args)
  File "/home/sumithrab/transformers/examples/seq2seq/finetune.py", line 63, in __init__
    super().__init__(hparams, num_labels=None, mode=self.mode, **kwargs)
  File "/home/sumithrab/transformers/examples/lightning_base.py", line 83, in __init__
    self.config = AutoConfig.from_pretrained(
  File "/home/sumithrab/transformers/src/transformers/configuration_auto.py", line 318, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/sumithrab/transformers/src/transformers/configuration_utils.py", line 368, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 't5-small'. Make sure that:

- 't5-small' is a correct model identifier listed on 'https://huggingface.co/models'

- or 't5-small' is the correct path to a directory containing a config.json file

Per the instructions, once I got the repo I ran pip install -e . and then, from the examples folder, pip install -r requirements.txt.

If I were in your position, I would try again after rm -rf t5-small

Then verify in a Python REPL whether
AutoConfig.from_pretrained('t5-small')
works. If it still fails, make a reproducible GitHub issue, including the transformers-cli env output.
This is not a distillation issue, and I can’t reproduce it on master.
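
For reference, a minimal version of that check (run from a directory that does not contain a local t5-small folder, so the name resolves to the hub model rather than a local path):

from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("t5-small")                   # should fetch the config from the model hub
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()
print(type(config).__name__, type(model).__name__)                # expect: T5Config T5ForConditionalGeneration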

Thank you so much! I actually didn’t realize I had a t5-small directory – I deleted it and training seems to be running fine!


According to your command lines, the --no_teacher option is passed in the “Yes Teacher” command but dropped when training the no-teacher model. Is there a mistake?