BORT: Optimal Subarchitecture Extraction for BERT

Hi guys,
I'm wondering whether anyone has read the new paper from the Alexa team on reducing the size of BERT.

If you have any thoughts on it or would like to discuss it, please comment here.


Super interesting, thanks for sharing!! Perhaps @VictorSanh can give us the best comments :smiley:

I wonder whether the same technique can be applied efficiently to giant models like T5-11B and GPT-3.