Hi, is there any info available about which model compression method was used to create tiny-tapas (pruning, distillation, etc.)? There is no info on the model card and I have been unsuccessful in finding any online.
Hi,
There’s no distillation happening there. It’s simply the smallest architecture of all the TAPAS variants the authors released: only 2 hidden layers (as can be seen here) and 2 attention heads, compared to the base model, which uses 12 hidden layers and 12 attention heads.
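You can see the difference yourself by comparing the configurations. A minimal sketch using the `transformers` library, constructing the configs locally (the exact `hidden_size` of the tiny variant is an assumption here; for the authoritative values, load the checkpoint's config with `AutoConfig.from_pretrained`):

```python
from transformers import TapasConfig

# Tiny variant: 2 hidden layers, 2 attention heads
# (hidden_size=128 is assumed, following the BERT-tiny convention)
tiny = TapasConfig(num_hidden_layers=2, num_attention_heads=2, hidden_size=128)

# Default TapasConfig matches the base model: 12 layers, 12 heads
base = TapasConfig()

print(tiny.num_hidden_layers, tiny.num_attention_heads)  # 2 2
print(base.num_hidden_layers, base.num_attention_heads)  # 12 12
```

So the tiny model was trained directly at this smaller size, rather than being derived from a larger model by pruning or distillation.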