Does anyone know what size vectors the BERT and Transformer-XL models take and output?
For example, I know that bert-large is 24-layer, 1024-hidden, 16-heads per block, 340M parameters (bert-base has 12 heads per block). Does that mean it takes a vector of size [24, 1024, 16], or am I misunderstanding?
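For context, here is a minimal sketch of how I've been trying to inspect the shapes myself. This assumes the Hugging Face `transformers` library and uses `bert-large-uncased` purely as an example model name:

```python
# Minimal sketch (assumes Hugging Face `transformers` and PyTorch are installed).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased")

# Config values that correspond to the "24-layer, 1024-hidden, 16-heads" description
print(model.config.num_hidden_layers)    # 24
print(model.config.hidden_size)          # 1024
print(model.config.num_attention_heads)  # 16

# Tokenize a sentence and run it through the model
inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)         # [batch_size, seq_len] of token IDs
print(outputs.last_hidden_state.shape)   # [batch_size, seq_len, hidden_size]
```

Is the last line (batch, sequence length, hidden size) the right way to think about the input/output size, rather than [24, 1024, 16]?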
Any help is much appreciated.