You need at least 12GB of GPU RAM to put the model on the GPU, and your GPU has less memory than that, so you won’t be able to use it on the GPU of this machine. You can’t use it in half precision on the CPU either, because not all layers of the model are implemented for half precision (the LayerNorm layer, for instance), so you need to use the model in full precision on the CPU to make predictions (that will take a looooooooong time).
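To see where a figure like 12GB comes from, you can estimate the footprint of the weights alone from the parameter count and the dtype size; the 6-billion-parameter count below is just an assumption for illustration:

```python
def model_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores activations, optimizer state, and framework overhead,
    so actual usage will be higher.
    """
    return n_params * bytes_per_param / 1e9

# Hypothetical 6-billion-parameter model:
n_params = 6_000_000_000
print(model_memory_gb(n_params, 2))  # fp16: 2 bytes/param -> 12.0
print(model_memory_gb(n_params, 4))  # fp32: 4 bytes/param -> 24.0
```

So a model of that size already fills 12GB in half precision, and running it in full precision on the CPU doubles that.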
As for the RAM footprint, we are working on a way for from_pretrained to only consume the model's size in RAM (currently it consumes twice the model size, since both the checkpoint state dict and the freshly initialized model are held in memory during loading). It should be merged soon.