Hello. Using a DeepPavlov transformer, I was surprised to get different embeddings for the same word ‘шагать’. Here is a minimal fictional example showing the essence of the question.
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = 'DeepPavlov/rubert-base-cased-sentence'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
# encode the same word twice and drop the batch dimension
model(**tokenizer('шагать шагать', return_tensors='pt', truncation=True, max_length=512)).last_hidden_state.detach().squeeze()
As I can see, the tokenizer splits the word ‘шагать’ into two tokens: ‘шага’ and ‘##ть’.
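This can be checked directly (continuing from the snippet above). Note that the tokenizer also wraps the input in the special [CLS] and [SEP] tokens, which is why six vectors come back for four word pieces:

# inspect the word-piece split plus the special tokens added around it
ids = tokenizer('шагать шагать')['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
# expected: ['[CLS]', 'шага', '##ть', 'шага', '##ть', '[SEP]']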
The output embeddings are:
tensor([[-0.5780,  0.0937, -0.3210,  ..., -0.3401,  0.0203,  0.4830],
        [-0.6516,  0.0278, -0.3610,  ..., -0.4095,  0.0527,  0.5094],
        [-0.6018,  0.1147, -0.2739,  ..., -0.4194,  0.0580,  0.4853],
        [-0.6632,  0.0110, -0.3995,  ..., -0.3953,  0.0823,  0.4497],
        [-0.6711,  0.1017, -0.2829,  ..., -0.3797,  0.0994,  0.4285],
        [-0.6337,  0.0572, -0.3519,  ..., -0.3553,  0.0126,  0.4479]])
I expected that vector 1 ([-0.6516, 0.0278, -0.3610, ..., -0.4095, 0.0527, 0.5094], which I guess corresponds to ‘шага’) would be equal to vector 3, but I see different values. The same holds for the pair of vectors 2 and 4 (‘##ть’).
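A quick numeric check confirms this (a sketch continuing from the code above; out is just an assumed name for the squeezed tensor, so rows 1 and 3 are the two ‘шага’ vectors and rows 2 and 4 the two ‘##ть’ vectors):

import torch

out = model(**tokenizer('шагать шагать', return_tensors='pt', truncation=True, max_length=512)).last_hidden_state.detach().squeeze()
print(torch.allclose(out[1], out[3]))  # False: the two ‘шага’ vectors are not identical
print(torch.allclose(out[2], out[4]))  # False: same for the two ‘##ть’ vectors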
I guess this is the result of my misunderstanding of how the model works. Please explain what is wrong in my understanding…