Until now, I thought that the word2vec part (turning tokens into vectors) was in the tokenizer.
So I assumed the tokenizer output would be 3D, [batch_size, seq_len, vec], like this:
[['labels', torch.Size([16])], ['input_ids', torch.Size([16, 512,256])], ['token_type_ids', torch.Size([16, 512])], ['attention_mask', torch.Size([16, 512,256])]]
That is what I expected, but what I actually get is
[['labels', torch.Size([16])], ['input_ids', torch.Size([16, 512])], ['token_type_ids', torch.Size([16, 512])], ['attention_mask', torch.Size([16, 512])]]
In other words, is the word2vec part included in the model?
Is this normal? Also, what is token_type_ids?
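From what I have read, token_type_ids (segment IDs) mark which sentence of a pair each token belongs to, and for single-sentence input they are all 0. A rough illustration of the pattern as I understand it (hand-written, not real tokenizer output; the exact tokens depend on the tokenizer):

```python
# For a sentence pair like ("What is it?", "It is a test."), a BERT-style
# tokenizer produces roughly this layout: 0 for the first segment (including
# [CLS] and its [SEP]), 1 for the second segment.
tokens         = ["[CLS]", "what", "is", "it", "?", "[SEP]", "it", "is", "a", "test", ".", "[SEP]"]
token_type_ids = [0,       0,      0,    0,    0,   0,       1,    1,    1,   1,      1,   1]
assert len(tokens) == len(token_type_ids)
```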
I confirmed this behavior with both “bert-base-cased” and “model/distilbert-base-uncased”.
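To check my understanding that the embedding lookup lives inside the model rather than the tokenizer, I tried a plain-PyTorch sketch (30522 and 768 are the bert-base vocab/hidden sizes; the Embedding here is randomly initialized, not BERT's trained weights):

```python
import torch

vocab_size, hidden = 30522, 768          # bert-base sizes (assumption for illustration)
embedding = torch.nn.Embedding(vocab_size, hidden)

# What the tokenizer returns: integer token IDs, 2D [batch_size, seq_len]
input_ids = torch.randint(0, vocab_size, (16, 512))

# The model-side embedding lookup turns them into 3D [batch_size, seq_len, hidden]
vectors = embedding(input_ids)
print(input_ids.shape)  # torch.Size([16, 512])
print(vectors.shape)    # torch.Size([16, 512, 768])
```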
Preprocessing (dataset: yelp_review_full, model: bert-base-cased):
import torch
from torch.utils.data import DataLoader
from datasets import load_from_disk
from transformers import AutoTokenizer

pre_train_model_name = "bert-base-cased"
# the tokenizer must exist before tokenize_function uses it
tokenizer = AutoTokenizer.from_pretrained(pre_train_model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

dataset_evel = load_from_disk(dataset_URL[1])
tokenized_datasets_evel = dataset_evel.map(tokenize_function, batched=True)
tokenized_datasets_evel = tokenized_datasets_evel.remove_columns(["text"])
tokenized_datasets_evel = tokenized_datasets_evel.rename_column("label", "labels")
tokenized_datasets_evel.set_format("torch")
test_loader = DataLoader(tokenized_datasets_evel, batch_size=batch_size)
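To see where the shapes above come from, I also checked how a DataLoader collates a list of dicts, using dummy data of the same shape instead of the real dataset (512 matches the padded length; no download needed):

```python
import torch
from torch.utils.data import DataLoader

# Dummy stand-in for the tokenized dataset: each item is a dict of
# equal-length tensors, just like set_format("torch") produces.
dummy = [
    {"labels": torch.tensor(0), "input_ids": torch.zeros(512, dtype=torch.long)}
    for _ in range(32)
]
loader = DataLoader(dummy, batch_size=16)

# The default collate stacks each key into one tensor per batch.
batch = next(iter(loader))
print(batch["labels"].shape)     # torch.Size([16])
print(batch["input_ids"].shape)  # torch.Size([16, 512])
```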