I have a datasets.Dataset built from a list of texts, which I tokenize using the Dataset.map method:
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer
checkpoint = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
batch_size = 4
texts = [
{ 'text': 'first sentence' },
{ 'text': 'second sentence' },
# ...
{ 'text': 'billionth sentence' }
]
dataset = Dataset.from_list(texts)
tokenized_dataset = dataset.map(
    lambda x: tokenizer(
        x['text'],
        padding=True,
        truncation=True,
        return_tensors='pt'
    ),
    batched=True,
    batch_size=batch_size
)
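Inspecting the first row of the result (my own quick check) shows the tensors have been converted back:
print(type(tokenized_dataset[0]['input_ids']))  # <class 'list'>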
So tokenized_dataset contains lists instead of Tensors. I found this thread and set the format accordingly:
tokenized_dataset.set_format(
'pt',
columns=['input_ids', 'attention_mask'],
output_all_columns=True
)
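As far as I can tell (another quick check on my side), individual rows do now come back as tensors:
print(type(tokenized_dataset[0]['input_ids']))  # <class 'torch.Tensor'>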
After that I tried to run prediction on the whole tokenized_dataset, which failed with an AttributeError:
embeddings = model(tokenized_dataset)
# AttributeError: 'Dataset' object has no attribute 'size'
Could you please point out what I'm doing wrong, and how I can get embeddings for all items in tokenized_dataset in a single call?
PS: I want something like this but with batching:
tokens = tokenizer(
['first sentence', 'second sentence', ..., 'billionth sentence'],
return_tensors='pt',
padding=True,
truncation=True
)
embeddings = model(**tokens)
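For reference, the closest I have is a manual batched loop like the one below (the masked mean pooling is my own assumption for this checkpoint, not something from the docs); I'd prefer not to manage the batching myself:
import torch

all_embeddings = []
with torch.no_grad():
    for start in range(0, len(dataset), batch_size):
        # take the raw texts of one batch and tokenize them on the fly
        batch_texts = dataset[start:start + batch_size]['text']
        tokens = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            return_tensors='pt'
        )
        output = model(**tokens)
        # masked mean pooling over the token embeddings (my assumption)
        mask = tokens['attention_mask'].unsqueeze(-1)
        all_embeddings.append(
            (output.last_hidden_state * mask).sum(1) / mask.sum(1)
        )
embeddings = torch.cat(all_embeddings)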