Pass `Dataset.map` result to model

I have a `datasets.Dataset` built from a list of texts, which I tokenize with the `Dataset.map` method:

from datasets import Dataset
from transformers import AutoModel, AutoTokenizer

checkpoint = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
batch_size = 4

texts = [
    { 'text': 'first sentence' },
    { 'text': 'second sentence' },
#  ...
    { 'text': 'billionth sentence' }
]

dataset = Dataset.from_list(texts)

tokenized_dataset = dataset.map(
    lambda x: tokenizer(
        x['text'],
        padding=True,
        truncation=True,
        return_tensors='pt'
    ),
    batched=True,
    batch_size=batch_size)

tokenized_dataset contains lists instead of tensors. I found this thread, so I set the format:

tokenized_dataset.set_format(
    'pt',
    columns=['input_ids', 'attention_mask'],
    output_all_columns=True
)

After that I tried to run prediction on the whole tokenized_dataset, and it (unsurprisingly) failed with an AttributeError:

embeddings = model(tokenized_dataset)
# AttributeError: 'Dataset' object has no attribute 'size'

Could you please point out what I'm doing wrong and how to get embeddings for all items in tokenized_dataset in a single call?

PS: I want something like this but with batching:

tokens = tokenizer(
    ['first sentence', 'second sentence', ..., 'billionth sentence'],
    return_tensors='pt',
    padding=True,
    truncation=True
)
embeddings = model(**tokens)

Hi! Passing a HF dataset as input to a HF model is not supported. Instead, you can do the following to get the embeddings in a single map call and add them as a column to the dataset:

def add_embeddings(batch):
    # tokenize the batch and run it through the model in one step
    output = model(
        **tokenizer(
            batch['text'],
            padding=True,
            truncation=True,
            return_tensors='pt'
        )
    )
    return {"embeddings": output.pooler_output}

dataset_with_embeddings = dataset.map(add_embeddings, batched=True, batch_size=batch_size)
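
If you then need the embeddings as a single tensor rather than a column of lists, one option (a minimal usage sketch, reusing the names above) is to set the torch format on the new column and index it:

dataset_with_embeddings.set_format('pt', columns=['embeddings'], output_all_columns=True)
embeddings = dataset_with_embeddings['embeddings']  # tensor of shape (num_rows, hidden_size)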

Hi @mariosasko,
Yes, this is the approach I had in mind. But I want to run inference on the whole dataset at once (in my case the dataset contains a limited number of items).

The reason is that I want to find the best batch_size for tokenization separately (without running inference).

So the only solution I came up with (and it looks really weird) is below:

import torch
import torch.nn.functional as F

tokenized_dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt'),
    batched=True,
    batch_size=batch_size)
tokenized_dataset.set_format('pt', columns=['input_ids', 'attention_mask'])

# each batch was padded independently, so sequences have different lengths;
# re-pad everything to the longest sequence in the dataset
max_len = max(len(x) for x in tokenized_dataset['input_ids'])

reshaped_inputs = [x.reshape(1, -1) for x in tokenized_dataset['input_ids']]
reshaped_attention_mask = [x.reshape(1, -1) for x in tokenized_dataset['attention_mask']]

pad_token_id = tokenizer.pad_token_id
input_ids = [F.pad(x, (0, max_len - len(x[0]), 0, 0), value=pad_token_id) for x in reshaped_inputs]
attention_masks = [F.pad(x, (0, max_len - len(x[0]), 0, 0), value=0) for x in reshaped_attention_mask]

new_dataset = {
    'input_ids': torch.cat(input_ids),
    'attention_mask': torch.cat(attention_masks)
}

predictions = model(**new_dataset)
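
For reference, the manual re-padding step could probably be done more compactly with tokenizer.pad, which pads a dict of encoded inputs to the longest sequence and can return tensors. A sketch under the assumption that the tokenized columns are left as plain Python lists (no set_format call):

# tokenize without per-batch tensors; columns stay as Python lists
tokenized_dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], padding=True, truncation=True),
    batched=True,
    batch_size=batch_size)

# let the tokenizer re-pad the whole dataset to its longest sequence
batch = tokenizer.pad(
    {'input_ids': tokenized_dataset['input_ids'],
     'attention_mask': tokenized_dataset['attention_mask']},
    padding=True,
    return_tensors='pt')

predictions = model(**batch)

This keeps the tokenization batch_size independent of the inference step, which is what I want to measure.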