BERT embeddings on big dataset

Hello friends, I am looking to get BERT embeddings on a dataset with ~20M rows. I’m able to get padded lists of tokens using:

from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_dataset = dataset_1.map(lambda x: tokenizer(x["charge_metadata__email"], padding="longest"), batched=True)

But I can't get past this point. I'm trying to turn the lists of tokens and attention masks into tensors for model inference, but I'm getting this error:

token_ids = tokenized_dataset.map(lambda x: torch.tensor(x["input_ids"]).unsqueeze(0), batched=True)

TypeError: Provided function which is applied to all elements of table returns a variable of type <class 'torch.Tensor'>. Make sure provided function returns a variable of type dict (or a pyarrow table) to update the dataset or None if you are only interested in side effects.

Curious if this is the right approach.

Hi @simonberrebi,
I'm not sure I follow completely, but tokenized_dataset is a DatasetDict, right?

If you want, you can remove irrelevant columns with: tokenized_dataset.map(remove_columns=["blabla","blabla2"])

But I'm not sure you need it. If you're trying to train a new model, why don't you use a data_collator, as explained here:

Edit:
It looks like you don't even need a data_collator.

Note that when you pass the tokenizer as we did here, the default data_collator used by the Trainer will be a DataCollatorWithPadding as defined previously, so you can skip the line data_collator=data_collator in this call.
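In other words, something roughly like this (just a sketch: model_for_training is a hypothetical model with a task head, e.g. BertForSequenceClassification, and a DatasetDict with a "train" split is assumed):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="test-trainer")

trainer = Trainer(
    model=model_for_training,                  # hypothetical model with a task head, not the bare BertModel
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # assumes a DatasetDict with a "train" split
    tokenizer=tokenizer,                       # with the tokenizer passed in, the default collator is DataCollatorWithPadding
)
trainer.train()                                # no data_collator=... needed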


Hey @simonberrebi,
I think you should return a dict instead of a tensor (as the error message suggests). If you put your tensors inside a dict, it should work as expected. Let us know whether it worked 🙂
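For example, a rough sketch of the dict-returning approach for getting the embeddings (tokenizer.pad, the [CLS] pooling and the batch size here are just one way to do it, not the only one):

import torch

model.eval()

def embed(batch):
    # pad the batch to a uniform length and get PyTorch tensors back
    inputs = tokenizer.pad(
        {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]},
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # return a dict so that datasets stores the result as a new column
    return {"embeddings": outputs.last_hidden_state[:, 0, :].numpy()}  # [CLS] vector per row

embedded_dataset = tokenized_dataset.map(embed, batched=True, batch_size=32)

(Mean pooling over the attention mask is another common choice for sentence embeddings.)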
Best,
M


That works great, thank you friends!
