BERT embeddings on big dataset

Hello friends, I am looking to get BERT embeddings on a dataset with ~20M rows. I’m able to get padded lists of tokens using:

from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_dataset = dataset_1.map(lambda x: tokenizer(x["charge_metadata__email"], padding="longest"), batched=True)

But I can't get past this point. I'm trying to turn the lists of tokens and attention masks into tensors for model inference, but I'm getting this error:

token_ids = tokenized_dataset.map(lambda x: torch.tensor(x["input_ids"]).unsqueeze(0), batched=True)

TypeError: Provided function which is applied to all elements of table returns a variable of type <class 'torch.Tensor'>. Make sure provided function returns a variable of type dict (or a pyarrow table) to update the dataset or None if you are only interested in side effects.

Curious if this is the right approach.

Hi @simonberrebi,
I'm not sure I follow completely, but tokenized_dataset is a DatasetDict, right?

If you want, you can remove irrelevant columns with: tokenized_dataset.map(remove_columns=["blabla","blabla2"])

But I'm not sure you need it. If you're trying to train a new model, why don't you use a data_collator, as explained here:

Edit:
It looks like you don't even need a data_collator.

Note that when you pass the tokenizer as we did here, the default data_collator used by the Trainer will be a DataCollatorWithPadding as defined previously, so you can skip the line data_collator=data_collator in this call.
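In other words, something roughly like this (just a sketch: model_for_training is a hypothetical model with a task head, e.g. BertForSequenceClassification, and a DatasetDict with a "train" split is assumed):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="test-trainer")

trainer = Trainer(
    model=model_for_training,                  # hypothetical model with a task head, not the bare BertModel
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # assumes a DatasetDict with a "train" split
    tokenizer=tokenizer,                       # with the tokenizer passed in, the default collator is DataCollatorWithPadding
)
trainer.train()                                # no data_collator=... needed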


Hey @simonberrebi,
I think you should return a dict instead of a tensor (as the error message suggests). If you put your tensors inside a dict, it should work as expected. Let us know whether it worked 🙂
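For example, a rough sketch of the dict-returning approach for getting the embeddings (tokenizer.pad, the [CLS] pooling and the batch size here are just one way to do it, not the only one):

import torch

model.eval()

def embed(batch):
    # pad the batch to a uniform length and get PyTorch tensors back
    inputs = tokenizer.pad(
        {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]},
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # return a dict so that datasets stores the result as a new column
    return {"embeddings": outputs.last_hidden_state[:, 0, :].numpy()}  # [CLS] vector per row

embedded_dataset = tokenized_dataset.map(embed, batched=True, batch_size=32)

(Mean pooling over the attention mask is another common choice for sentence embeddings.)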
Best,
M


That works great, thank you friends!
