Hello friends, I am looking to get BERT embeddings on a dataset with ~20M rows. I’m able to get padded lists of tokens using:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# padding="longest" with batched=True pads each batch to its own longest sequence
tokenized_dataset = dataset_1.map(lambda x: tokenizer(x["charge_metadata__email"], padding="longest"), batched=True)
But I can't get past this point. I'm trying to turn the lists of tokens and attention masks into tensors for model inference, but I'm getting this error:
token_ids = tokenized_dataset.map(lambda x: torch.tensor(x["input_ids"]).unsqueeze(0), batched=True)
TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'torch.Tensor'>. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects.
Curious if this is the right approach.
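For context, here is a minimal sketch of what I think the error means and how I might work around it. The `input_ids`/`attention_mask` values are hypothetical padded token lists standing in for what the tokenizer `map` produced; the point is that `datasets.map` expects the mapped function to return a dict of columns, not a bare tensor, so the tensor conversion should happen outside `map`:

```python
import torch

# Hypothetical batch of already-tokenized examples, shaped like what the
# tokenizer map call returns: plain Python lists, padded to equal length.
batch = {
    "input_ids": [[101, 7592, 102, 0], [101, 2088, 2003, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}

# datasets.map requires the function to return a dict (one entry per column),
# not a torch.Tensor -- hence the TypeError. Instead of a second .map, convert
# the lists to tensors at inference time:
input_ids = torch.tensor(batch["input_ids"])
attention_mask = torch.tensor(batch["attention_mask"])

print(input_ids.shape)  # torch.Size([2, 4])
```

Alternatively, I believe `tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask"])` makes the dataset return torch tensors directly when indexed, which avoids the extra conversion step entirely.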