Access word piece tokens from BERT tokenized dataset

I am new to Hugging Face and I am trying to figure out if it is possible to get the word piece tokens when mapping a dataset to a BERT tokenizer. This is what I have right now:

from transformers import AutoTokenizer
from datasets import Dataset

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function_testing(examples):
    return bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True)

dataset = {"sentence_ids": ["123", "432", "443"],
           "labels": [3.0, 1.5, 1.0],
           "sentences": ["They stopped by infrequently",
                         "They are characteristically late",
                         "They live far away"]}

dataset = Dataset.from_dict(dataset)

tokenized_data = dataset.map(tokenize_function_testing)

I was hoping tokenized_data would give me access to what you get when you call tokens():

bert_tokenizer("They stopped by infrequently").tokens()
['[CLS]', 'they', 'stopped', 'by', 'in', '##fr', '##e', '##quent', '##ly', '[SEP]']

Is there any way to get this information without mapping the dataset a second time just to write bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True).tokens() into a new column? I might not be thinking about this correctly, so feel free to point out if I missed something obvious. Thanks!

You can update your function to add the tokens alongside the encoded text:

def tokenize_function_testing(examples):
    encoding = bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True)
    # add the word piece tokens corresponding to the input ids
    encoding["tokens"] = bert_tokenizer.convert_ids_to_tokens(encoding.input_ids)
    return encoding
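
If you later switch to dataset.map with batched=True, examples["sentences"] is a list of strings and encoding["input_ids"] becomes a list of lists, so you would convert each sequence separately. A minimal sketch (the batched function name is just for illustration):

def tokenize_function_testing_batched(examples):
    encoding = bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True)
    # with batched=True, input_ids is a list of lists (one per example),
    # so convert each sequence's ids to tokens on its own
    encoding["tokens"] = [bert_tokenizer.convert_ids_to_tokens(ids) for ids in encoding["input_ids"]]
    return encoding

tokenized_data = dataset.map(tokenize_function_testing_batched, batched=True)

Note that because of padding="max_length", each tokens list will end with [PAD] entries up to length 32.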

Thanks! That’s just what I was looking for!