I am new to huggingface and I am trying to figure out if it is possible to get the word-piece tokens when mapping a dataset with a BERT tokenizer. This is what I have right now:
from transformers import AutoTokenizer
from datasets import Dataset
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function_testing(examples):
    return bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True)
dataset = {"sentence_ids": ["123", "432", "443"],
"labels": [3.0, 1.5, 1.0],
"sentences": ["They stopped by infrequently",
"They are characteristically late",
"They live far away"]}
dataset = Dataset.from_dict(dataset)
tokenized_data = dataset.map(tokenize_function_testing)
And I was hoping that tokenized_data would give me access to what you get when you call tokens():
bert_tokenizer("They stopped by infrequently").tokens()
['[CLS]', 'they', 'stopped', 'by', 'in', '##fr', '##e', '##quent', '##ly', '[SEP]']
Is there any way to get this information into a new column without mapping the dataset a second time with bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True).tokens()? I might not be thinking about this correctly, so feel free to point out if I missed something obvious! Thanks!