I am new to huggingface and I am trying to figure out if it is possible to get the word-piece tokens when mapping a dataset with a BERT tokenizer. This is what I have right now:
from transformers import AutoTokenizer
from datasets import Dataset
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function_testing(examples):
    return bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True)
dataset = {"sentence_ids": ["123", "432", "443"],
"labels": [3.0, 1.5, 1.0],
"sentences": ["They stopped by infrequently",
"They are characteristically late",
"They live far away"]}
dataset = Dataset.from_dict(dataset)
tokenized_data = dataset.map(tokenize_function_testing)
And I was hoping that tokenized_data would give me access to what you get when you call tokens():
bert_tokenizer("They stopped by infrequently").tokens()
['[CLS]', 'they', 'stopped', 'by', 'in', '##fr', '##e', '##quent', '##ly', '[SEP]']
Is there any way to get this information into a new column without mapping the dataset a second time with bert_tokenizer(examples["sentences"], padding="max_length", max_length=32, truncation=True).tokens()? I might not be thinking about this correctly, so feel free to point out if I missed something obvious! Thanks!