Let’s say my training dataset is just one super long string. What is the correct way to tokenize this?
I have this so far:
```python
trainenc = tokenizer(train_dataset['text'], return_tensors='pt', max_length=128, truncation=True, padding=True, return_overflowing_tokens=True)
```
Which of these arguments should I keep? Afterwards, how do I split my long list of tokens into batches where each element of the batch is short enough to fit in the model's context window?
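Is the right idea to tokenize the whole string once with no truncation or padding, and then reshape the flat list of token ids into fixed-length blocks myself? Here is a rough sketch of what I had in mind (I'm using the GPT-2 tokenizer and block_size=128 purely as placeholders for my actual setup):

```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder for my real tokenizer

long_text = train_dataset["text"]  # my one super long training string

# Tokenize the whole string once, with no truncation and no padding.
# (This warns that the sequence exceeds the model max length, which I assume is fine here?)
ids = tokenizer(long_text)["input_ids"]

block_size = 128                      # placeholder for the model's real context length
n_blocks = len(ids) // block_size     # drop the leftover tokens at the end

# Reshape the flat token list into shape (n_blocks, block_size)
input_ids = torch.tensor(
    [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
)
print(input_ids.shape)  # expecting (n_blocks, 128)
```

Is this the intended approach, or should I be relying on the tokenizer's own arguments (like return_overflowing_tokens) to do the chunking for me?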
Thanks