How does the dataset handle long sentences?

Hello,

Given the following code:

from transformers import GPT2Tokenizer
from datasets import load_dataset

tokenizer = GPT2Tokenizer.from_pretrained(xxx)
# GPT-2 has no pad token by default, so padding="max_length" needs one
tokenizer.pad_token = tokenizer.eos_token

def encode(batch):
  # tokenize each line, then pad or truncate it to exactly 64 tokens
  return tokenizer(batch['text'], padding="max_length", add_special_tokens=True, truncation=True, max_length=64)

dataset = load_dataset("text", data_files=['text.txt'])
dataset = dataset['train']
dataset.set_transform(encode)

I believe that each line in ‘text.txt’ will be passed through the ‘encode’ function, which applies the tokenizer and then pads or truncates the result to exactly 64 tokens.
Does this mean that if a line is 100 tokens long, tokens 65-100 will never be used in training?
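A quick check along these lines seems to confirm the truncation at least (the repeated sample line is just an illustrative string that tokenizes to well over 64 tokens):

sample = {'text': ["some long line " * 50]}   # roughly 150 tokens
out = encode(sample)
print(len(out['input_ids'][0]))   # prints 64, everything past token 64 is dropped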

If this is the case, then should I manually split long lines in ‘text.txt’ into multiple lines?

Thanks

Hi! That’s correct. You can use .map() to do so; see how to split long sentences, for example.
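For instance, here is a rough sketch of splitting with a batched .map() (the split_long_lines helper and block_size names are just illustrative, and it assumes the pad token has been set as in your snippet above):

block_size = 64

def split_long_lines(batch):
  # tokenize without truncation, then cut every line into 64-token chunks
  input_ids, attention_mask = [], []
  for ids in tokenizer(batch['text'], add_special_tokens=True)['input_ids']:
    for start in range(0, len(ids), block_size):
      chunk = ids[start:start + block_size]
      pad_len = block_size - len(chunk)
      input_ids.append(chunk + [tokenizer.pad_token_id] * pad_len)
      attention_mask.append([1] * len(chunk) + [0] * pad_len)
  return {'input_ids': input_ids, 'attention_mask': attention_mask}

# batched=True lets one input line produce several output rows;
# remove_columns drops 'text' so the row counts don't have to match
chunked = dataset.map(split_long_lines, batched=True, remove_columns=['text'])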
