How does the dataset handle long sentences?


Given the following code:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(xxx)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def encode(batch):
  return tokenizer(batch['text'], padding="max_length", add_special_tokens=True, truncation=True, max_length=64)

from datasets import load_dataset

dataset = load_dataset("text", data_files=['text.txt'])
dataset = dataset['train']
dataset = dataset.map(encode, batched=True)

I believe that each line in ‘text.txt’ will be passed through the ‘encode’ function, which will apply the tokenizer and then pad or truncate the result to exactly 64 tokens.
Does this mean that if a line is 100 tokens long, tokens 65 to 100 will never be used in training?

If this is the case, then should I manually split long lines in ‘text.txt’ into multiple lines?


Hi! That’s correct. You can use .map() to do this; see how to split long sentences for an example.
