Hi!
If you want to tokenize line by line, you can use this:
max_seq_length = 512
num_proc = 4

def tokenize_function(examples):
    # Remove empty lines
    examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=["text"],
)
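To see what the empty-line filter in tokenize_function keeps, here is a minimal sketch on a made-up batch. The names keep_non_empty and toy_batch are hypothetical and only for illustration; no tokenizer or dataset is involved.

```python
# Hypothetical helper: the same filter expression used in tokenize_function,
# pulled out so it can be tried without a tokenizer.
def keep_non_empty(lines):
    return [line for line in lines if len(line) > 0 and not line.isspace()]

# Made-up batch in the shape that a batched map passes to the function.
toy_batch = {"text": ["First line.", "", "   ", "\n", "Second line."]}

kept = keep_non_empty(toy_batch["text"])
print(kept)  # ['First line.', 'Second line.']
```

The batch shrinks from five rows to two here. As far as I know, a batched map accepts an output with a different number of rows as long as the original columns are removed, which is what remove_columns=["text"] does.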
Note that TextDataset processed the data differently: it concatenated all the texts and built blocks of size 512. If you need this behavior, apply an additional map function after the tokenization:
# Main data processing function that will concatenate all texts from
# our dataset and generate chunks of max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_seq_length.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# Note that with `batched=True`, this map processes 1,000 texts together,
# so group_texts throws away a remainder for each of those groups of 1,000 texts.
# You can adjust that batch_size here, but a higher value might be slower to preprocess.
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)
This code comes from the data processing in the run_mlm.py example script of transformers.
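To make the chunking concrete, here is a small sketch that runs the same grouping logic on a made-up batch. It uses a block size of 4 instead of 512 so the result is easy to read; block_size, group_texts_toy and toy_batch are hypothetical names for this illustration only.

```python
block_size = 4  # stand-in for max_seq_length, kept tiny for readability

def group_texts_toy(examples):
    # Concatenate the per-text lists of each column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Round down to a multiple of block_size, dropping the remainder.
    total_length = (total_length // block_size) * block_size
    # Cut each column into consecutive chunks of block_size.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Made-up tokenized batch: three texts with 3, 5 and 2 tokens (10 in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts_toy(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The 10 tokens yield two blocks of 4, and the last two tokens (9 and 10) are dropped. Blocks also cross text boundaries: the first block contains all of the first text plus the first token of the second one.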