Help understanding how to build a dataset for language modeling as with the old TextDataset

Hello,

I am trying to load a custom dataset that I will then use for language modeling. The dataset consists of a text file with a whole document on each line, meaning that each line exceeds the usual 512-token limit of most models.

I would like to understand the process for building a text dataset that first splits the documents into chunks of a "tokenizable" size and then tokenizes them, the way the old TextDataset class did. With TextDataset you only had to do the following, and a tokenized dataset with no text loss was ready to pass to a DataCollator:

from transformers import AutoTokenizer, TextDataset

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)

For now, what I have is the following, which of course raises a warning because each line is longer than the model's maximum sequence length:

import datasets
from transformers import AutoTokenizer

dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

tokenized_datasets

So what would be the "standard" way of creating a dataset in the way it was done before?
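
For reference, the closest thing I have found so far is the chunking pattern used in the transformers language-modeling examples (run_mlm.py), which concatenates the tokenized lines and re-splits the token stream into block_size chunks so that nothing is lost at line boundaries. A rough sketch of that, building on the tokenized_datasets from above (the group_texts name is just what those examples use; I am not sure this is the intended replacement for TextDataset):

from itertools import chain

block_size = 512

def group_texts(examples):
    # Concatenate all tokenized lines into one long stream, then cut it
    # into block_size chunks; only the final remainder is dropped.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

lm_dataset = tokenized_datasets.map(group_texts, batched=True, num_proc=4)

Is this roughly what the new workflow is supposed to look like, or is there a cleaner built-in way?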

Thank you very much for the help :))