Help understanding how to build a dataset for language modeling as with the old TextDataset

Hello,

I am trying to load a custom dataset that I will then use for language modeling. The dataset consists of a text file that has a whole document in each line, meaning that each line exceeds the usual 512-token limit of most tokenizers.

I would like to understand the process for building a text dataset that tokenizes each line after first splitting the documents in the dataset into lines of a “tokenizable” size, as the old TextDataset class did. There you only had to do the following, and a tokenized dataset with no text loss was ready to pass to a DataCollator:

model_checkpoint = 'distilbert-base-uncased'

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

from transformers import TextDataset

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)

For now, what I have is the following, which, of course, throws an error/warning because each line is longer than the maximum block size in the tokenizer:

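import datasets
from transformers import AutoTokenizer

# Use the generic "text" loader: each line of the file becomes one example in the "text" column.
dataset = datasets.load_dataset("text", data_files="path/to/text_file.txt")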

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

tokenized_datasets

So what would be the “standard” way of creating a dataset in the way it was done before?

Thank you very much for the help :))

Hi !

If you want to tokenize line by line, you can use this:

max_seq_length = 512
num_proc = 4

def tokenize_function(examples):
    # Remove empty lines
    examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=["text"],
)
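To make the effect of truncation=True concrete, here is a minimal sketch with a toy whitespace "tokenizer". The function toy_tokenize and the sample document are made up for illustration and are not part of transformers; the point is only that everything past max_seq_length in a line is discarded.

```python
# Toy stand-in for a tokenizer: one "token" per whitespace-separated word.
# This is NOT the transformers API, only an illustration of truncation.
def toy_tokenize(text, max_length=None):
    tokens = text.split()
    if max_length is not None:
        tokens = tokens[:max_length]  # mimics truncation=True with max_length
    return tokens

document = " ".join(f"w{i}" for i in range(10))  # one line with 10 "tokens"
max_seq_length = 4

kept = toy_tokenize(document, max_length=max_seq_length)
dropped = len(toy_tokenize(document)) - len(kept)

print(kept)     # ['w0', 'w1', 'w2', 'w3']
print(dropped)  # 6
```

So line-by-line tokenization with truncation is fine for short lines, but for one long document per line most of the text would be lost.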

Note, though, that TextDataset processed the data differently: it concatenated all the texts and built blocks of size 512. If you need this behavior, tokenize without truncation=True (so that no text is dropped before grouping) and then apply an additional map function after the tokenization:

# Main data processing function that will concatenate all texts from
# our dataset and generate chunks of max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop,
    # you can customize this part to your needs.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# Note that with `batched=True`, this map processes 1,000 texts together,
# so group_texts throws away a remainder for each of those groups of 1,000 texts.
# You can adjust that batch_size here but a higher value might be slower to preprocess.

tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)

This code comes from the preprocessing in the run_mlm.py example script of transformers.
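As a quick sanity check, here is the same group_texts logic run on a small hand-made batch. The token ids and the block size of 4 are made-up values chosen so the result is easy to follow; a real batch would come from the tokenizer.

```python
max_seq_length = 4  # tiny block size, for illustration only

def group_texts(examples):
    # Concatenate every column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split into blocks of max_seq_length.
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

# Hypothetical batch: three tokenized lines of lengths 3, 5 and 2 (10 tokens in total).
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The three lines become two full blocks, and the last two tokens (9 and 10) are the remainder that gets dropped for this batch.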

Thanks, this was what I was looking for!! :hugs:

I managed to find this code in this example, but I was not sure how to adapt it.

Thank you for this @lhoestq. Could you please explain the benefit of doing this:

#We drop the small remainder, we could add padding if the model supported it instead of this drop,
#you can customize this part to your needs.
total_length = (total_length // max_seq_length) * max_seq_length

Why not just skip that statement ? Thank you.

Hi ! This statement makes all your input samples have the same length, equal to max_seq_length.
It crops the end of each batch; otherwise the last sample of the batch would be shorter than max_seq_length.

So you can remove this statement, but in that case you may need to pad the last sample so that it has a length of max_seq_length.
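For reference, a padded variant could look roughly like the sketch below. This is only one possible way to do it: the function name group_texts_with_padding, the PAD_ID constant (a stand-in for tokenizer.pad_token_id) and the per-column fill values are assumptions for illustration, not code from run_mlm.py.

```python
max_seq_length = 4  # tiny block size, for illustration only
PAD_ID = 0          # assumption: stand-in for tokenizer.pad_token_id

# Assumed fill value per column; columns not listed here are filled with 0.
PAD_VALUES = {"input_ids": PAD_ID, "attention_mask": 0, "special_tokens_mask": 1}

def group_texts_with_padding(examples):
    # Concatenate every column into one long list, as in group_texts.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    result = {}
    for k, t in concatenated.items():
        # Keep the remainder this time instead of dropping it.
        chunks = [t[i : i + max_seq_length] for i in range(0, len(t), max_seq_length)]
        if chunks and len(chunks[-1]) < max_seq_length:
            missing = max_seq_length - len(chunks[-1])
            chunks[-1] = chunks[-1] + [PAD_VALUES.get(k, 0)] * missing
        result[k] = chunks
    return result

# Hypothetical batch: three tokenized lines, 10 tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

padded = group_texts_with_padding(batch)
print(padded["input_ids"])       # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
```

Setting attention_mask to 0 on the padded positions is what tells the model to ignore them, so no text is lost and all samples still have the same length.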


Hi @lhoestq, I know this is an old thread, but I have a follow-up question. If you tokenize and then group as suggested above, some BOS and EOS tokens will end up in the middle of the input_ids sequence. For example, if the max length is 128 and you combine a 100-token sequence with the first 28 tokens of the following sequence, then element 100 (with 0-based indexing) will be a BOS token.

Is this problematic? Does MLM require a bos token at the beginning or eos at the end? Does it need any special tokens?

Hi ! Indeed there are BOS and EOS tokens in the middle of the tokenization, but we mask them using the special_tokens_mask passed to DataCollatorForLanguageModeling.
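Here is a simplified, pure-Python sketch of the idea: positions flagged in special_tokens_mask are never picked as MLM targets. The function toy_mlm_mask and the MASK_ID value are made up for illustration. As far as I understand, the real DataCollatorForLanguageModeling additionally replaces some selected tokens with random tokens or leaves them unchanged, which is left out here.

```python
import random

MASK_ID = 103   # assumption: stand-in for tokenizer.mask_token_id
IGNORE = -100   # label value that PyTorch's cross-entropy loss ignores by default

def toy_mlm_mask(input_ids, special_tokens_mask, mlm_probability=0.15, seed=0):
    """Replace non-special tokens with MASK_ID with probability mlm_probability."""
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and rng.random() < mlm_probability:
            masked_inputs.append(MASK_ID)  # the model has to predict this position
            labels.append(token_id)
        else:
            masked_inputs.append(token_id)  # special tokens always end up here
            labels.append(IGNORE)
    return masked_inputs, labels

# Hypothetical grouped block with a sequence boundary in the middle
# (101 and 102 play the role of BOS/EOS-like special tokens).
input_ids = [101, 7, 8, 102, 101, 9, 10, 102]
special_tokens_mask = [1, 0, 0, 1, 1, 0, 0, 1]

# mlm_probability=1.0 makes the outcome deterministic for the demo.
masked_inputs, labels = toy_mlm_mask(input_ids, special_tokens_mask, mlm_probability=1.0)
print(masked_inputs)  # [101, 103, 103, 102, 101, 103, 103, 102]
print(labels)         # [-100, 7, 8, -100, -100, 9, 10, -100]
```

Note that the special tokens stay in the input unchanged: the sketch only controls which positions become prediction targets and does not touch the attention_mask.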

"we mask them using the special_tokens_mask passed to DataCollatorForLanguageModeling."

Ah that makes sense.

Is it problematic that there will also be some sequences that do not start with a BOS token or end in an EOS token?

Edit: I think this is a bad question because it looks like all special tokens are masked, regardless of location.

New question: Does MLM not require special tokens at all? If so, why not tokenize without them?

Edit 2: It looks like the special_tokens_mask prevents special tokens from being replaced with a [MASK] token, but I don’t think they are masked (using the attention_mask) when they are used as inputs to the model. Am I wrong?