Hello,
I am working on a streamed dataset in which the input examples are tokenized, concatenated together, and then split into sequences of exactly 2048 tokens, so that no padding tokens are needed. An example can be split in the middle across two sequences. Inside the processing function, any leftover tokens that do not fill a complete 2048-token sequence are discarded, and I use drop_last=True in the DataLoader to drop the final incomplete batch. I apply this processing function to the input examples with .map().
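To make the chunking concrete, here is a minimal toy sketch of the concatenate-and-chunk step (using a sequence length of 4 and made-up token ids, not my real data or tokenizer output):

# Toy illustration of the concatenate-and-chunk idea, with seq_length=4 instead of 2048.
from itertools import chain

seq_length = 4
# Pretend these are the token ids of three tokenized examples in one batch.
tokenized_batch = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]

stream = list(chain(*tokenized_batch))             # [1, 2, ..., 10]
usable = (len(stream) // seq_length) * seq_length  # 8: the last 2 tokens are dropped
chunks = [stream[i:i + seq_length] for i in range(0, usable, seq_length)]
# chunks == [[1, 2, 3, 4], [5, 6, 7, 8]] -> fixed-length sequences, no padding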
Is this the correct way to do this? When training the model with streaming=True, the training loss is unstable and spikes. With the same data loading and processing but the entire dataset loaded into memory, the instability disappears and the training loss becomes smooth. Is anyone able to offer advice, or point me to information on improving the data loading and processing method?
This is the code I have for processing, tokenizing, and loading the data into the DataLoader:
from itertools import chain

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer, default_data_collator

tokenizer = GPT2Tokenizer(vocab_file='/token/vocab.json',
                          merges_file='/token/merges.txt')

ds = load_dataset("lvwerra/github-code",
                  streaming=True,
                  split="train",
                  languages=["Python"])

shuffled_dataset = ds.shuffle(seed=42,
                              buffer_size=10_000)

def tokenize(examples):
    seq_length = 2048
    examples = tokenizer(examples["code"])
    # Concatenate every tokenized example in the batch into one long stream per key.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the tail of the batch that does not fill a full 2048-token sequence.
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length
    # Split the concatenated stream into chunks of exactly seq_length tokens.
    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

tokenized_dataset = shuffled_dataset.map(
    tokenize,
    batched=True,
    remove_columns=['code', 'language', 'path', 'repo_name', 'size', 'license']
)

dataset = tokenized_dataset.with_format("torch")

dataloader = DataLoader(dataset,
                        drop_last=True,
                        collate_fn=default_data_collator,
                        batch_size=8,
                        )
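As a sanity check I pull a single batch and look at its shape; this snippet assumes the tokenizer also returns an attention_mask, and is only meant to confirm that every batch is 8 x 2048 with no padding:

# Quick sanity check (illustrative): each batch should be exactly 8 x 2048,
# since short remainders are dropped inside tokenize().
batch = next(iter(dataloader))
print(batch["input_ids"].shape)       # expected: torch.Size([8, 2048])
print(batch["attention_mask"].min())  # expected: tensor(1), i.e. no padding tokens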