To tokenize lazily, I extracted the tokenize function into a class:
> class LazyTokenize:
>     def __init__(self, tokenizer, text_column_name, padding, max_seq_length):
>         self.tokenizer = tokenizer
>         self.text_column_name = text_column_name
>         self.padding = padding
>         self.max_seq_length = max_seq_length
>
>     def tokenize_function(self, examples):
>         # Remove empty lines
>         examples[self.text_column_name] = [
>             line for line in examples[self.text_column_name] if len(line) > 0 and not line.isspace()
>         ]
>         return self.tokenizer(
>             examples[self.text_column_name],
>             padding=self.padding,
>             truncation=True,
>             max_length=self.max_seq_length,
>             return_special_tokens_mask=True,
>         )
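For context, calling the transform directly on a hand-made batched dict (keyed by the text column, as with_transform passes it) seems to work as expected. The sample texts below are made up for illustration; tokenizer, text_column_name, padding, and max_seq_length come from my script:
> # Sanity check of the transform on a small illustrative batch
> sample_batch = {text_column_name: ["hello world", "", "another line"]}
> encoded = LazyTokenize(tokenizer, text_column_name, padding, max_seq_length).tokenize_function(sample_batch)
> print(encoded.keys())  # e.g. input_ids, attention_mask, special_tokens_mask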
Then I load the dataset like below:
> data_files={'train': ['../data/pretrain/train\\file1.csv', '../data/pretrain/train\\file2.csv', '../data/pretrain/train\\file3.csv'], 'validation': ['../data/pretrain/eval\\file1.csv', '../data/pretrain/eval\\file2.csv', '../data/pretrain/eval\\file3.csv']}
> raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
> lazy_tokenizer = LazyTokenize(tokenizer, text_column_name, padding, max_seq_length)
> tokenized_datasets = raw_datasets.with_transform(lazy_tokenizer.tokenize_function)
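As I understand it, with_transform applies the tokenization on the fly whenever rows are accessed, so indexing the dataset should already return tokenized fields. This is just a quick check I run, not part of the training script:
> # Accessing a slice should trigger the lazy tokenization
> sample = tokenized_datasets["train"][:2]
> print(sample.keys())  # expecting input_ids / special_tokens_mask instead of the raw text column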
Then I use the Trainer from transformers; the call looks like below:
> train_dataset = tokenized_datasets["train"]
> eval_dataset = tokenized_datasets["validation"]
> trainer = Trainer(
>     model=model,
>     args=training_args,
>     train_dataset=train_dataset if training_args.do_train else None,
>     eval_dataset=eval_dataset if training_args.do_eval else None,
>     tokenizer=tokenizer,
>     data_collator=data_collator,
>     callbacks=[early_stopping_callback],
> )
>
> trainer.train(resume_from_checkpoint=checkpoint)
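In case it helps, data_collator and early_stopping_callback are created earlier in my script; a rough sketch of a typical setup is below (the values here are placeholders, not necessarily the ones I use):
> # Hypothetical setup of the collator and callback, shown only for completeness
> from transformers import DataCollatorForLanguageModeling, EarlyStoppingCallback
>
> data_collator = DataCollatorForLanguageModeling(
>     tokenizer=tokenizer,
>     mlm_probability=0.15,  # placeholder value
> )
> early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=3)  # placeholder patience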
Before switching to with_transform, datasets.map worked well with the same tokenize function.
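For comparison, a minimal sketch of that earlier map-based call (the exact arguments in my script may have differed):
> # Previous eager approach: map materializes the tokenized dataset up front
> tokenized_datasets = raw_datasets.map(
>     lazy_tokenizer.tokenize_function,
>     batched=True,
>     remove_columns=[text_column_name],
> )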
Thank you very much for your help!