I noticed that saving the train and test datasets as CSV/JSON and reloading them does not give back the original dataset: in the reloaded dataset, the ner_tags feature is no longer an instance of ClassLabel.
However, when I save the train/test datasets in Arrow format and reload them, the reloaded dataset is the same as the original one, with the label feature still an instance of ClassLabel.
I have modified the run_NER.py file to consume the train/test/validation datasets in the following way:
if data_args.dataset_name is not None:
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(
        data_args.dataset_name,
        data_args.dataset_config_name,
        cache_dir=model_args.cache_dir,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    if "train" in raw_datasets:
        train_dataset = raw_datasets['train']
    if "test" in raw_datasets:
        test_dataset = raw_datasets['test']
    if "validation" in raw_datasets:
        validation_dataset = raw_datasets['validation']
else:
    # data_files = {}
    # if data_args.train_file is not None:
    #     data_files["train"] = data_args.train_file
    # if data_args.validation_file is not None:
    #     data_files["validation"] = data_args.validation_file
    # if data_args.test_file is not None:
    #     data_files["test"] = data_args.test_file
    # extension = data_args.train_file.split(".")[-1]
    # raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
    # Load the Arrow datasets saved earlier with save_to_disk
    # (needs `from datasets import load_from_disk` at the top of the file).
    train_dataset = load_from_disk(data_args.train_file)
    test_dataset = load_from_disk(data_args.test_file)
    if data_args.validation_file is not None:
        validation_dataset = load_from_disk(data_args.validation_file)
In the rest of run_NER.py, I just replaced raw_datasets with the appropriate train_dataset/test_dataset/validation_dataset.
Note: If anyone knows how to combine the two datasets loaded from Arrow files into a single object, please feel free to share the solution. Until then, this hack works.
-Thanks,
Dinesh