Token Classification run_NER.py AttributeError

dineshmane · July 8, 2022, 8:26pm

I noticed that saving the train and test datasets as CSV/JSON and reloading them does not give back the original dataset: in the reloaded dataset, the ner_tags feature is no longer an instance of ClassLabel.
However, when I save the train/test datasets in Arrow format and reload them, the reloaded dataset is the same as the original one, with the label feature still being an instance of ClassLabel.

I have modified run_NER.py to consume the train/test/validation datasets in the following way:

    if data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(
            data_args.dataset_name,
            data_args.dataset_config_name,
            cache_dir=model_args.cache_dir,
            use_auth_token=True if model_args.use_auth_token else None,
        )
        if "train" in raw_datasets:
            train_dataset = raw_datasets['train']
        if "test" in raw_datasets:
            test_dataset = raw_datasets['test']
        if "validation" in raw_datasets:
            validation_dataset = raw_datasets['validation']
    else:
        # data_files = {}
        # if data_args.train_file is not None:
        #     data_files["train"] = data_args.train_file
        # if data_args.validation_file is not None:
        #     data_files["validation"] = data_args.validation_file
        # if data_args.test_file is not None:
        #     data_files["test"] = data_args.test_file
        # extension = data_args.train_file.split(".")[-1]
        # raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

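        # Load datasets that were written with Dataset.save_to_disk (Arrow format).
        # Needs `from datasets import load_from_disk` at the top of the script.
        train_dataset = load_from_disk(data_args.train_file)
        test_dataset = load_from_disk(data_args.test_file)
        if data_args.validation_file is not None:
            validation_dataset = load_from_disk(data_args.validation_file)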

In the rest of run_NER.py, I just replaced raw_datasets with the appropriate train_dataset/test_dataset/validation_dataset.

Note: If anyone knows how to combine the two datasets loaded from Arrow-format files into one, please feel free to drop the solution. Until then, this hack works.

-Thanks,
Dinesh
