Custom dataset fails. "Please pass features or at least one example when writing data"

MarikLviv · March 9, 2025, 7:11pm

Hi, I am using the VS Code AI Toolkit. I generated the default project template, which uses dataset-classification.json as the training dataset. I want to switch to my own custom dataset, so I changed the "data_configs" section of olive-config.json to the following:

"data_configs": [
    {
        "name": "dataset_default_train",
        "type": "HuggingfaceContainer",
        "user_script": "finetuning/qlora_user_script.py",
        "load_dataset_config": {
            "data_name": "json",
            "data_files": "dataset/chat-dataaset.json",
            "split": "train"
        },
        "pre_process_data_config": {
            "dataset_type": "corpus",
            "text_cols": [
                "INSTRUCTION",
                "RESPONSE"
            ],
            "text_template": "<|user|>\n{INSTRUCTION}<|end|>\n<|assistant|>\n{RESPONSE}<|end|>",
            "corpus_strategy": "join",
            "source_max_len": 1024,
            "pad_to_max_len": false,
            "use_attention_mask": false
        }
    }

My dataset (the file name is correct) has the following structure:
{"INSTRUCTION": "Who is the deares of them all?", "RESPONSE": "Maria Smith"}
{"INSTRUCTION": "Who has the nicest and most tender kiss in the world?", "RESPONSE": "Maria Smith the nice"}
{"INSTRUCTION": "who is the best person on earth?", "RESPONSE": "Maria Smith"}

But every time I run arrow_dataset.py, I get an error in builder.py: "Please pass features or at least one example when writing data".
When I change the config back to the default one, everything works. I can't figure out what is wrong with this text_template.
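One thing that may be worth ruling out before blaming the template: in the datasets library this message is typically raised when the writer received no examples at all, so it can help to confirm that the file is found, that it contains at least one parseable record, and that every record has the columns named in text_cols. Below is a minimal sketch using only the Python standard library. The helper name check_jsonl is hypothetical (it is not part of Olive or datasets), and it assumes the file holds one JSON object per line, as in the sample above.

```python
import json
from pathlib import Path


def check_jsonl(path, text_cols, text_template):
    """Parse a JSON Lines file and return the rendered training texts.

    Hypothetical helper for debugging only; not an Olive or datasets API.
    Raises ValueError if a line is not a JSON object, lacks one of the
    expected columns, or if the file contains no records at all.
    """
    rendered = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: not valid JSON ({exc})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = [col for col in text_cols if col not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing columns {missing}")
        # Fill the template the same way the placeholders suggest.
        rendered.append(text_template.format(**record))
    if not rendered:
        raise ValueError("no records found; the file is empty")
    return rendered
```

Calling check_jsonl("dataset/chat-dataaset.json", ["INSTRUCTION", "RESPONSE"], template) from the directory where the fine-tuning job runs either returns the rendered strings or says which line is at fault. A FileNotFoundError here would suggest the relative path resolves differently than expected. Note that json.loads rejects curly quotes, so a file that really contains “ and ” instead of straight quotes would fail this check.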

John6666 · March 10, 2025, 7:05am

Perhaps this is an Olive issue?

github.com/microsoft/Olive

data_config: data_name is not None but olive always said it is None

opened 02:27AM - 10 Jun 24 UTC

closed 08:59AM - 21 Jun 24 UTC

Elizabeth819

**Describe the bug**

File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/si…te-packages/olive/data/component/load_dataset.py", line 33, in huggingface_dataset
    assert data_name is not None, "Please specify the data name"
AssertionError: Please specify the data name

Even if i deleted the assert in source code, it is still wrong

**To Reproduce**
Steps to reproduce the behavior.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Olive config**

"data_configs": [
    {
        "name": "dataset_default_train",
        "type": "HuggingfaceContainer",
        "params_config": {
            "data_name": "json",
            "data_files": "../datasets/datasets.json",
            "split": "train",
            "component_kwargs": {
                "pre_process_data": {
                    "dataset_type": "corpus",
                    "text_cols": [
                        "INSTRUCTION",
                        "RESPONSE",
                        "SOURCE"
                    ],
                    "text_template": "<|user|>\n{INSTRUCTION}<|end|>\n<|assistant|>\n{RESPONSE}\n( source : {SOURCE})<|end|>",
                    "corpus_strategy": "join",
                    "source_max_len": 2048,
                    "pad_to_max_len": false,
                    "use_attention_mask": false
                }
            }
        }
    }
],

**Olive logs**
Add logs here.

**Other information**
- OS: ubuntu 20.04
- Olive version: 0.7.0
- ONNXRuntime package and version: [e.g. onnxruntime-gpu: 1.15.1]

**Additional context**
Add any other context about the problem here.
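The config in this issue nests its settings differently from the one in the original post: here everything sits under params_config with a component_kwargs block, while the post uses load_dataset_config and pre_process_data_config. The sketch below only makes that structural difference visible by listing dotted key paths for abbreviated copies of both configs. It does not establish which layout a given Olive version expects, and key_paths is a hypothetical helper written for this comparison.

```python
def key_paths(node, prefix=""):
    """Yield a dotted path for every non-dict value in nested dicts."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from key_paths(value, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


# Abbreviated copy of the layout from the original post.
post_config = {
    "load_dataset_config": {
        "data_name": "json",
        "data_files": "dataset/chat-dataaset.json",
        "split": "train",
    },
    "pre_process_data_config": {
        "dataset_type": "corpus",
        "corpus_strategy": "join",
    },
}

# Abbreviated copy of the layout from the linked GitHub issue.
issue_config = {
    "params_config": {
        "data_name": "json",
        "data_files": "../datasets/datasets.json",
        "split": "train",
        "component_kwargs": {
            "pre_process_data": {
                "dataset_type": "corpus",
                "corpus_strategy": "join",
            },
        },
    },
}

post_paths = sorted(key_paths(post_config))
issue_paths = sorted(key_paths(issue_config))

for label, paths in (("post", post_paths), ("issue", issue_paths)):
    print(label)
    for path in paths:
        print("   ", path)
```

If the installed Olive version reads data_name from a different location than the config supplies it, the loader could see it as unset, which would match the symptom in the issue title. Checking the documentation for the installed version is the way to confirm the expected layout.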
