Loading nested dataset for training

sourabhY · February 5, 2025, 1:00pm

I am trying to use allenai/pixmo-docs, which has the following structure:

dataset_info:
- config_name: charts
  features:
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: questions
    sequence:
    - name: question
      dtype: string
    - name: answer
      dtype: string

I am using the code below and getting a "list indices must be integers or slices" error, and I don't know what to do. Please help!

def preprocess_function(examples):
    processed_inputs = {
        'input_ids': [],
        'attention_mask': [],
        'pixel_values': [],
        'labels': []
    }

    for img, questions, answers in zip(examples['image'], examples['questions']['question'], examples['questions']['answer']):
        for q, a in zip(questions, answers):
            inputs = processor(images=img, text=q, padding="max_length", truncation=True, return_tensors="pt")

            processed_inputs['input_ids'].append(inputs['input_ids'][0])
            processed_inputs['attention_mask'].append(inputs['attention_mask'][0])
            processed_inputs['pixel_values'].append(inputs['pixel_values'][0])
            processed_inputs['labels'].append(a)

    return processed_inputs

processed_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

John6666 · February 5, 2025, 1:54pm

It would be easy if this were the cause…
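A likely cause, offered as an assumption rather than a confirmed diagnosis: with batched=True, examples['questions'] is a list with one entry per example, and each entry is a dict holding parallel 'question' and 'answer' lists. Indexing that list with the string 'question' raises the reported error. The sketch below uses a toy batch with placeholder images, and flatten_qa is a hypothetical helper name, not part of any library.

```python
# Toy batch mimicking what a batched map() is assumed to pass in:
# every column is a list with one entry per example, and each
# "questions" entry is a dict of parallel lists.
batch = {
    "image": ["img0", "img1"],  # placeholders standing in for PIL images
    "questions": [
        {"question": ["q0a", "q0b"], "answer": ["a0a", "a0b"]},
        {"question": ["q1a"], "answer": ["a1a"]},
    ],
}

# The access pattern from the question indexes a list with a string key.
try:
    batch["questions"]["question"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def flatten_qa(examples):
    """Emit one (image, question, answer) row per question-answer pair."""
    rows = {"image": [], "question": [], "answer": []}
    # Iterate over examples first, then index into each example's dict.
    for img, qa in zip(examples["image"], examples["questions"]):
        for q, a in zip(qa["question"], qa["answer"]):
            rows["image"].append(img)
            rows["question"].append(q)
            rows["answer"].append(a)
    return rows


flat = flatten_qa(batch)
```

In preprocess_function, the same loop shape would replace the three-way zip, with the processor call kept inside the inner loop.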

sourabhY · February 5, 2025, 2:08pm

Man, thank you, but when I run the same script in Colab I get a different error.
Something like: remove column names train, validation not present, found image… Something like this.

John6666 · February 5, 2025, 2:42pm

There are actually errors that only occur in Colab, but the most common case is that the library versions differ, which changes the error message. Try upgrading or downgrading the library versions in Colab. The versions of accelerate, transformers, datasets, and huggingface_hub have a significant impact.
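To compare the two environments, it may help to print the installed versions of those four libraries in both places. A small sketch using only the standard library; installed_versions is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["accelerate", "transformers", "datasets", "huggingface_hub"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

If the output differs between the local machine and Colab, pinning a package with pip install "datasets==<version>" (using the version from the working environment) is one way to line them up.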

sourabhY · February 5, 2025, 3:16pm

Can you tell me how I should approach fine-tuning and how to get started?
I have been stuck on just loading the dataset for the past 2 days.

John6666 · February 5, 2025, 3:30pm

I have some knowledge of trouble patterns, but I’m still a beginner when it comes to generative AI. I’m not suited to teaching because I don’t know much about the essence of training.

I think the best way to get started is to try out some courses.

Another option is to ask questions on HF Discord, which has more people than the forum. There are a lot of people on there who are knowledgeable about NLP. If you’re used to using Discord, it should be easy, but if you’re not, I’ll explain.
