Loading nested dataset for training

I am trying to use allenai/pixmo-docs, which has the following structure:

dataset_info:

  • config_name: charts
    features:
    • name: image
      dtype: image
    • name: image_id
      dtype: string
    • name: questions
      sequence:
      • name: question
        dtype: string
      • name: answer
        dtype: string

I am using the code below and getting a "list indices must be integers or slices" error, and I don't know what to do. Please help!

def preprocess_function(examples):
    processed_inputs = {
        'input_ids': [],
        'attention_mask': [],
        'pixel_values': [],
        'labels': []
    }

    for img, questions, answers in zip(examples['image'], examples['questions']['question'], examples['questions']['answer']):
        for q, a in zip(questions, answers):
            inputs = processor(images=img, text=q, padding="max_length", truncation=True, return_tensors="pt")

            processed_inputs['input_ids'].append(inputs['input_ids'][0])
            processed_inputs['attention_mask'].append(inputs['attention_mask'][0])
            processed_inputs['pixel_values'].append(inputs['pixel_values'][0])
            processed_inputs['labels'].append(a)

    return processed_inputs

processed_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

It would be easy if this were the cause…
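
One possible cause, sketched under assumptions: with `batched=True`, each column arrives as a list with one entry per row, so `examples['questions']` is a list and indexing it with `'question'` would raise exactly this kind of error. The sketch below assumes each row's `questions` value is either a dict of lists or a list of dicts (the exact shape may depend on the `datasets` version). The helper name `flatten_qa_batch` is hypothetical, and the processor call is left out so the snippet runs on plain Python data.

```python
def flatten_qa_batch(examples):
    """Turn one batch of rows into one output row per (question, answer) pair."""
    flat = {"image": [], "question": [], "answer": []}
    # Each column is a list with one entry per row, so iterate row by row
    # instead of indexing examples["questions"] with a string key.
    for img, qa in zip(examples["image"], examples["questions"]):
        if isinstance(qa, dict):
            # Assumed shape: {"question": [...], "answer": [...]}
            pairs = zip(qa["question"], qa["answer"])
        else:
            # Alternative assumed shape: [{"question": ..., "answer": ...}, ...]
            pairs = ((item["question"], item["answer"]) for item in qa)
        for q, a in pairs:
            flat["image"].append(img)
            flat["question"].append(q)
            flat["answer"].append(a)
    return flat


# A fake batch of two rows; the strings stand in for real image objects.
batch = {
    "image": ["img0", "img1"],
    "questions": [
        {"question": ["q0a", "q0b"], "answer": ["a0a", "a0b"]},
        {"question": ["q1a"], "answer": ["a1a"]},
    ],
}
print(flatten_qa_batch(batch))
```

With the real dataset, the same row-by-row loop could replace the `zip(...)` line in `preprocess_function`. Note also that `column_names` on a `DatasetDict` is a dict keyed by split name rather than a flat list, so it may be safer to map one split at a time, for example `dataset["train"].map(..., remove_columns=dataset["train"].column_names)`.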

Man, thank you, but when I run the same script in Colab I get a different error.
Something like: the column names to remove (train, validation) are not present, found image…

There are errors that occur only in Colab, but the most common cause is a different library version, which changes the error message. Let's try upgrading and downgrading the library versions in Colab. accelerate, transformers, datasets, and huggingface_hub have a significant impact.
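
To compare environments, it may help to print the installed versions of those four libraries both locally and in Colab. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# The four libraries named above.
PACKAGES = ["accelerate", "transformers", "datasets", "huggingface_hub"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

If the two environments report different versions, pinning the same versions in both (for example with `pip install "datasets==<version>"`) is one way to check whether the version difference explains the different error.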

Can you tell me how I should approach fine-tuning and how to get started?
I have been stuck on just loading the dataset for the past 2 days.

I have some knowledge of common trouble patterns, but I'm still a beginner when it comes to generative AI. I'm not suited to teaching because I don't know much about the fundamentals of training.

I think the best way to get started is to try out some courses.

Another option is to ask questions on HF Discord, which has more people than the forum. There are a lot of people on there who are knowledgeable about NLP. If you're used to using Discord, it should be easy, but if you're not, I'll explain.