Hierarchical data in datasets - seeking best practices

I am creating my first small dataset. I can explore it, but when I call load_dataset() I get an error (below). Before I get too carried away, I would like to check that I am on the right path.

The data consists of a screenplay (script) for each episode of a short animated cartoon, plus the existing hierarchies of objects created in a tool (Unity). For example, there are “shots”, each of which needs a set of characters, lighting effects, camera position, exposure, etc. I am exploring how much of the shot hierarchy can be created from the screenplay text alone.

I have the contents as a deeply nested JSON object tree (episode / location / sequence / shot, and then under each shot: characters, cameras, lights, and props, with animation tracks for positioning, facial expressions, body poses, etc.). I already have animation clips for “walk”, “sit”, etc., so the goal is not full-body animation; it is scene assembly from building blocks.

I was keeping the data hierarchical so I can expand it at runtime in different ways. Shots are the most logical focus, but perhaps I need to provide information about the previous shots as well for context. For example, a shot described as “Sam sits down” needs to know where Sam was standing, which might have been described in the previous shot.

So question (1): should I flatten the data as much as possible inside the dataset, or is it okay to keep the hierarchy and flatten it in pre-processing, so I can experiment with flattening it in different ways?
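
To make the pre-processing option concrete, here is a minimal sketch of what I have in mind: one flat row per shot, with the action text of the preceding shot(s) carried along as context. The field names (screenplay, location, shots, id, actions, text, dialog, speaker, lines, line) come from the schema in the error message below; the function name flatten_shots, the context_size parameter, the output column names, and the sample episode are all made up for illustration.

```python
from typing import Any, Dict, Iterator, List


def flatten_shots(episode: Dict[str, Any], context_size: int = 1) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per shot, carrying the action text of earlier shots as context.

    Hypothetical helper: assumes the episode layout shown in the error message
    (screenplay -> shots -> actions / dialog).
    """
    history: List[str] = []  # action text of shots seen so far in this episode
    for scene in episode.get("screenplay", []):
        for shot in scene.get("shots", []):
            # Join all action paragraphs of the shot into one string.
            actions = " ".join(a.get("text", "") for a in shot.get("actions", []))
            # One "Speaker: text" string per dialog entry.
            dialog = [
                d.get("speaker", "") + ": " + " ".join(l.get("line", "") for l in d.get("lines", []))
                for d in shot.get("dialog", [])
            ]
            yield {
                "episodeNumber": episode.get("episodeNumber"),
                "location": scene.get("location"),
                "shot_id": shot.get("id"),
                "actions": actions,
                "dialog": dialog,
                # Guard against context_size == 0, since history[-0:] is the whole list.
                "previous_actions": history[-context_size:] if context_size > 0 else [],
            }
            history.append(actions)


# Made-up sample episode, only to show the shape of the output.
episode = {
    "episodeNumber": "1",
    "screenplay": [
        {
            "location": "Kitchen",
            "intro": [],
            "shots": [
                {"id": "10", "actions": [{"para": "1", "text": "Sam stands by the door."}], "dialog": []},
                {
                    "id": "20",
                    "actions": [{"para": "2", "text": "Sam sits down."}],
                    "dialog": [{"speaker": "Sam", "lines": [{"line": "Hi."}], "mood": "happy"}],
                },
            ],
        }
    ],
}

rows = list(flatten_shots(episode))
for row in rows:
    print(row["shot_id"], row["previous_actions"], row["actions"])
```

With this approach the dataset itself stays hierarchical, and the flattening (including how much previous-shot context to include) is something I can change per experiment. I am not sure whether that is considered good practice, hence the question.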

If keeping the hierarchy is okay, question (2) is the problem below. The documentation I have read says I only need to provide JSON files (not a loader script), but I am wondering whether such a deeply hierarchical structure means I need to write a loader script after all.

Thank you!


The code I am using is:

from datasets import load_dataset
dataset = load_dataset("alankent/ordinary_screenplays", split="train")

The error I am getting is:

ValueError: Couldn't cast
episodeNumber: string
title: string
name: string
screenplay: list<item: struct<location: string, intro: list<item: null>, shots: list<item: struct<id: string, actions: list<item: struct<para: string, text: string>>, dialog: list<item: struct<speaker: string, lines: list<item: struct<line: string>>, mood: string>>>>>>
  child 0, item: struct<location: string, intro: list<item: null>, shots: list<item: struct<id: string, actions: list<item: struct<para: string, text: string>>, dialog: list<item: struct<speaker: string, lines: list<item: struct<line: string>>, mood: string>>>>>
      child 0, location: string
      child 1, intro: list<item: null>
          child 0, item: null
      child 2, shots: list<item: struct<id: string, actions: list<item: struct<para: string, text: string>>, dialog: list<item: struct<speaker: string, lines: list<item: struct<line: string>>, mood: string>>>>
          child 0, item: struct<id: string, actions: list<item: struct<para: string, text: string>>, dialog: list<item: struct<speaker: string, lines: list<item: struct<line: string>>, mood: string>>>
              child 0, id: string
              child 1, actions: list<item: struct<para: string, text: string>>
                  child 0, item: struct<para: string, text: string>
                      child 0, para: string
                      child 1, text: string
              child 2, dialog: list<item: struct<speaker: string, lines: list<item: struct<line: string>>, mood: string>>
                  child 0, item: struct<speaker: string, lines: list<item: struct<line: string>>, mood: string>
                      child 0, speaker: string
                      child 1, lines: list<item: struct<line: string>>
                          child 0, item: struct<line: string>
                              child 0, line: string
                      child 2, mood: string
to
{'episodeNumber': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'scenes': [{'project': Value(dtype='string', id=None), 'scene': {'name': Value(dtype='string', id=None), 'parts': [{'name': Value(dtype='string', id=None), 'shots': [{'id': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'characters': [{'name': Value(dtype='string', id=None)}]}]}]}}], 'screenplay': [{'location': Value(dtype='string', id=None), 'intro': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'shots': [{'id': Value(dtype='string', id=None), 'actions': [{'para': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}], 'dialog': [{'speaker': Value(dtype='string', id=None), 'lines': [{'line': Value(dtype='string', id=None)}], 'mood': Value(dtype='string', id=None)}]}]}]}
because column names don't match
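
Comparing the two schemas in the error, the only difference I can see is that the target features include a 'scenes' column, while the data being cast has only episodeNumber, title, name and screenplay. My guess (and it is only a guess) is that some of my JSON files have a "scenes" key and others do not. Below is a small sketch I could use to check this locally. It assumes each file holds a single JSON object for one episode; the helper names key_report and load_records and the sample records are made up for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Set


def key_report(records: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """For each named record, list the top-level keys it lacks compared with the union of all keys."""
    all_keys: Set[str] = set()
    for rec in records.values():
        all_keys.update(rec.keys())
    return {name: sorted(all_keys - set(rec)) for name, rec in records.items()}


def load_records(folder: str) -> Dict[str, Dict[str, Any]]:
    """Read every *.json file in a folder, assuming one JSON object (one episode) per file."""
    return {
        p.name: json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(Path(folder).glob("*.json"))
    }


# Made-up records standing in for two episode files.
records = {
    "ep1.json": {"episodeNumber": "1", "title": "A", "name": "a", "scenes": [], "screenplay": []},
    "ep2.json": {"episodeNumber": "2", "title": "B", "name": "b", "screenplay": []},
}

report = key_report(records)
for name, missing in report.items():
    print(name, "is missing:", missing)
```

If the report shows files without "scenes", is the right fix to make every file carry the same top-level keys, or is there a way to tell load_dataset() that a column is optional?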