I’m trying to make sure a script I’m hacking on works end-to-end, and waiting for training epochs to finish eats up a lot of time. I’ve cut the number of epochs and the batch size down to 1, but I’m guessing the dataset I’m using is just big enough that it still takes a long time to get through the batches.
I’m using some code from the GLUE example and it does the following:
dataset = datasets.load_dataset("glue", task)
I’d like to have it only take a set number of samples so I can iterate quicker.
Some things I’ve tried:
dataset = datasets.load_dataset("glue", g_task, split=split)
dataset = dataset[:20]
This complains with:
KeyError: "Invalid key: slice(None, 20, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(None, 20, None)]`. Available splits: ['test', 'train', 'validation']"
Fair, so it’s a dictionary. I then try this:
dataset = datasets.load_dataset("glue", g_task, split=split)
for k, v in dataset.items():
    dataset[k] = v[:20]
But then things blow up further on, because indexing no longer behaves the way the rest of the script expects:
Traceback (most recent call last):
File "D:\dev\valve\source2\main\src\devtools\k8s\dota\toxic-chat-ml\test\run_text_classification.py", line 157, in <module>
train_and_save()
File "D:\dev\valve\source2\main\src\devtools\k8s\dota\toxic-chat-ml\test\run_text_classification.py", line 55, in train_and_save
print(f"Sentence: {dataset['train'][0][sentence1_key]}")
KeyError: 0
Which makes sense: comparing the dict before and after the slice, it looks like I’m stomping the Dataset objects (and their metadata) with just the raw column output.
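(For what it’s worth, here’s a sketch of what I imagine a metadata-preserving version of that loop might look like, assuming Dataset.select works the way I read the docs; I haven’t verified this end-to-end:)

import datasets

dataset = datasets.load_dataset("glue", g_task)  # DatasetDict with train/validation/test
for k in dataset:
    # select() returns a new Dataset, so the features/metadata should survive,
    # unlike the plain dict of columns that v[:20] hands back
    dataset[k] = dataset[k].select(range(20))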
I then see that there’s a “split” argument I can pass to load_dataset that accepts slice syntax, but it seems to only work if I request a specific split (train/test/validation) and won’t play nice with this dictionary-based approach.
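To illustrate what I mean (the bracket slicing from the docs; both of these hand back plain Dataset objects rather than the DatasetDict the rest of my code expects, if I’ve read things right):

# one split with a slice - returns a single Dataset, not a DatasetDict
train_small = datasets.load_dataset("glue", g_task, split="train[:20]")

# a list of sliced splits - returns a list of Datasets, still not a DatasetDict
parts = datasets.load_dataset("glue", g_task, split=["train[:20]", "validation[:20]", "test[:20]"])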
Am I missing an alternative here?
Thanks.
-e-