Error in Dataset Map Function

Hello all, I have a dataset object train_ds.

Output:

Dataset({
    features: ['filepath', 'class', 'fold'],
    num_rows: 6810
})

When I map a simple preprocessing function over it, everything works correctly:

def preprocess_function(examples):
    examples['newclass'] = examples['class']
      return examples

train_dataset = train_ds.map(
    preprocess_function,
    batch_size=100,
    batched=True,
    num_proc=4,
    load_from_cache_file=False
)
Dataset({
    features: ['filepath', 'class', 'fold', 'newclass'],
    num_rows: 6810
})

However, it seems I cannot use anything defined outside preprocess_function, or something else is going wrong. For example, this fails:

def preprocess_function(examples):
    examples['audio'] = [torchaudio.load(path) for path in examples['filepath']]
    return examples

I get the same error even if I define a trivial function of my own and call it inside preprocess_function. It’s as if the function “forgets” all other variables and functions in the notebook when it runs.

def test_function(path):
    return path

# This function should add a new column named "audio" containing the file paths as text
def preprocess_function(examples):
    examples['audio'] = [test_function(path) for path in examples['filepath']]
    return examples

train_dataset = train_ds.map(
    preprocess_function,
    batch_size=100,
    batched=True,
    num_proc=4,
    load_from_cache_file=False
)
    769     return self._value
    770 else:
--> 771     raise self._value

NameError: name 'test_function' is not defined

Can someone help me identify what’s going on? I am using this as an example but working in VSCode:

UPDATE: I have narrowed it down to num_proc > 1. I will search the forums now that I have more specific information. If I fix it before someone replies, I will post my solution here for anyone else who runs into this.

It had to do with num_proc. Values of num_proc greater than 1 were giving me errors. I changed it to 1, if I recall correctly.
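Since the error only appears with more than one process, it can help to check how Python starts worker processes on your machine. Here is a minimal sketch using only the standard library. The helper name worker_start_method is made up for illustration, and I am assuming that the multiprocessing backend used by datasets follows the same platform defaults as the standard library:

```python
import multiprocessing as mp
import sys


def worker_start_method():
    """Return the default start method for worker processes on this platform."""
    # "spawn" launches a fresh interpreter for each worker;
    # "fork" copies the parent process, including its already-defined names.
    return mp.get_start_method()


if __name__ == "__main__":
    print(sys.platform, worker_start_method())
```

If this reports spawn (the default on Windows), each worker starts from a fresh interpreter, so names that exist only in the interactive session may not be visible to it. That would be consistent with the NameError above, though I have not confirmed it for this exact setup.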

Hi! If you are using Windows, you need to put multiprocessing calls inside if __name__ == "__main__". To learn why this is required, check this SO thread.
