Uploading 3D Numpy Array Dataset

I have a series of 3D seismic volumes (NumPy arrays) that I would like to upload as a dataset, but I see that the .npy data type is not supported. Is there a workaround to upload the 3D arrays to the HF Dataset Hub?

Thanks!

How about converting it to Parquet? (It's the recommended format; see the Uploading datasets docs.)

The following steps should be enough if you can use pandas:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet

or arrow:
https://arrow.apache.org/docs/python/parquet.html

Thanks for this @mahmutc

Here is the solution I came up with. My 3D seismic volumes are arrays with a shape of (300, 300, 1259). The code below converts one array to a Parquet file; I loop this over all seismic files in the training dataset to create .parquet versions.

import numpy as np
import pandas as pd

def convert_to_parquet(array, file_name, folder):
    # Reshape the 3D array into a 2D array where each row is one slice along the third axis
    reshaped_array = array.reshape(-1, array.shape[2])
    # Parquet column names must be strings
    column_names = [f'{i}' for i in range(array.shape[2])]
    # Create a pandas DataFrame with the string-based column names
    df = pd.DataFrame(reshaped_array, columns=column_names)
    # Optionally, add 'Row' and 'Col' identifiers for the original 3D coordinates
    df['Row'] = np.repeat(np.arange(array.shape[0]), array.shape[1])
    df['Col'] = np.tile(np.arange(array.shape[1]), array.shape[0])
    # Reorder the columns to have 'Row' and 'Col' first
    df = df[['Row', 'Col'] + column_names]
    df.to_parquet(f'{folder}/{file_name}.parquet')

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.