Add a new column to a dataset

My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it:

dataset = dataset.add_column('embeddings', embeddings)

The variable embeddings is a NumPy memmap array of shape (5000000, 512).

But I get this error:

ArrowInvalid Traceback (most recent call last)
in
----> 1 dataset = dataset.add_column('embeddings', embeddings)

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
486 }
487 # apply actual function
--> 488 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
489 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
490 # re-apply format to the output

/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
404 # Call actual function
405
--> 406 out = func(self, *args, **kwargs)
407
408 # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in add_column(self, name, column, new_fingerprint)
3346 :class:`Dataset`
3347 """
--> 3348 column_table = InMemoryTable.from_pydict({name: column})
3349 # Concatenate tables horizontally
3350 table = ConcatenationTable.from_tables([self._data, column_table], axis=1)

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
367 @classmethod
368 def from_pydict(cls, *args, **kwargs):
--> 369 return cls(pa.Table.from_pydict(*args, **kwargs))
370
371 @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

How can I solve this?

Hi,

it should work if you use concatenate_datasets instead:

import datasets
dset_embed = datasets.Dataset.from_dict({"embeddings": embeddings})
dset_concat = datasets.concatenate_datasets([dset, dset_embed], axis=1)

I also have the problem that the 'embeddings' array does not fit in RAM, so I suspect that the method you are proposing is not actually feasible.

I've tried it anyway, and I got this error:

ArrowInvalid Traceback (most recent call last)
in
1 import datasets
----> 2 dataset_embed = datasets.Dataset.from_dict({"embeddings": embeddings})

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in from_dict(cls, mapping, features, info, split)
783 for col, data in mapping.items()
784 }
--> 785 pa_table = InMemoryTable.from_pydict(mapping=mapping)
786 return cls(pa_table, info=info, split=split)
787

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
367 @classmethod
368 def from_pydict(cls, *args, **kwargs):
--> 369 return cls(pa.Table.from_pydict(*args, **kwargs))
370
371 @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in arrow_array(self, type)
111 out = pa.ExtensionArray.from_storage(type, storage)
112 elif isinstance(self.data, np.ndarray):
--> 113 out = numpy_to_pyarrow_listarray(self.data)
114 if type is not None:
115 out = out.cast(type)

/opt/conda/lib/python3.8/site-packages/datasets/features/features.py in numpy_to_pyarrow_listarray(arr, type)
921 n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
922 step_offsets = arr.shape[arr.ndim - i - 1]
--> 923 offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
924 values = pa.ListArray.from_arrays(offsets, values)
925 return values

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Integer value 2147483648 not in range: -2147483648 to 2147483647


Hi! I'm not sure whether datasets is able to convert a memory-mapped NumPy array to an Arrow array without bringing the NumPy array into RAM (I haven't tested it).

However, the error you're getting looks more like an integer precision issue: the array you are passing has more than 2,147,483,647 elements (5,000,000 × 512 = 2,560,000,000), which overflows the int32 offsets Arrow uses for regular list arrays. Maybe you could try chunking your list of embeddings, getting a Dataset object per chunk, and then concatenating them to get the dataset with all the embeddings.
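Something along these lines might work (untested; the helper name, the chunk size, and the reuse of dset and embeddings from earlier in the thread are just placeholders, and each chunk stays well under the 2**31 - 1 element limit):

import datasets

def add_embeddings_in_chunks(dset, embeddings, chunk_size=100_000):
    # Build one small Dataset per chunk of the embeddings array.
    chunks = [
        datasets.Dataset.from_dict({"embeddings": embeddings[i : i + chunk_size]})
        for i in range(0, len(embeddings), chunk_size)
    ]
    # Stack the chunks row-wise, then attach the resulting column to the original dataset.
    dset_embed = datasets.concatenate_datasets(chunks)
    return datasets.concatenate_datasets([dset, dset_embed], axis=1)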

I ran into this issue and was able to address it using the chunking suggestion above. The following worked for me: where I was previously just calling convert_to_dataset, I can now pass the same arguments to ChunkedDataset:

import numpy as np
import torch
from datasets import Dataset
from tqdm import tqdm

def convert_to_dataset(texts, labels, tokenizer, max_sequence_length):
    # Tokenize one chunk of texts and wrap the result in a datasets.Dataset.
    inputs = tokenizer(texts, padding="max_length", max_length=max_sequence_length, truncation=True)
    inputs['label'] = labels
    return Dataset.from_dict(inputs)

class ChunkedDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_sequence_length=512, chunk_size=20000):
        self.chunk_size = chunk_size
        self.datasets = []
        # Build one datasets.Dataset per chunk so no single Arrow table grows too large.
        for i in tqdm(range(len(texts) // chunk_size + 1)):
            batch_slice = slice(i * chunk_size, (i + 1) * chunk_size)
            if len(texts[batch_slice]):
                self.datasets.append(convert_to_dataset(texts[batch_slice], labels[batch_slice], tokenizer, max_sequence_length))

    def __len__(self):
        return int(np.sum([len(x) for x in self.datasets]))

    def __getitem__(self, idx):
        # Map the global index to the chunk that contains it, then index within that chunk.
        dataset_idx = idx // self.chunk_size
        idx = idx % self.chunk_size
        return self.datasets[dataset_idx][idx]
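
For reference, a hypothetical usage sketch, assuming texts is a list of strings, labels a list of integer labels, and a Hugging Face tokenizer (the model name is arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
chunked = ChunkedDataset(texts, labels, tokenizer, max_sequence_length=128, chunk_size=20000)
print(len(chunked))       # total number of examples across all chunks
print(chunked[0].keys())  # e.g. input_ids, attention_mask, label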

I ran into this problem too, trying to do something like this:

datasets.Dataset.from_dict({"embeds": embeddings})

Really annoying limitation. I shouldn't have to implement chunking myself.

Feel free to open an issue on GitHub so we can investigate that 🙂

But isn't this what this old PR was trying to solve? https://github.com/huggingface/datasets/pull/4800

Ah yes, it's closely related indeed! Thanks for bringing this PR up.

I'll try to find some time to dive into this again. IIRC it was mostly a matter of finding the appropriate API to switch from regular to large lists in Arrow.
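
For context, here is a small standalone PyArrow illustration of the regular vs. large list difference (not the datasets code itself, just a sketch): regular ListArray offsets are int32, so the flattened values of a single column are capped at 2,147,483,647 elements, while LargeListArray uses int64 offsets.

import numpy as np
import pyarrow as pa

# A (num_rows, dim) float array flattened into one values buffer.
arr = np.random.rand(1000, 512).astype(np.float32)
values = pa.array(arr.ravel())
offsets = np.arange(arr.shape[0] + 1) * arr.shape[1]

# Regular ListArray: int32 offsets, so the total number of flattened values
# must stay below 2**31 - 1 (which 5_000_000 * 512 = 2_560_000_000 does not).
list_arr = pa.ListArray.from_arrays(pa.array(offsets, type=pa.int32()), values)

# LargeListArray: int64 offsets, so the same construction scales past 2**31.
large_list_arr = pa.LargeListArray.from_arrays(pa.array(offsets, type=pa.int64()), values)

print(list_arr.type)        # list<item: float>
print(large_list_arr.type)  # large_list<item: float>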