Add new column to a dataset

My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it.

dataset = dataset.add_column('embeddings', embeddings)

The variable embeddings is a NumPy memmap array of shape (5000000, 512).

But I get this error:

ArrowInvalidTraceback (most recent call last)
in
----> 1 dataset = dataset.add_column('embeddings', embeddings)

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
486 }
487 # apply actual function
--> 488 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
489 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
490 # re-apply format to the output

/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
404 # Call actual function
405
--> 406 out = func(self, *args, **kwargs)
407
408 # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in add_column(self, name, column, new_fingerprint)
3346 :class:Dataset
3347 """
--> 3348 column_table = InMemoryTable.from_pydict({name: column})
3349 # Concatenate tables horizontally
3350 table = ConcatenationTable.from_tables([self._data, column_table], axis=1)

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
367 @classmethod
368 def from_pydict(cls, *args, **kwargs):
--> 369 return cls(pa.Table.from_pydict(*args, **kwargs))
370
371 @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

How can I solve?

Hi,

it should work if you use concatenate_datasets instead:

import datasets
dset_embed = datasets.Dataset.from_dict({"embeddings": embeddings})
dset_concat = datasets.concatenate_datasets([dataset, dset_embed], axis=1)

I also have the problem that the array 'embeddings' does not fit in RAM, so I suspect that the method you are proposing is not actually feasible.

I tried it anyway and got this error:

ArrowInvalidTraceback (most recent call last)
in
1 import datasets
----> 2 dataset_embed = datasets.Dataset.from_dict({"embeddings": embeddings})

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in from_dict(cls, mapping, features, info, split)
783 for col, data in mapping.items()
784 }
--> 785 pa_table = InMemoryTable.from_pydict(mapping=mapping)
786 return cls(pa_table, info=info, split=split)
787

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
367 @classmethod
368 def from_pydict(cls, *args, **kwargs):
--> 369 return cls(pa.Table.from_pydict(*args, **kwargs))
370
371 @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in arrow_array(self, type)
111 out = pa.ExtensionArray.from_storage(type, storage)
112 elif isinstance(self.data, np.ndarray):
--> 113 out = numpy_to_pyarrow_listarray(self.data)
114 if type is not None:
115 out = out.cast(type)

/opt/conda/lib/python3.8/site-packages/datasets/features/features.py in numpy_to_pyarrow_listarray(arr, type)
921 n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
922 step_offsets = arr.shape[arr.ndim - i - 1]
--> 923 offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
924 values = pa.ListArray.from_arrays(offsets, values)
925 return values

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Integer value 2147483648 not in range: -2147483648 to 2147483647


Hi! I'm not sure whether datasets is able to convert a memory-mapped NumPy array to an Arrow array without bringing the NumPy array into RAM (I haven't tested).

However, the error you're getting looks more like an integer precision issue: the array you are passing has more than 2,147,483,647 elements (5,000,000 × 512 = 2,560,000,000), which overflows the 32-bit offsets Arrow uses for list arrays. You could try chunking your list of embeddings, building a dataset object per chunk, and then concatenating them to get the dataset with all the embeddings, as in the sketch below.
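
For example, a minimal sketch of that chunking approach (embeddings_to_dataset and chunk_size=100_000 are hypothetical names/values; embeddings and dataset are the variables from the original post):

import datasets

# Hypothetical helper: build the embeddings dataset chunk by chunk, so no single
# Arrow array needs more than ~2**31 values (the int32 offset limit hit above).
def embeddings_to_dataset(embeddings, chunk_size=100_000):
    chunks = []
    for start in range(0, len(embeddings), chunk_size):
        chunk = embeddings[start:start + chunk_size]  # a (<= chunk_size, 512) slice of the memmap
        chunks.append(datasets.Dataset.from_dict({"embeddings": chunk}))
    # Stack the per-chunk datasets back together row-wise (axis=0 is the default)
    return datasets.concatenate_datasets(chunks)

dset_embed = embeddings_to_dataset(embeddings)
# Then attach the embeddings to the original dataset column-wise, as suggested above
dset_concat = datasets.concatenate_datasets([dataset, dset_embed], axis=1)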

I ran into this issue and was able to address it using the chunking suggestion above. The following worked for me: where I was previously just using convert_to_dataset, I can now pass the same arguments to ChunkedDataset:

import numpy as np
import torch
from datasets import Dataset
from tqdm import tqdm

def convert_to_dataset(texts, labels, tokenizer, max_sequence_length):
    # Tokenize one chunk of texts and wrap the encodings (plus labels) in a datasets.Dataset
    inputs = tokenizer(texts, padding="max_length", max_length=max_sequence_length, truncation=True)
    inputs['label'] = labels
    return Dataset.from_dict(inputs)

class ChunkedDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_sequence_length=512, chunk_size=20000):
        self.chunk_size = chunk_size
        self.datasets = []
        # Build one datasets.Dataset per chunk so no single Arrow array grows too large
        for i in tqdm(range(len(texts) // chunk_size + 1)):
            batch_slice = slice(i * chunk_size, (i + 1) * chunk_size)
            if len(texts[batch_slice]):
                self.datasets.append(convert_to_dataset(texts[batch_slice], labels[batch_slice], tokenizer, max_sequence_length))

    def __len__(self):
        return np.sum([len(x) for x in self.datasets])

    def __getitem__(self, idx):
        # Map the global index to the chunk it lives in, then to the position inside that chunk
        dataset_idx = idx // self.chunk_size
        idx = idx % self.chunk_size
        return self.datasets[dataset_idx][idx]
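
For reference, a hypothetical usage sketch (texts, labels, and tokenizer are placeholder names for a list of strings, a list of integer labels, and a Hugging Face tokenizer):

# Illustrative only: texts, labels and tokenizer are assumed to already exist
train_dataset = ChunkedDataset(texts, labels, tokenizer, max_sequence_length=512, chunk_size=20000)
print(len(train_dataset))   # total number of examples across all chunks
print(train_dataset[0])     # tokenized inputs + label for the first example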

Ran into this problem too, trying to do something like this:

datasets.Dataset.from_dict({"embeds": embeddings})

Really annoying limitation. I shouldn’t have to implement chunking myself.

Feel free to open an issue on GitHub so we can investigate that 🙂

But isn't this what this old PR was trying to solve? https://github.com/huggingface/datasets/pull/4800

Ah yes, it's closely related indeed! Thanks for bringing this PR up.

I'll try to find some time to dive into this again. IIRC it was mostly a matter of finding the appropriate API to switch from regular lists to large lists in Arrow.
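
For context, a minimal PyArrow sketch of that regular-list vs large-list distinction (the float32 value type is just an illustrative choice):

import pyarrow as pa

# Regular list arrays store int32 offsets, so one array is capped at 2**31 - 1
# underlying values; large_list stores int64 offsets and lifts that limit.
regular = pa.list_(pa.float32())       # ListType with 32-bit offsets
large = pa.large_list(pa.float32())    # LargeListType with 64-bit offsets
print(regular)   # list<item: float>
print(large)     # large_list<item: float>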