Add new column to a dataset

My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it.

dataset = dataset.add_column('embeddings', embeddings)

The variable embeddings is a NumPy memmap array of shape (5000000, 512).

But I get this error:

ArrowInvalidTraceback (most recent call last)
in
----> 1 dataset = dataset.add_column('embeddings', embeddings)

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
486 }
487 # apply actual function
--> 488 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
489 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
490 # re-apply format to the output

/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
404 # Call actual function
405
--> 406 out = func(self, *args, **kwargs)
407
408 # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in add_column(self, name, column, new_fingerprint)
3346 :class:Dataset
3347 """
--> 3348 column_table = InMemoryTable.from_pydict({name: column})
3349 # Concatenate tables horizontally
3350 table = ConcatenationTable.from_tables([self._data, column_table], axis=1)

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
367 @classmethod
368 def from_pydict(cls, *args, **kwargs):
--> 369 return cls(pa.Table.from_pydict(*args, **kwargs))
370
371 @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

How can I solve?

Hi,

it should work if you use concatenate_datasets instead:

import datasets
dset_embed = datasets.Dataset.from_dict({"embeddings": embeddings})
dset_concat = datasets.concatenate_datasets([dataset, dset_embed], axis=1)

I also have the problem that the array 'embeddings' does not fit in RAM, so I suspect that the method you are proposing is not actually feasible.

I tried it anyway and got this error:

ArrowInvalidTraceback (most recent call last)
in
1 import datasets
----> 2 dataset_embed = datasets.Dataset.from_dict({"embeddings": embeddings})

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in from_dict(cls, mapping, features, info, split)
783 for col, data in mapping.items()
784 }
--> 785 pa_table = InMemoryTable.from_pydict(mapping=mapping)
786 return cls(pa_table, info=info, split=split)
787

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
367 @classmethod
368 def from_pydict(cls, *args, **kwargs):
--> 369 return cls(pa.Table.from_pydict(*args, **kwargs))
370
371 @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in arrow_array(self, type)
111 out = pa.ExtensionArray.from_storage(type, storage)
112 elif isinstance(self.data, np.ndarray):
--> 113 out = numpy_to_pyarrow_listarray(self.data)
114 if type is not None:
115 out = out.cast(type)

/opt/conda/lib/python3.8/site-packages/datasets/features/features.py in numpy_to_pyarrow_listarray(arr, type)
921 n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
922 step_offsets = arr.shape[arr.ndim - i - 1]
--> 923 offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
924 values = pa.ListArray.from_arrays(offsets, values)
925 return values

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Integer value 2147483648 not in range: -2147483648 to 2147483647


Hi! I'm not sure whether datasets is able to convert a memory-mapped NumPy array to an Arrow array without bringing the NumPy array into RAM (I haven't tested).

However, the error you're getting looks more like an integer precision issue: the array you are passing has more than 2,147,483,647 elements (5,000,000 × 512 = 2,560,000,000), which overflows the 32-bit offsets Arrow uses for list arrays. You could try chunking your list of embeddings, building a dataset object per chunk, and then concatenating them to get the dataset with all the embeddings, as in the sketch below.
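
For example, a minimal sketch of that chunking approach (embeddings_to_dataset and chunk_size=100_000 are hypothetical names/values; embeddings and dataset are the variables from the original post):

import datasets

# Hypothetical helper: build the embeddings dataset chunk by chunk, so no single
# Arrow array needs more than ~2**31 values (the int32 offset limit hit above).
def embeddings_to_dataset(embeddings, chunk_size=100_000):
    chunks = []
    for start in range(0, len(embeddings), chunk_size):
        chunk = embeddings[start:start + chunk_size]  # a (<= chunk_size, 512) slice of the memmap
        chunks.append(datasets.Dataset.from_dict({"embeddings": chunk}))
    # Stack the per-chunk datasets back together row-wise (axis=0 is the default)
    return datasets.concatenate_datasets(chunks)

dset_embed = embeddings_to_dataset(embeddings)
# Then attach the embeddings to the original dataset column-wise, as suggested above
dset_concat = datasets.concatenate_datasets([dataset, dset_embed], axis=1)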

I ran into this issue and was able to address it using the chunking suggestion above. The following worked for me: where I was previously just using convert_to_dataset, I can now pass the same arguments to ChunkedDataset:

import numpy as np
import torch
from datasets import Dataset
from tqdm import tqdm

def convert_to_dataset(texts, labels, tokenizer, max_sequence_length):
    # Tokenize one chunk of texts and wrap the encodings (plus labels) in a datasets.Dataset
    inputs = tokenizer(texts, padding="max_length", max_length=max_sequence_length, truncation=True)
    inputs['label'] = labels
    return Dataset.from_dict(inputs)

class ChunkedDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_sequence_length=512, chunk_size=20000):
        self.chunk_size = chunk_size
        self.datasets = []
        # Build one datasets.Dataset per chunk so no single Arrow array grows too large
        for i in tqdm(range(len(texts) // chunk_size + 1)):
            batch_slice = slice(i * chunk_size, (i + 1) * chunk_size)
            if len(texts[batch_slice]):
                self.datasets.append(convert_to_dataset(texts[batch_slice], labels[batch_slice], tokenizer, max_sequence_length))

    def __len__(self):
        return np.sum([len(x) for x in self.datasets])

    def __getitem__(self, idx):
        # Map the global index to the chunk it lives in, then to the position inside that chunk
        dataset_idx = idx // self.chunk_size
        idx = idx % self.chunk_size
        return self.datasets[dataset_idx][idx]
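
For reference, a hypothetical usage sketch (texts, labels, and tokenizer are placeholder names for a list of strings, a list of integer labels, and a Hugging Face tokenizer):

# Illustrative only: texts, labels and tokenizer are assumed to already exist
train_dataset = ChunkedDataset(texts, labels, tokenizer, max_sequence_length=512, chunk_size=20000)
print(len(train_dataset))   # total number of examples across all chunks
print(train_dataset[0])     # tokenized inputs + label for the first example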

Ran into this problem too, trying to do something like this:

datasets.Dataset.from_dict({"embeds": embeddings})

Really annoying limitation. I shouldn’t have to implement chunking myself.

Feel free to open an issue on GitHub so we can investigate that 🙂

But isn't this what this old PR was trying to solve? https://github.com/huggingface/datasets/pull/4800

Ah yes, it's closely related indeed! Thanks for bringing this PR up.

I'll try to find some time to dive into this again. IIRC it was mostly a matter of finding the appropriate API to switch from regular lists to large lists in Arrow.
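
For context, a minimal PyArrow sketch of that regular-list vs large-list distinction (the float32 value type is just an illustrative choice):

import pyarrow as pa

# Regular list arrays store int32 offsets, so one array is capped at 2**31 - 1
# underlying values; large_list stores int64 offsets and lifts that limit.
regular = pa.list_(pa.float32())       # ListType with 32-bit offsets
large = pa.large_list(pa.float32())    # LargeListType with 64-bit offsets
print(regular)   # list<item: float>
print(large)     # large_list<item: float>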