How to add a new column to a dataset

chetnakhanna16 · May 29, 2021, 1:06am

How can I add a new column to my dataset?

I am working on the Cosmos QA dataset and need to add a new column of the following type:
Value(dtype='string', id=None)

The current dataset has the following features:
Dataset({
    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
    num_rows: 25262
})

Thank you in advance!

lhoestq · June 3, 2021, 11:47am

Hi! You can use the add_column method:

from datasets import load_dataset

ds = load_dataset("cosmos_qa", split="train")

new_column = ["foo"] * len(ds)
ds = ds.add_column("new_column", new_column)

and you get a dataset with the new column:

Dataset({
    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'new_column'],
    num_rows: 25262
})

chetnakhanna16 · June 6, 2021, 7:03pm

Thank you for your reply!

Unfortunately, when I ran the above code, I got the following error:
AttributeError: 'Dataset' object has no attribute 'add_column'

Just FYI, Datasets version = 1.1.2

Thank you!

sgugger · June 7, 2021, 11:51am

You need to update your datasets library to the latest version for this.

mjc00 · March 16, 2022, 9:16pm

Can you provide more details on this? I am running into the same problem as @chetnakhanna16

marshmellow77 · March 16, 2022, 9:24pm

On your machine, you need to run the command pip install datasets --upgrade to update your datasets library to the latest version.

Hope that helps!

mjc00 · March 16, 2022, 9:41pm

Here is my current datasets version:
datasets==2.0.0
It is the same after running the suggested command, and the problem still persists.

In essence I am trying to find a way to add the input_ids and attention_mask back into my dataset.

The reason I am doing this is that when I run this function:

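from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    # Tokenize the "text" field of a batch, padding and truncating as needed
    return tokenizer(batch["text"],
                     padding=True,
                     truncation=True)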

on a sample of my dataset → print(tokenize(clean_dataset["train"][:2])) I get the following error
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

However, when I run complete_tok = tokenizer(list(x_complete), truncation=True, padding=True), where x_complete is an np array, the tokenizer seems to run fine and creates input_ids and attention_mask.

marshmellow77 · March 16, 2022, 9:52pm

Can you verify that clean_dataset["train"][:2] is indeed of type str?

mjc00 · March 16, 2022, 10:03pm

Found my mistake → needed to cast the clean_Description batch values to a list, code runs now. Thank you

def tokenize(batch):
    return tokenizer(list(batch["clean_Description"].values), 
                     padding=True, 
                     truncation=True)

crossdelenna · July 26, 2022, 4:31am

Hi guys, I am new here and just started using this. For those like me who are facing the same issue: I think the error occurs because the dataset is of type 'dict', and that's why it gives AttributeError: 'Dataset' object has no attribute 'add_column'

My dataset structure is
Dataset({
    train: ['index', 'text', 'file'],
    num_rows: 5000
})

So this is how I solved it

my_data["train"] = my_data["train"].add_column("audio", temp_data["audio"])

Hope I am not doing anything wrong.

DannyAI · August 10, 2023, 12:12pm

The column should be a list. You could convert temp_data["audio"] to a list using

temp_data["audio"].to_list()

hfreedma · October 3, 2023, 2:21pm

I had trouble with the add_column method; maybe it has been deprecated since this post?
However, it is possible to create a Dataset directly from a Python dictionary using the Dataset.from_dict method. With this, you can add a column to a Dataset by extracting the existing Dataset columns into a Python dictionary, updating the dictionary with the desired column, and then re-creating the Dataset object with the additional column(s).
