How to add a new column to a dataset

How can I add a new column to my dataset?

I am working on the Cosmos QA dataset and need to add a new column of the following format:
Value(dtype='string', id=None)

The current dataset has the following features:
features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
num_rows: 25262

Thank you in advance!

Hi! You can use the add_column method:

from datasets import load_dataset

ds = load_dataset("cosmos_qa", split="train")

new_column = ["foo"] * len(ds)
ds = ds.add_column("new_column", new_column)

and you get a dataset

    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'new_column'],
    num_rows: 25262

Thank you for your reply!

Unfortunately, when I ran the above code, I got the following error:
AttributeError: 'Dataset' object has no attribute 'add_column'

Just FYI, Datasets version = 1.1.2

Thank you!

You need to update your datasets library to the latest version for this.

Can you provide more details on this? I am running into the same problem as @chetnakhanna16.

On your machine, run the command pip install datasets --upgrade to update your datasets library to the latest version.

Hope that helps!
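
If the upgrade does not seem to take effect, it can help to check which version is actually installed in the environment your code runs in. Below is a minimal sketch using only the standard library; installed_version is a hypothetical helper name, not part of the datasets library. Also note that in a notebook, a module that was already imported stays at the old version until the kernel is restarted.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version pip installed into *this* Python environment,
# or None if the package is missing from it.
print("datasets version:", installed_version("datasets"))
```

If this prints None or an old version, pip probably upgraded a different Python environment than the one running your code.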

Here is my current datasets version, which is the same after running the suggested command. The problem still persists.

In essence I am trying to find a way to add the input_ids and attention_mask back into my dataset.

The reason I am doing this is that when I run this function:

from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

on a sample of my dataset, print(tokenize(clean_dataset["train"][:2])), I get the following error:
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

However, when I run complete_tok = tokenizer(list(x_complete), truncation=True, padding=True), where x_complete is a NumPy array, the tokenizer runs fine and creates input_ids and attention_mask.

Can you verify that clean_dataset["train"][:2] is indeed of type str?
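
One way to check this is to classify the value you pass to the tokenizer against the three shapes named in the error message. This is only a rough sketch: describe_text_input is a hypothetical helper written for illustration, not the check the tokenizer itself performs.

```python
def describe_text_input(value):
    """Label `value` using the three input shapes listed in the error message."""
    if isinstance(value, str):
        return "str"
    if isinstance(value, (list, tuple)):
        # A batch of plain strings.
        if all(isinstance(v, str) for v in value):
            return "List[str]"
        # A batch of pretokenized examples (each one a list of strings).
        if all(
            isinstance(v, (list, tuple)) and all(isinstance(t, str) for t in v)
            for v in value
        ):
            return "List[List[str]]"
    # Anything else (arrays, Series, lists containing None, ...) is reported by type name.
    return "unsupported: " + type(value).__name__


print(describe_text_input(["first text", "second text"]))  # List[str]
print(describe_text_input(["first text", None]))           # unsupported: list
```

Calling it on the column you tokenize, for example describe_text_input(batch["text"]), shows whether the value is a plain list of strings or some other container that needs converting first.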

Found my mistake → needed to cast the clean_Description batch values to a list, code runs now. Thank you :disguised_face: :nerd_face:

def tokenize(batch):
    return tokenizer(list(batch["clean_Description"].values), padding=True, truncation=True)

Hi guys, I am new here and just started using :hugs:. For those who, like me, are facing the same issue: I think the error occurs because the dataset type is 'dict' (a dict of splits rather than a single Dataset), and that is why it gives AttributeError: 'Dataset' object has no attribute 'add_column'

My dataset structure is Dataset({
train: ['index', 'text', 'file'],
num_rows: 5000
})

So this is how I solved it

my_data["train"] = my_data["train"].add_column("audio", temp_data['audio'])

Hope I am not doing anything wrong.
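
For anyone who wants to reuse this pattern, here is a small sketch that wraps the same per-split assignment in a function and checks the column length first. add_column_to_split is a hypothetical helper name; it assumes only that the object behaves like a dict of splits and that each split has an add_column method, as in the posts above.

```python
def add_column_to_split(dataset_dict, split, name, values):
    """Add a column to one split of a dict of datasets and return the same dict."""
    ds = dataset_dict[split]
    # The new column needs exactly one value per row of the split.
    if len(values) != len(ds):
        raise ValueError(
            f"expected {len(ds)} values for split {split!r}, got {len(values)}"
        )
    # add_column returns a new dataset, so it has to be assigned back to the split.
    dataset_dict[split] = ds.add_column(name, values)
    return dataset_dict
```

With the names from the post above, the call would be my_data = add_column_to_split(my_data, "train", "audio", temp_data["audio"]).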