How to add a new column to a dataset

How can I add a new column to my dataset?

I am working on the Cosmos QA dataset and need to add a new column of the following format:
Value(dtype='string', id=None)

The current dataset has the following features:
features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
num_rows: 25262

Thank you in advance!

Hi! You can use the add_column method:

from datasets import load_dataset

ds = load_dataset("cosmos_qa", split="train")

new_column = ["foo"] * len(ds)
ds = ds.add_column("new_column", new_column)

and you get a dataset

    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'new_column'],
    num_rows: 25262

Thank you for your reply!

Unfortunately, when I ran the above code, I got the following error:
AttributeError: 'Dataset' object has no attribute 'add_column'

Just FYI, Datasets version = 1.1.2

Thank you!

You need to update your datasets library to the latest version for this.


Can you provide more details on this? I am running into the same problem as @chetnakhanna16

On your machine, run the command pip install datasets --upgrade to update your datasets library to the latest version.

Hope that helps!
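If you're not sure whether the upgrade took effect, you can compare datasets.__version__ against the release you need. A tiny pure-Python helper for that (the 1.7.0 threshold is my assumption about when add_column landed; check the release notes for your install):

```python
def version_tuple(v: str):
    """Parse a simple version string like '1.1.2' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

# The thread above reports 1.1.2, which predates add_column
# (1.7.0 is an assumption -- verify against the datasets changelog).
needs_upgrade = version_tuple("1.1.2") < version_tuple("1.7.0")
```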


Here is my current datasets version, which is the same after running the suggested command. The problem still persists.

In essence I am trying to find a way to add the input_ids and attention_mask back into my dataset.

The reason why I am doing this, is that when running this function:

from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

on a sample of my dataset → print(tokenize(clean_dataset["train"][:2])) I get the following error:
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

However, when I run complete_tok = tokenizer(list(x_complete), truncation=True, padding=True), where x_complete is a NumPy array, the tokenizer runs fine and creates input_ids and attention_mask.
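That ValueError is essentially a shape check on the text input. Here is a pure-Python sketch of the rule it describes (my approximation of the error message, not the tokenizer's actual code): a str, a list of str, or a list of lists of str are accepted, and anything else, including a dict of columns like dataset[:2] returns, is rejected.

```python
def is_valid_text_input(text):
    """Approximate the shape check behind the tokenizer's ValueError:
    str, List[str], or List[List[str]] are accepted."""
    if isinstance(text, str):
        return True
    if isinstance(text, list):
        return all(
            isinstance(t, str)
            or (isinstance(t, list) and all(isinstance(s, str) for s in t))
            for t in text
        )
    return False
```

For example, a dict of columns (what slicing a dataset returns) fails this check, which is why passing batch["text"] rather than the whole batch matters.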

Can you verify that clean_dataset["train"][:2] is indeed of type str?


Found my mistake → I needed to cast the clean_Description batch values to a list; the code runs now. Thank you :disguised_face: :nerd_face:

def tokenize(batch):
    return tokenizer(list(batch["clean_Description"].values), truncation=True, padding=True)
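To get the input_ids and attention_mask back into the dataset (the original goal above), the usual route is ds.map(tokenize, batched=True): map feeds dict-of-lists batches to the function and merges the returned columns in. A toy pure-Python sketch of that merge behavior (an illustration, not the datasets implementation):

```python
def map_batched(rows, fn, batch_size=1000):
    """Toy analogue of Dataset.map(batched=True) over a dict of columns.

    rows: dict mapping column name -> list of values (a dict-of-lists "batch").
    fn:   takes a batch in that shape and returns new columns to merge in.
    """
    n = len(next(iter(rows.values())))
    out = {k: list(v) for k, v in rows.items()}
    for start in range(0, n, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in rows.items()}
        # New keys returned by fn are accumulated as new columns
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out

# Stand-in for the real tokenize(): returns a per-example "token count"
fake_tokenize = lambda batch: {
    "n_tokens": [len(t.split()) for t in batch["clean_Description"]]
}
```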