How to add a new column to a dataset

How can I add a new column to my dataset?

I am working on the Cosmos QA dataset and need to add a new column with the following feature type:
Value(dtype='string', id=None)

The current dataset has the following features:
Dataset({
    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label'],
    num_rows: 25262
})

Thank you in advance!

Hi! You can use the add_column method:

from datasets import load_dataset

ds = load_dataset("cosmos_qa", split="train")

new_column = ["foo"] * len(ds)
ds = ds.add_column("new_column", new_column)

and you get a dataset with the extra column:

Dataset({
    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'new_column'],
    num_rows: 25262
})

Thank you for your reply!

Unfortunately, when I ran the above code, I got the following error:
AttributeError: 'Dataset' object has no attribute 'add_column'

Just FYI, Datasets version = 1.1.2

Thank you!

You need to update your datasets library to the latest version for this.

Can you provide more details on this? I am running into the same problem as @chetnakhanna16.

On your machine, you need to run the command pip install datasets --upgrade to update your datasets library to the latest version.

Hope that helps!

Here is my current datasets version:
datasets==2.0.0
It is the same after running the suggested command, and the problem still persists.

In essence I am trying to find a way to add the input_ids and attention_mask back into my dataset.

The reason I am doing this is that when I run this function:

from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    # Tokenize a batch of texts, padding to the longest one and truncating long inputs
    return tokenizer(batch["text"], padding=True, truncation=True)

on a sample of my dataset, print(tokenize(clean_dataset["train"][:2])), I get the following error:
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

However, when I run complete_tok = tokenizer(list(x_complete), truncation=True, padding=True), where x_complete is a NumPy array, the tokenizer runs fine and creates input_ids and attention_mask.

Can you verify that clean_dataset["train"][:2] is indeed of type str?

Found my mistake: I needed to cast the clean_Description batch values to a list. The code runs now. Thank you!

def tokenize(batch):
    return tokenizer(list(batch["clean_Description"].values), 
                     padding=True, 
                     truncation=True)

Hi guys, I am new here and just started using 🤗. For those who, like me, are facing the same issue: I think the error occurs because the dataset type is 'dict' (a dict of splits rather than a single split), and that's why it gives AttributeError: 'Dataset' object has no attribute 'add_column'

My dataset structure is:
Dataset({
    train: ['index', 'text', 'file'],
    num_rows: 5000
})

So this is how I solved it:

my_data["train"] = my_data["train"].add_column("audio", temp_data["audio"])

Hope I am not doing anything wrong.

The column should be a list. You could convert temp_data["audio"] to a list using:

temp_data["audio"].to_list()

I had trouble with the add_column method. Maybe it has been deprecated since this post?
However it is possible to create a Dataset directly from a Python dictionary using the Dataset.from_dict method. Using this it is possible to add a column to a Dataset by extracting existing Dataset columns into a Python dictionary, updating the dictionary with the desired column, then re-creating the Dataset object with the additional column(s).