Dataset map and flatten

Hey guys,

After dataset map transformation, I have a new dataset with the following features:

{'data': {'is_random_next': Value(dtype='bool', id=None),
'tokens_a': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'tokens_b': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}}

The wrapping data column is useless, so after invoking flatten, I get:

{'data.is_random_next': Value(dtype='bool', id=None),
'data.tokens_a': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'data.tokens_b': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

which is exactly what I want. After renaming columns, I am done.
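
For reference, here is a minimal sketch of that flatten-then-rename step on a toy dataset (the target column names and the rename_columns mapping are my assumptions):

from datasets import Dataset

# Toy dataset with the same nested "data" struct as above
ds = Dataset.from_dict({
    "data": [{"is_random_next": False, "tokens_a": [1, 2], "tokens_b": [3, 4]}]
})

ds = ds.flatten()  # columns become data.is_random_next, data.tokens_a, data.tokens_b
ds = ds.rename_columns({
    "data.is_random_next": "is_random_next",
    "data.tokens_a": "tokens_a",
    "data.tokens_b": "tokens_b",
})
print(ds.column_names)  # ['is_random_next', 'tokens_a', 'tokens_b']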

Since map forces us to return a dict, I had to wrap my list of dict values (is_random_next, tokens_a, tokens_b) in a dict to comply. How could I avoid that?

Quentin @lhoestq I got the wikipedia/bookcorpus dataset processing to be super fast, and everything works as advertised. I was wondering if I could somehow bypass the kludge I have now for dataset flattening - I am really curious whether it could be done without it?

Hi! Do you create those fields using a map function? If so, maybe you can just edit the map function to return actual columns instead of fields nested inside the data column. Instead of returning

{"data": {"is_random_next":..., "tokens_a":..., "tokens_b":...}}

you can just return

{"is_random_next":..., "tokens_a":..., "tokens_b":...}

Hi, yes I do, but I have all those dicts

{"is_random_next":..., "tokens_a":..., "tokens_b":...}

in a list. They are input samples for the model. I am transforming the wikipedia dataset: the input is a Wikipedia document and the output is a list of input sequences.

Using a batched map you can return more samples than you have in the input. You can use that to return several input sequences per Wikipedia article.
Let's say you have a function prepare that takes a Wikipedia article's text and returns a list of dictionaries like

{"is_random_next":..., "tokens_a":..., "tokens_b":...}

Then what you can do with your dataset full of Wikipedia articles is

def batched_prepare(articles):
    # One flat list per output column; with batched=True these lists may
    # contain more samples than there were input articles.
    is_random_next = []
    tokens_a = []
    tokens_b = []
    for text in articles["text"]:
        # prepare() returns a list of dicts, one per input sequence
        for sample in prepare(text):
            is_random_next.append(sample["is_random_next"])
            tokens_a.append(sample["tokens_a"])
            tokens_b.append(sample["tokens_b"])
    return {"is_random_next": is_random_next, "tokens_a": tokens_a, "tokens_b": tokens_b}

prepared_dataset = dataset.map(batched_prepare, batched=True, remove_columns=dataset.column_names)
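
The returned keys become flat, top-level columns, so no flatten or rename step should be needed afterwards:

print(prepared_dataset.column_names)
# ['is_random_next', 'tokens_a', 'tokens_b']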

Let me know if it helps


Ok, gotcha. Excellent, thank you!