Dataset map and flatten

Hey guys,

After dataset map transformation, I have a new dataset with the following features:

{'data': {'is_random_next': Value(dtype='bool', id=None),
'tokens_a': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'tokens_b': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}}

The wrapping data column is useless, so after invoking flatten, I get:

{'data.is_random_next': Value(dtype='bool', id=None),
'data.tokens_a': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'data.tokens_b': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

which is exactly what I want. After renaming columns, I am done.
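
For reference, here is a minimal sketch of that flatten-then-rename step on a toy dataset (the target column names and the rename_columns mapping are my assumptions):

from datasets import Dataset

# Toy dataset with the same nested "data" struct as above
ds = Dataset.from_dict({
    "data": [{"is_random_next": False, "tokens_a": [1, 2], "tokens_b": [3, 4]}]
})

ds = ds.flatten()  # columns become data.is_random_next, data.tokens_a, data.tokens_b
ds = ds.rename_columns({
    "data.is_random_next": "is_random_next",
    "data.tokens_a": "tokens_a",
    "data.tokens_b": "tokens_b",
})
print(ds.column_names)  # ['is_random_next', 'tokens_a', 'tokens_b']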

Since map forces us to return a dict, I had to wrap my list of dict values (is_random_next, tokens_a, tokens_b) in a dict to comply. How could I avoid that?

Quentin @lhoestq I got the wikipedia/bookcorpus dataset processing to be super fast, and everything works as advertised. I was wondering if I could somehow bypass the kludge I have now for dataset flattening - I am really curious whether it could be done without it?

Hi! Do you create those fields using a map function? If so, maybe you can just edit the map function to return actual columns instead of fields nested inside the data column. Instead of returning

{"data": {"is_random_next":..., "tokens_a":..., "tokens_b":...}}

you can just return

{"is_random_next":..., "tokens_a":..., "tokens_b":...}

Hi, yes I do, but I have all those dicts

{"is_random_next":..., "tokens_a":..., "tokens_b":...}

in a list. They are input samples for the model. I am transforming the wikipedia dataset: the input is a Wikipedia document and the output is a list of input sequences.

Using a batched map you can return more samples than you have in the input. You can use that to return several input sequences per Wikipedia article.
Let's say you have a function prepare that takes a Wikipedia article's text and returns a list of dictionaries like

{"is_random_next":..., "tokens_a":..., "tokens_b":...}

Then what you can do with your dataset full of Wikipedia articles is

def batched_prepare(articles):
    # One flat list per output column; with batched=True these lists may
    # contain more samples than there were input articles.
    is_random_next = []
    tokens_a = []
    tokens_b = []
    for text in articles["text"]:
        # prepare() returns a list of dicts, one per input sequence
        for sample in prepare(text):
            is_random_next.append(sample["is_random_next"])
            tokens_a.append(sample["tokens_a"])
            tokens_b.append(sample["tokens_b"])
    return {"is_random_next": is_random_next, "tokens_a": tokens_a, "tokens_b": tokens_b}

prepared_dataset = dataset.map(batched_prepare, batched=True, remove_columns=dataset.column_names)
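
The returned keys become flat, top-level columns, so no flatten or rename step should be needed afterwards:

print(prepared_dataset.column_names)
# ['is_random_next', 'tokens_a', 'tokens_b']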

Let me know if it helps


Ok, gotcha. Excellent, thank you!