which is exactly what I want. After renaming columns, I am done.
Since map forces us to return a dict, I need to wrap the list of dict values (is_random_next, tokens_a, tokens_b) in a dict to comply, so that is what I did. How could I avoid that?
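A minimal sketch (plain Python, no `datasets` import) of the wrapping kludge described above; the field values are illustrative, not the actual preprocessing. With a plain per-example map, the return value must be a dict, so the list of per-sequence dicts ends up wrapped in a single nested column:

```python
def prepare_wrapped(article):
    # Hypothetical per-sequence records built from one article.
    records = [
        {"is_random_next": False, "tokens_a": ["a"], "tokens_b": ["b"]},
        {"is_random_next": True, "tokens_a": ["c"], "tokens_b": ["d"]},
    ]
    # Nested "data" column -> requires flattening afterwards.
    return {"data": records}

def prepare_columns(article):
    # Returning one list per column avoids the nesting entirely;
    # this is the shape a batched `map` expects.
    records = prepare_wrapped(article)["data"]
    return {key: [r[key] for r in records] for key in records[0]}
```

The second function produces real columns (`is_random_next`, `tokens_a`, `tokens_b`) directly, so no flatten step is needed afterwards.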
Quentin @lhoestq I got the wikipedia/bookcorpus dataset processing to be super fast, and everything works as advertised. I was wondering if I could somehow bypass the kludge I currently have for dataset flattening; I am really curious whether it could be done without it.
Hi ! Do you create those fields using a map function ? If so, maybe you can just edit the map function to return actual columns instead of fields inside the data column. Instead of returning
in a list. They are input samples for the model. I am transforming the Wikipedia dataset: the input is a Wikipedia document and the output is a list of input sequences.
Using a batched map you can return more samples than you have in the input. You can use that to return several input sequences per Wikipedia article.
Let’s say you have a function prepare that takes a Wikipedia article’s text and returns a list of dictionaries like
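A hedged sketch of what such a `prepare` function and its batched-map wrapper might look like; the field names and the sentence-pair splitting logic here are illustrative assumptions, not the actual preprocessing:

```python
def prepare(article_text):
    # Turn one article's text into several input sequences,
    # one dict per training sample (illustrative splitting logic).
    sentences = article_text.split(". ")
    return [
        {"tokens_a": a.split(), "tokens_b": b.split(), "is_random_next": False}
        for a, b in zip(sentences, sentences[1:])
    ]

def to_samples(batch):
    # With batched=True, the mapped function receives a dict of lists
    # and may return MORE rows than it got: every article expands into
    # several samples, expressed column-wise (one list per column).
    all_records = [rec for text in batch["text"] for rec in prepare(text)]
    return {key: [r[key] for r in all_records] for key in all_records[0]}

# Usage with the datasets library would be roughly:
#   dataset.map(to_samples, batched=True, remove_columns=dataset.column_names)
```

Because `to_samples` returns plain columns, the result needs no flattening, and `remove_columns` drops the original article fields in the same pass.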