How does one make dataset.take(512) work with streaming=False with a Hugging Face dataset?

I get the error:

Exception has occurred: AttributeError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
'Dataset' object has no attribute 'take'
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 499, in experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights
    batch = dataset.take(batch_size)
  File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 552, in <module>
    experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights()
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
AttributeError: 'Dataset' object has no attribute 'take'

This only happens with streaming=False. How can I fix it? I do NOT want to stream the data, and I want .take to work.


Idea: convert the HF Dataset to an HF datasets.iterable_dataset.IterableDataset

This seems to have worked:

    print(f'{dataset=}')
    print(f'{type(dataset)=}')
    # datasets.iterable_dataset.IterableDataset
    # datasets.arrow_dataset.Dataset
    dataset = IterableDataset(dataset) if type(dataset) != IterableDataset else dataset  # to force dataset.take(batch_size) to work in non-streaming mode
    batch = dataset.take(batch_size)

It seems to work, but it takes a long time to fetch a batch.


Idea 2: Collate fn

If custom collate functions actually worked, this would be simple, since the collate function would already receive a batch of batch_size examples. It doesn't work with the Trainer, though; see the Stack Overflow question "How to use huggingface HF trainer train with custom collate function?". In any case, I don't want to use the Trainer…


Related GitHub issue: Allow dataset implement .take (Issue #6150, huggingface/datasets)
Stack Overflow: How does one make dataset.take(512) work with streaming = False with hugging face data set?

Hi! You can replace .take(512) with .select(range(512))

Take hasn’t been implemented yet but will be easy to add
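
To keep one code path for both modes, the suggestion above could be wrapped in a small helper. This is only a sketch: `take_first` is a hypothetical name, not part of the `datasets` library, and it assumes the object exposes either `.take(n)` (as streaming datasets do) or `.select(indices)` plus `len()` (as non-streaming datasets do).

```python
def take_first(dataset, n):
    """Return the first n examples of a streaming or non-streaming dataset.

    Hypothetical helper, not part of the `datasets` library.
    """
    if hasattr(dataset, "take"):
        # Streaming datasets (IterableDataset) already provide .take().
        return dataset.take(n)
    # Non-streaming datasets: select the first n indices, clamped to the
    # dataset length so a too-large n does not index out of range.
    return dataset.select(range(min(n, len(dataset))))
```

With this, `batch = take_first(dataset, batch_size)` could replace `batch = dataset.take(batch_size)` without caring whether streaming is on.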


@lhoestq can we implement this, so that the code doesn't crash? It would be easier to maintain a single code base.

GitHub issue I made: Allow dataset implement .take (Issue #6150, huggingface/datasets)

I will clarify with the full code, because batch objects are also dataset objects, so your answer is slightly ambiguous. Just to clarify, you mean this:

batch = dataset.select(range(512))

right? Thanks btw! 🙂

Will range always give me the same data points? Are there any random numbers I need to generate?

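Sure, it simply selects the examples at indices 0 to 511 (the first 512 examples), so you always get the same data points and don't need to generate any random numbers.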
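
Since `range(512)` always yields indices 0 through 511, the selection is deterministic. If a random but reproducible batch is wanted instead, one option is to generate the indices with a seeded generator and pass them to `.select`. The sketch below uses only the standard library; `sample_indices` is a hypothetical helper name, and the usage line assumes `dataset` is a non-streaming dataset that supports `len()`.

```python
import random


def sample_indices(num_rows, k, seed=0):
    """Return up to k distinct row indices, reproducible for a given seed.

    Hypothetical helper, not part of the `datasets` library.
    """
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(range(num_rows), min(k, num_rows))


# Usage (assumes `dataset` is a non-streaming dataset):
# batch = dataset.select(sample_indices(len(dataset), 512, seed=0))
```

Non-streaming datasets also have a `shuffle(seed=...)` method, so `dataset.shuffle(seed=0).select(range(512))` is another way to get a reproducible random subset.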
