I get the error:
Exception has occurred: AttributeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
'Dataset' object has no attribute 'take'
File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 499, in experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights
batch = dataset.take(batch_size)
File "/lfs/ampere1/0/brando9/beyond-scale-language-data-diversity/src/diversity/div_coeff.py", line 552, in <module>
experiment_compute_diveristy_coeff_single_dataset_then_combined_datasets_with_domain_weights()
File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
AttributeError: 'Dataset' object has no attribute 'take'
This only happens with streaming=False. How can I fix it? I do NOT want to stream the data, and I want .take to work.
Idea 1: Convert the HF Dataset to an HF datasets.iterable_dataset.IterableDataset
This seems to have worked:
print(f'{dataset=}')
print(f'{type(dataset)=}')
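# the type is datasets.iterable_dataset.IterableDataset when streaming, datasets.arrow_dataset.Dataset otherwise
from datasets import IterableDataset
# Dataset.to_iterable_dataset() is the documented conversion in recent versions of datasets; IterableDataset(dataset) is not a supported way to call the constructor
dataset = dataset if isinstance(dataset, IterableDataset) else dataset.to_iterable_dataset()  # to force dataset.take(batch_size) to work in non-streaming mode
batch = dataset.take(batch_size)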
It seems to work, but fetching the batch takes a long time.
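An alternative that avoids converting the dataset at all is to select the first rows by index when the object is a map-style Dataset. The helper below is only a sketch: take_first is a hypothetical name, not part of the datasets library, and it assumes the object exposes either .take(n) (as IterableDataset does) or .select(indices) plus len() (as a non-streaming Dataset does).

```python
def take_first(dataset, n):
    """Return the first n examples of a Hugging Face dataset-like object.

    take_first is a hypothetical helper, not a datasets API. It relies only on
    duck typing: .take(n) if the object has it, otherwise .select(indices).
    """
    if hasattr(dataset, "take"):
        # Streaming case: IterableDataset.take(n) lazily yields the first n examples.
        return dataset.take(n)
    # Non-streaming case: Dataset.select(indices) returns a new dataset with those rows.
    # min() guards against asking for more rows than the dataset contains.
    return dataset.select(range(min(n, len(dataset))))
```

With this in place, the failing line could be written as batch = take_first(dataset, batch_size), and it should behave the same with streaming=True and streaming=False. Because select works on row indices rather than iterating, it should not pay the cost of wrapping the dataset in an iterable first.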
Idea 2: Collate fn
If custom collate fns actually worked, this would be simple, since the collate fn would already receive a batch of batch_size examples. It doesn’t work with the Trainer, though; see: python - How to use huggingface HF trainer train with custom collate function? - Stack Overflow. In any case, I don’t want to use the Trainer…
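Without the Trainer, the same effect can be approximated by grouping examples into lists of batch_size and handing each list to a collate function. This is a minimal sketch in plain Python: iter_batches and its collate_fn parameter are hypothetical names, and it assumes only that the dataset can be iterated example by example (iterating a Hugging Face Dataset yields one dict per row).

```python
from itertools import islice


def iter_batches(examples, batch_size, collate_fn=list):
    """Yield collate_fn(batch) for consecutive batches of up to batch_size examples.

    iter_batches is a hypothetical helper; it works on any iterable.
    """
    it = iter(examples)
    while True:
        # islice pulls at most batch_size items from the shared iterator.
        batch = list(islice(it, batch_size))
        if not batch:
            # The iterator is exhausted, so stop the generator.
            return
        yield collate_fn(batch)
```

For example, next(iter_batches(dataset, batch_size)) would give the first batch as a list of examples, and a custom collate_fn could tokenize or stack them. The last batch may be shorter than batch_size when the dataset size is not a multiple of it.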
Related GitHub issue: Allow dataset implement .take · Issue #6150 · huggingface/datasets · GitHub
Cross-posted on Stack Overflow: How does one make dataset.take(512) work with streaming = False with hugging face data set? - Stack Overflow