I have a Hugging Face dataset object and I want to filter it the way I would filter a pandas DataFrame:
train[train['language']=="English"]
‘language’ is one of the features in the train split.
I also tried:
train.select(train['language']=='English')
and got the error:
TypeError: 'bool' object is not iterable
Similarly, I tried boolean masking and got the same error:
is_english = dataset['train']['language'] == 'English'
eng_convos = dataset['train'].select(is_english)
Or a list comprehension:
eng_convos = [dataset['train']['conversation'] for dataset['train']['conversation'] in dataset['train'] if dataset['train']['language']=='English']
I got:
TypeError: 'Dataset' object does not support item assignment
Lastly, I tried following the advice from this discussion post, Filtering Dataset:
import pyarrow as pa
import pyarrow.compute as compute
table = dataset.data
flags = compute.is_in(train['language'], value_set=pa.array(['English'], pa.string()))
filtered_table = train.filter(flags)
filtered_table.to_pandas()
and got the error:
557 # apply actual function
--> 558 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
559 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
560 # re-apply format to the output
File c:\Users\Admin\HC3 EDA.venv\Lib\site-packages\datasets\fingerprint.py:482, in fingerprint_transform.<locals>._fingerprint.<locals>.wrapper(*args, **kwargs)
478 validate_fingerprint(kwargs[fingerprint_name])
480 # Call actual function
--> 482 out = func(dataset, *args, **kwargs)
484 # Update fingerprint of in-place transforms + update in-place history of transforms
484 # Update fingerprint of in-place transforms + update in-place history of transforms
…
6260 else:
6261 # inputs is a list of columns
6262 columns: List[List] = inputs
TypeError: 'pyarrow.lib.BooleanArray' object is not callable