How to filter Datasets object?

I have a Hugging Face Dataset object and I want to filter it as I would a pandas DataFrame:

train[train['language']=="English"]

‘language’ is one of the features in the train split.
I also tried:

train.select(train['language']=='English')

and got the error:
TypeError: 'bool' object is not iterable
Boolean masking fails in the same way, with the same error:

is_english = dataset['train']['language'] == 'English'
eng_convos = dataset['train'].select(is_english)
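For context on why both attempts above raise the same TypeError, here is a minimal sketch in plain Python. It assumes the column access returns an ordinary Python list, which the error message suggests but which may differ between versions of the library. The `languages` list is a made-up stand-in for `train['language']`.

```python
# Hypothetical stand-in for train['language'], assumed to be a plain Python list.
languages = ["English", "French", "English", "German"]

# Comparing a whole list to a string yields ONE bool, not an element-wise mask
# (unlike a pandas Series or NumPy array).
mask = languages == "English"
print(mask)  # False

# select() expects an iterable of row indices, so build the indices explicitly.
english_indices = [i for i, lang in enumerate(languages) if lang == "English"]
print(english_indices)  # [0, 2]
```

A list of indices like `english_indices` is the kind of argument `select` is designed to take, whereas a single `False` is not iterable, hence the error.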

Or a list comprehension:

eng_convos = [dataset['train']['conversation'] for dataset['train']['conversation'] in dataset['train'] if dataset['train']['language']=='English']

I got:
TypeError: 'Dataset' object does not support item assignment

Lastly, I tried following the advice from the discussion post "Filtering Dataset":

import pyarrow as pa

import pyarrow.compute as compute

table = dataset.data

flags = compute.is_in(train['language'], value_set=pa.array(['English'], pa.string()))

filtered_table = train.filter(flags)

filtered_table.to_pandas()

and got the error:
557 # apply actual function
--> 558 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
559 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
560 # re-apply format to the output

File c:\Users\Admin\HC3 EDA.venv\Lib\site-packages\datasets\fingerprint.py:482, in fingerprint_transform.<locals>._fingerprint.<locals>.wrapper(*args, **kwargs)
478 validate_fingerprint(kwargs[fingerprint_name])
480 # Call actual function
--> 482 out = func(dataset, *args, **kwargs)
484 # Update fingerprint of in-place transforms + update in-place history of transforms

6260 else:
6261 # inputs is a list of columns
6262 columns: List[List] = inputs

TypeError: 'pyarrow.lib.BooleanArray' object is not callable

All I want to do is filter my HF Datasets object as I would a pandas DataFrame, i.e. return only the rows where a specific column value meets my condition.

Found a solution:

  1. Converted the Datasets object to a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(dataset['train'])
  2. Filtered it as usual:
english_only = df[df['language']=="English"]
