Multi-label classification using datasets.Dataset

I have a dataset that is multi-label in nature. The explanatory feature is an image, and the labels are the target.

An example of the classes:
classes = ['Smears', 'Loaded Language', 'Name calling/Labeling', 'Glittering generalities (Virtue)',
           'Appeal to (Strong) Emotions', 'Appeal to fear/prejudice', 'Transfer', 'Doubt',
           'Exaggeration/Minimisation', 'Whataboutism', 'Slogans', 'Flag-waving',
           "Misrepresentation of Someone's Position (Straw Man)", 'Causal Oversimplification',
           'Thought-terminating cliché', 'Black-and-white Fallacy/Dictatorship', 'Appeal to authority',
           'Reductio ad hitlerum', 'Repetition', 'Obfuscation, Intentional vagueness, Confusion',
           'Presenting Irrelevant Data (Red Herring)', 'Bandwagon']

I wanted to set this up following this pattern: https://theaisummer.com/hugging-face-vit/

The example uses datasets.Dataset from Hugging Face. I wanted to replicate the same example using my data, found here: SEMEVAL-2021-task6-corpus/data at main · di-dimitrov/SEMEVAL-2021-task6-corpus · GitHub.

I loaded my data from the JSON file:

import json
import pandas as pd

with open('./data/training_set_task3/training_set_task3.json', encoding='utf-8') as f:  # training data
    training = json.load(f)

df_training = pd.DataFrame(training)

I use MultiLabelBinarizer to convert my multi-labels to one-hot encodings and load my images; an example of the dataset is below:

   id   labels                                              text                                            image          img                                                 label
0  128   [Black-and-white Fallacy/Dictatorship, Name ca...  THERE ARE ONLY TWO GENDERS\n\nFEMALE\n\nMALE\n  128_image.png  [[[255, 255, 255], [255, 255, 255], [255, 255...  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
1  189   [Reductio ad hitlerum, Smears, Transfer]           This is not an accident!                        189_image.png  [[[11, 15, 64], [11, 15, 64], [11, 15, 64], [1...  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
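The binarization step itself isn't shown above; a minimal sketch of it, assuming scikit-learn's MultiLabelBinarizer and the column names in the frame above:

from sklearn.preprocessing import MultiLabelBinarizer

# encode each list of label names as a fixed-length 0/1 vector,
# ordered by the classes list defined earlier
mlb = MultiLabelBinarizer(classes=classes)
df_training['label'] = mlb.fit_transform(df_training['labels']).tolist()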


After subsetting my dataset, here is the modeling setup:

df_train = df_train.rename(columns={'label': 'labels'})
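The DataFrame-to-Dataset conversion isn't shown in the post, but following the tutorial's pattern it would presumably be something like:

from datasets import Dataset

# wrap the pandas DataFrame in a Hugging Face Dataset so that map() can be used
dataset_training = Dataset.from_pandas(df_train)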

features = Features({
    'labels': ClassLabel(names=classes),
    'img': Array3D(dtype="int64", shape=(3, 32, 32)),
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
})
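For context, preprocess_images is not reproduced in the post; reconstructed roughly from the linked tutorial (the checkpoint name is an assumption):

from transformers import ViTFeatureExtractor
import numpy as np

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')

def preprocess_images(examples):
    # convert each raw image to a uint8 array and move channels first
    images = [np.array(image, dtype=np.uint8) for image in examples['img']]
    images = [np.moveaxis(image, source=-1, destination=0) for image in images]
    # the ViT feature extractor resizes/normalizes and returns pixel_values
    examples['pixel_values'] = feature_extractor(images=images)['pixel_values']
    return examples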

preprocessed_train_ds = dataset_training.map(preprocess_images, batched=True, batch_size = 8, features=features)

This last statement generates an error because I failed to account for the fact that the example uses a single label per example. I was unable to find an equivalent usage for multi-labels. Is this feature available? Is it coming? Any recommendations for how to handle this within this pattern? Any ideas would be greatly appreciated. Worst case, I can skip this pattern and go back to the other methods of processing, but when I tried the example, I loved how self-contained and easy to organize everything was.

My goal is to use this vision transformer (ViT) model from Hugging Face to process my images. I have used Hugging Face for the text components, so I wanted to keep both techniques similar from an understandability perspective.

Error:

ArrowInvalid Traceback (most recent call last)
in
----> 1 preprocessed_train_ds = dataset_training.map(preprocess_images, batched=True, batch_size = 8, features=features)

~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
1663
1664 if num_proc is None or num_proc == 1:
--> 1665 return self._map_single(
1666 function=function,
1667 with_indices=with_indices,

~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py in wrapper(*args, **kwargs)
183 }
184 # apply actual function
--> 185 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
186 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
187 # re-apply format to the output

~\anaconda3\envs\pytorch\lib\site-packages\datasets\fingerprint.py in wrapper(*args, **kwargs)
395 # Call actual function
396
--> 397 out = func(self, *args, **kwargs)
398
399 # Update fingerprint of in-place transforms + update in-place history of transforms

~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc)
2032 else:
2033 batch = cast_to_python_objects(batch)
--> 2034 writer.write_batch(batch)
2035 if update_data and writer is not None:
2036 writer.finalize() # close_stream=bool(buf_writer is None)) # We only close if we are writing in a file

~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
389 typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
390 typed_sequence_examples[col] = typed_sequence
--> 391 pa_table = pa.Table.from_pydict(typed_sequence_examples)
392 self.write_table(pa_table, writer_batch_size)
393

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\table.pxi in pyarrow.lib.Table.from_pydict()

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.asarray()

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._handle_arrow_array_protocol()

~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_writer.py in arrow_array(self, type)
96 out = pa.ExtensionArray.from_storage(type, pa.array(self.data, type.storage_dtype))
97 else:
--> 98 out = pa.array(self.data, type=type)
99 if trying_type and out[0].as_py() != self.data[0]:
100 raise TypeError(

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._sequence_to_array()

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0] with type list: tried to convert to int

Thanks,

John R.

Hi !

The ClassLabel feature type is for single-label multi-class classification.
For multi-label classification you can use

from datasets.features import ClassLabel, Sequence
labels_type = Sequence(ClassLabel(names=classes))

and modify the function you pass to map so that it converts your list of 22 booleans to the list of True indices. For example:

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0] → [4, 12, 18]
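Concretely, a minimal sketch of that change (assuming the one-hot vectors live in the labels column):

from datasets.features import Features, ClassLabel, Sequence, Array3D

features = Features({
    'labels': Sequence(ClassLabel(names=classes)),  # multi-label: a variable-length list of class indices
    'img': Array3D(dtype="int64", shape=(3, 32, 32)),
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
})

def preprocess_images(examples):
    # ... image preprocessing as before ...
    # convert each one-hot vector to the list of True indices
    examples['labels'] = [[i for i, v in enumerate(vec) if v == 1] for vec in examples['labels']]
    return examples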

Thank you for providing that explanation. I will try that and report back. 🙂

That is a fascinating approach; I had not considered converting the labels to a list of True indices. I will need to look into this technique.

Did you get this to work? I am trying to get multi-label classification working and tried @lhoestq's technique, but I got further using my list of 0s and 1s than I did with True indices.

Do you have any sample notebooks I can look at? Thanks.

Hi, is there any update on how to correctly train the model using the sparse representation instead of the one-hot encoded vectors?

How would you fit this dataset to the Trainer?