I have a dataset that is multi-label in nature. The explanatory feature is an image, and the labels are the target.
An example of the classes:
classes = ["Smears", "Loaded Language", "Name calling/Labeling", "Glittering generalities (Virtue)",
           "Appeal to (Strong) Emotions", "Appeal to fear/prejudice", "Transfer", "Doubt",
           "Exaggeration/Minimisation", "Whataboutism", "Slogans", "Flag-waving",
           "Misrepresentation of Someone's Position (Straw Man)", "Causal Oversimplification",
           "Thought-terminating cliché", "Black-and-white Fallacy/Dictatorship", "Appeal to authority",
           "Reductio ad hitlerum", "Repetition", "Obfuscation, Intentional vagueness, Confusion",
           "Presenting Irrelevant Data (Red Herring)", "Bandwagon"]
I wanted to set this up following this pattern: https://theaisummer.com/hugging-face-vit/
The example uses datasets.Dataset from Hugging Face. I wanted to replicate that example using my data, found here: SEMEVAL-2021-task6-corpus/data at main · di-dimitrov/SEMEVAL-2021-task6-corpus · GitHub.
I loaded my data from the json file:
with open("./data/training_set_task3/training_set_task3.json", encoding="utf-8") as f:  # Training data
    training = json.load(f)
df_training = pd.DataFrame(training)
I use MultiLabelBinarizer to convert my multi-labels to one-hot encodings and load my images; an example of the dataset is below:
| | id | labels | text | image | img | label |
|---|---|---|---|---|---|---|
| 0 | 128 | [Black-and-white Fallacy/Dictatorship, Name ca… | THERE ARE ONLY TWO GENDERS\n\nFEMALE \n\nMALE\n | 128_image.png | [[[255, 255, 255], [255, 255, 255], [255, 255,… | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, … |
| 1 | 189 | [Reductio ad hitlerum, Smears, Transfer] | This is not an accident! | 189_image.png | [[[11, 15, 64], [11, 15, 64], [11, 15, 64], [1… | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
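For reference, the binarization step looks roughly like this (a minimal sketch using two rows shaped like my data; the column names match my dataframe):

```python
# Sketch of the one-hot step with scikit-learn's MultiLabelBinarizer.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

classes = ["Smears", "Loaded Language", "Name calling/Labeling", "Glittering generalities (Virtue)",
           "Appeal to (Strong) Emotions", "Appeal to fear/prejudice", "Transfer", "Doubt",
           "Exaggeration/Minimisation", "Whataboutism", "Slogans", "Flag-waving",
           "Misrepresentation of Someone's Position (Straw Man)", "Causal Oversimplification",
           "Thought-terminating cliché", "Black-and-white Fallacy/Dictatorship", "Appeal to authority",
           "Reductio ad hitlerum", "Repetition", "Obfuscation, Intentional vagueness, Confusion",
           "Presenting Irrelevant Data (Red Herring)", "Bandwagon"]

df_training = pd.DataFrame({
    "id": [128, 189],
    "labels": [
        ["Black-and-white Fallacy/Dictatorship", "Name calling/Labeling"],
        ["Reductio ad hitlerum", "Smears", "Transfer"],
    ],
})

# Fixing the column order to `classes` keeps every row a consistent 22-dim 0/1 vector.
mlb = MultiLabelBinarizer(classes=classes)
df_training["label"] = list(mlb.fit_transform(df_training["labels"]))
```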
After subsetting my dataset, here is the modeling setup:
df_train = df_train.rename(columns={"label": "labels"})
features = Features({
    "labels": ClassLabel(names=classes),
    "img": Array3D(dtype="int64", shape=(3, 32, 32)),
    "pixel_values": Array3D(dtype="float32", shape=(3, 224, 224)),
})
preprocessed_train_ds = dataset_training.map(preprocess_images, batched=True, batch_size = 8, features=features)
This last statement generates an error because I failed to account for the fact that the example uses a single label. I was unable to find an equivalent usage for multi-labels. Is this feature available? Is it coming? Any recommendations for how to handle this with this pattern? Any ideas would be greatly appreciated. Worst case, I can go back to my other methods of processing, but when I tried the example, I loved how self-contained and easy to organize everything was.
My goal is to use the Vision Transformer (ViT) model from Hugging Face to process my images. I have used Hugging Face for the text components, so I wanted to keep both techniques similar from an understandability perspective.
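For context, my preprocess_images follows the tutorial's pattern, roughly like this (I construct ViTFeatureExtractor with its defaults here just to keep the sketch self-contained; from_pretrained on the ViT checkpoint behaves the same way):

```python
import numpy as np
from transformers import ViTFeatureExtractor

# Defaults match the google/vit-base checkpoints: resize to 224x224, normalize.
feature_extractor = ViTFeatureExtractor()

def preprocess_images(examples):
    # "img" holds the raw HxWx3 pixel arrays loaded from the PNG files.
    images = [np.array(img, dtype=np.uint8) for img in examples["img"]]
    # The feature extractor returns float32 arrays of shape (3, 224, 224).
    examples["pixel_values"] = feature_extractor(images=images)["pixel_values"]
    return examples
```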
Error:
ArrowInvalid Traceback (most recent call last)
in
----> 1 preprocessed_train_ds = dataset_training.map(preprocess_images, batched=True, batch_size = 8, features=features)
~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
1663
1664 if num_proc is None or num_proc == 1:
---> 1665 return self._map_single(
1666 function=function,
1667 with_indices=with_indices,
~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py in wrapper(*args, **kwargs)
183 }
184 # apply actual function
---> 185 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
186 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
187 # re-apply format to the output
~\anaconda3\envs\pytorch\lib\site-packages\datasets\fingerprint.py in wrapper(*args, **kwargs)
395 # Call actual function
396
---> 397 out = func(self, *args, **kwargs)
398
399 # Update fingerprint of in-place transforms + update in-place history of transforms
~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc)
2032 else:
2033 batch = cast_to_python_objects(batch)
---> 2034 writer.write_batch(batch)
2035 if update_data and writer is not None:
2036 writer.finalize() # close_stream=bool(buf_writer is None)) # We only close if we are writing in a file
~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
389 typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
390 typed_sequence_examples[col] = typed_sequence
---> 391 pa_table = pa.Table.from_pydict(typed_sequence_examples)
392 self.write_table(pa_table, writer_batch_size)
393
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\table.pxi in pyarrow.lib.Table.from_pydict()
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.asarray()
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._handle_arrow_array_protocol()
~\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_writer.py in arrow_array(self, type)
96 out = pa.ExtensionArray.from_storage(type, pa.array(self.data, type.storage_dtype))
97 else:
---> 98 out = pa.array(self.data, type=type)
99 if trying_type and out[0].as_py() != self.data[0]:
100 raise TypeError(
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._sequence_to_array()
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~\anaconda3\envs\pytorch\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Could not convert [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0] with type list: tried to convert to int
Thanks,
John R.