Fine-tuning google/tapas-base-finetuned-wtq on an Italian dataset

Hi,
I would like to apply the TAPAS WTQ model to Italian data, and to do that I need to fine-tune the model on an Italian dataset.
I'm going to create a dataset for this application, but first I would like to understand how the dataset should be organized. I found this example for fine-tuning.
I tried to use it with the google/tapas-base-finetuned-wtq model, but I get the error: AttributeError: 'dict' object has no attribute 'iloc'

So my question is: how should I structure and organize the dataset to fine-tune this model for this task?

Thank you in advance for your help.

Dear all, I found out how the dataset has to be organized.

To help other people, I'm sharing my example code:

import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
           21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
           39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
    'Prodotto': ['Smartphone', 'Laptop', 'TV', 'Smartwatch', 'Cuffie wireless',
                 'Tablet', 'Fotocamera', 'Altoparlante Bluetooth', 'Console di gioco',
                 'Stampante', 'Monitor', 'Tastiera', 'Mouse', 'Router Wi-Fi', 'Proiettore',
                 'Hard disk esterno', 'Scheda grafica', 'Memoria RAM', 'SSD', 'Alimentatore',
                 'Ventola per PC', 'Adattatore USB', 'Webcam', 'Batteria portatile',
                 'Cavo HDMI', 'Cavo USB', 'Cavo di alimentazione', 'Hub USB', 'Microfono',
                 'Scheda madre', 'Ventilatore per laptop', 'Case per PC', 'Custodia per smartphone',
                 'Pellicola protettiva', 'Cover per laptop', 'Cuffie con microfono',
                 'Borsa per fotocamera', 'Tastiera wireless', 'Mouse wireless', 'Custodia per tablet',
                 'Stilo per smartphone', 'Adattatore audio', 'Penna USB', 'Supporto per smartphone',
                 'Stand per laptop', 'Adattatore di rete', 'Antenna Wi-Fi', 'Presa intelligente',
                 'Adattatore HDMI', 'Cavo di rete'],
    'Prezzo': [999, 1499, 799, 299, 149, 599, 699, 129, 399, 249, 299, 49, 29, 79, 799, 129,
               199, 79, 149, 69, 19, 9, 59, 69, 19, 9, 9, 19, 49, 29, 99, 19, 39, 29, 39,
               19, 59, 49, 19, 39, 49, 9, 29, 9, 19, 29, 39, 19, 9, 9],
    'Disponibilità': [10, 5, 8, 15, 20, 3, 6, 12, 2, 4, 7, 18, 25, 9, 1, 14, 11, 17, 22, 16,
                      13, 21, 19, 8, 7, 10, 15, 3, 6, 4, 5, 9, 12, 11, 2, 1, 3, 8, 5, 2, 6,
                      4, 7, 10, 15, 13, 9, 11, 6, 3],
    'Categoria': ['Elettronica'] * 50
}

domanda = ['Qual è il prezzo di {{Prodotto}}?', 'Quante unità di {{Prodotto}} sono disponibili?',
           'Che tipo di prodotto è {{Prodotto}}?']

risposte = ['Il prezzo di {{Prodotto}} è €{{Prezzo}}.', 'Ci sono {{Disponibilità}} unità di {{Prodotto}} disponibili.',
            '{{Prodotto}} è un/a {{Categoria}}.']

df = pd.DataFrame(data)
df = df.astype(str)

questions = []
answers = []
answer_coordinates = []

# Column order in df: ID=0, Prodotto=1, Prezzo=2, Disponibilità=3, Categoria=4.
# answer_coordinates holds, for each question, a list of 0-indexed (row, column) cells.
for index, row in df[:25].iterrows():
    for i in range(3):
        question = domanda[i].replace('{{Prodotto}}', row['Prodotto'])
        answer = (risposte[i]
                  .replace('{{Prodotto}}', row['Prodotto'])
                  .replace('{{Prezzo}}', str(row['Prezzo']))
                  .replace('{{Disponibilità}}', str(row['Disponibilità']))
                  .replace('{{Categoria}}', row['Categoria']))
        questions.append(question)
        answers.append(answer)
        if 'prezzo' in question:
            answer_coordinates.append([(index, 1), (index, 2)])
            # answer_coordinates.append([(index, 2)])
        if 'disponibili' in question:
            answer_coordinates.append([(index, 1), (index, 3)])
            # answer_coordinates.append([(index, 3)])
        if 'tipo' in question:
            answer_coordinates.append([(index, 1), (index, 4)])
            # answer_coordinates.append([(index, 4)])
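One thing worth noting (this is my assumption, based on the TapasTokenizer documentation in transformers): answer_text is expected to contain the text of the answer cells for each question rather than a full sentence, so it can be derived from the coordinates:

# Per-question answer texts: for each question, the string content of every referenced cell.
answer_text = [[df.iat[row, col] for (row, col) in coords] for coords in answer_coordinates]

print(questions[0])           # Qual è il prezzo di Smartphone?
print(answer_coordinates[0])  # [(0, 1), (0, 2)]
print(answer_text[0])         # ['Smartphone', '999']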

Now the problem is the sequence length in the tokenizer. To be more precise, this is the error I get:

Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors.
Traceback (most recent call last):
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\transformers\tokenization_utils_base.py", line 731, in convert_to_tensors
    tensor = as_tensor(value)
             ^^^^^^^^^^^^^^^^
ValueError: expected sequence of length 566 at dim 1 (got 571)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\transformers\tokenization_utils_base.py", line 747, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (input_ids in this case) have excessive nesting (inputs type list where type int is expected).
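As the error message suggests, enabling padding and truncation in the tokenizer call gives batched tensors of the same length. A minimal sketch, assuming the answer_text list of cell texts built above (the argument values follow the TAPAS fine-tuning example in the transformers documentation):

from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")

encoding = tokenizer(
    table=df,
    queries=questions,
    answer_coordinates=answer_coordinates,
    answer_text=answer_text,
    padding="max_length",  # pad every sequence to the model maximum (512 tokens)
    truncation=True,       # truncate so no sequence exceeds that limit
    return_tensors="pt",
)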

If I try to use a slice of the dataset (25 rows instead of 50), I get an error during training:

0%|          | 0/6 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\torch\utils\data\dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~^^^^^
  File "C:\Users\mamel\anaconda3\envs\PT2x\Lib\site-packages\transformers\tokenization_utils_base.py", line 247, in getitem
    raise KeyError(
KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

Hi, this error indicates that you should pass input_ids to your model as a LongTensor.
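A common way to get per-example tensors, sketched after the TAPAS fine-tuning example in the transformers documentation (the class name and batch size here are only illustrative), is to wrap the data in a torch Dataset that tokenizes one table-question pair at a time; TapasTokenizer has no fast (Rust) backend, so the BatchEncoding it returns cannot be indexed with integers by a DataLoader:

import torch
from torch.utils.data import Dataset, DataLoader

class TableDataset(Dataset):
    def __init__(self, df, questions, answer_coordinates, answer_text, tokenizer):
        self.df = df
        self.questions = questions
        self.answer_coordinates = answer_coordinates
        self.answer_text = answer_text
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        # Tokenize a single table-question pair so every item is its own dict of tensors.
        encoding = self.tokenizer(
            table=self.df,
            queries=self.questions[idx],
            answer_coordinates=self.answer_coordinates[idx],
            answer_text=self.answer_text[idx],
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        # Drop the batch dimension added by return_tensors="pt".
        return {key: value.squeeze(0) for key, value in encoding.items()}

train_dataset = TableDataset(df, questions, answer_coordinates, answer_text, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=4)

For the WTQ setup (weak supervision for aggregation) the tokenizer also accepts a float_answer argument used during training; the TAPAS documentation in transformers describes when it is needed.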