Adding aggregation to TAPAS

I want to add a “portion” operator that calculates the count of the selected cells divided by the number of rows in the table (count / total), so that questions like “what is the portion of players with more than 10 points?” get an answer such as “0.32”.
The count result is already calculated in the _calculate_expected_result method in modeling_tapas.py:
scaled_probability_per_cell = (scaled_probability_per_cell / numeric_values_scale) * input_mask_float
count_result = torch.sum(scaled_probability_per_cell, dim=1)
However, I’m not sure how to continue from here. How do I get the number of cells (I thought it is the size of scaled_probability_per_cell, but I’m not sure that is correct for small tables)? How do I add the result to the loss? Etc.

Make sure to add it to all_results here, as follows:

all_results = torch.cat(
        [
            torch.unsqueeze(sum_result, dim=1),
            torch.unsqueeze(average_result, dim=1),
            torch.unsqueeze(count_result, dim=1),
            torch.unsqueeze(portion_result, dim=1)
        ],
        dim=1,
    )
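
For context, roughly how I understand _calculate_expected_result: all_results is then weighted by the probabilities of the aggregation operators and summed into the expected result that enters the regression loss, along the lines of:

expected_result = torch.sum(all_results * aggregation_op_only_probs, dim=1)  # weight each operator's result by its predicted probability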

This is the only thing you need to do; it will be added to the loss automatically. To compute portion_result, you can do:

portion_result = count_result / some dimension of scaled_probability_per_cell

Can you print out the shape of scaled_probability_per_cell?

The shape is (32, 512), i.e. (batch_size, sequence_length). However, it will be the same regardless of the number of rows in the table. This is the main problem that I have.

So you want to know the number of rows for every example in the batch? You can probably take the max of the unique row IDs in the token type ids created by TapasTokenizer. Small example:

from transformers import TapasTokenizer
import pandas as pd

model_name = 'google/tapas-base-finetuned-wtq'
tokenizer = TapasTokenizer.from_pretrained(model_name)

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
table = pd.DataFrame.from_dict(data)
inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")

row_ids = inputs["token_type_ids"][:, :, 2]  # index 2 of the token type ids holds the row IDs

You can easily get the unique row IDs for every example in the batch, and then take the max:

import torch

# convert to a float tensor so it can directly divide count_result (move it to the model's device if needed)
nrows = torch.tensor([torch.max(torch.unique(example)).item() for example in row_ids], dtype=torch.float)
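
As a quick sanity check, for the three queries over the three-row example table above this should give the table's row count for every example:

print(nrows)  # expected: tensor([3., 3., 3.])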

You can pass this variable to the _calculate_expected_result method, so that you can compute the portion result as:

portion_result = count_result / nrows
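
To make the shapes concrete, here is a tiny made-up example (count_result stands in for the expected count computed in _calculate_expected_result, nrows comes from the token type IDs as above; the numbers are invented):

import torch

count_result = torch.tensor([8.0, 2.5])  # expected number of selected cells per example (made up)
nrows = torch.tensor([25.0, 5.0])        # number of rows of each example's table (made up)

portion_result = count_result / nrows    # elementwise division: tensor([0.3200, 0.5000])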

Thank you!
I missed the unique row IDs (inputs["token_type_ids"][:, :, 2]) as an indicator of the table size.