Adding aggregation to TAPAS

I want to add a “portion” operator that calculates the count of the selected cells divided by the number of rows in the table (count / total), so that questions like “what is the portion of players with more than 10 points?” get an answer such as “0.32”.
The count result is already calculated in the _calculate_expected_result method in modeling_tapas.py:
scaled_probability_per_cell = (scaled_probability_per_cell / numeric_values_scale) * input_mask_float
count_result = torch.sum(scaled_probability_per_cell, dim=1)
However, I’m not sure how to continue from here. How do I get the number of cells (I thought it is the size of scaled_probability_per_cell, but I’m not sure that is correct for small tables)? How do I add the result to the loss? Etc.

Make sure to add it to all_results here, as follows:

all_results = torch.cat(
        [
            torch.unsqueeze(sum_result, dim=1),
            torch.unsqueeze(average_result, dim=1),
            torch.unsqueeze(count_result, dim=1),
            torch.unsqueeze(portion_result, dim=1)
        ],
        dim=1,
    )
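
For context, roughly how I understand _calculate_expected_result: all_results is then weighted by the probabilities of the aggregation operators and summed into the expected result that enters the regression loss, along the lines of:

expected_result = torch.sum(all_results * aggregation_op_only_probs, dim=1)  # weight each operator's result by its predicted probability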

This is the only thing you need to do; it will be added to the loss automatically. To compute portion_result, you can do:

portion_result = count_result / some dimension of scaled_probability_per_cell

Can you print out the shape of scaled_probability_per_cell?

The shape is (32, 512), i.e. (batch_size, sequence_length). However, it will be the same regardless of the number of rows in the table. This is the main problem that I have.

So you want to know the number of rows for every example in the batch? You can probably take the max of the unique row IDs in the token type ids created by TapasTokenizer. Small example:

from transformers import TapasTokenizer
import pandas as pd

model_name = 'google/tapas-base-finetuned-wtq'
tokenizer = TapasTokenizer.from_pretrained(model_name)

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
table = pd.DataFrame.from_dict(data)
inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")

row_ids = inputs["token_type_ids"][:, :, 2]  # index 2 of the token type ids holds the row IDs

You can easily get the unique row IDs for every example in the batch, and then take the max:

import torch

# convert to a float tensor so it can directly divide count_result (move it to the model's device if needed)
nrows = torch.tensor([torch.max(torch.unique(example)).item() for example in row_ids], dtype=torch.float)
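
As a quick sanity check, for the three queries over the three-row example table above this should give the table's row count for every example:

print(nrows)  # expected: tensor([3., 3., 3.])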

You can pass this variable to the _calculate_expected_result method, so that you can compute the portion result as:

portion_result = count_result / nrows
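
To make the shapes concrete, here is a tiny made-up example (count_result stands in for the expected count computed in _calculate_expected_result, nrows comes from the token type IDs as above; the numbers are invented):

import torch

count_result = torch.tensor([8.0, 2.5])  # expected number of selected cells per example (made up)
nrows = torch.tensor([25.0, 5.0])        # number of rows of each example's table (made up)

portion_result = count_result / nrows    # elementwise division: tensor([0.3200, 0.5000])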

Thank you!
I missed the unique row IDs (inputs["token_type_ids"][:, :, 2]) as an indicator of the table size.