I’m trying to use the TAPAS model (specifically, google/tapas-base) to generate an embedding for a table from Wikipedia. I have data that looks like this:
table_data = {
    'column_header': ['name', 'nationality', 'birth_date', 'article_title', 'occupation'],
    'content': ['walter extra', 'german', '1954', 'walter extra\n', 'aircraft designer and manufacturer'],
}
And I’m generating a representation like this:
import pandas as pd
import torch
import transformers

tokenizer = transformers.TapasTokenizer.from_pretrained("google/tapas-base")
model = transformers.TapasModel.from_pretrained("google/tapas-base")
model.to(device)

# Build a one-row DataFrame; the TAPAS tokenizer expects every cell to be a string
df = pd.DataFrame([table_data['content']], columns=table_data['column_header']).astype(str)

inputs = tokenizer(
    table=df,
    padding="max_length",
    return_tensors="pt",
).to(device)

with torch.no_grad():
    output = model(**inputs)
encoding = output.pooler_output.squeeze(dim=0).cpu()
I’m also getting the following warning:
TAPAS is a question answering model but you have not passed a query. Please be aware that the model will probably not behave correctly.
but I think that’s ok, since I just want to use TAPAS to generate an embedding for the table; I’m not doing question answering.
However, I’m observing that the generated representation (encoding in my code) is not very useful; it’s actually significantly less useful for my task than just passing the table to bert-base as a string. That doesn’t seem right, so I suspect I’m using TAPAS incorrectly somehow, but I can’t see what I’m doing wrong. Can someone familiar with TAPAS take a look at my usage and let me know if anything looks out of order?