Applying Tapas/TableQuestionAnswering pipelines to a CSV via pandas?

Hi guys! :wave:

Great work on Tapas and the v4.1.1 release! :raised_hands:

Is there any guidance on how to apply this pipeline to dataframes loaded with pandas.read_csv?

Thanks,
Charly

Hello! Here’s how I would set up a pipeline with a pd.DataFrame:

from transformers import pipeline
import pandas as pd

tqa_pipeline = pipeline("table-question-answering")

data = {
    "Repository": ["Transformers", "Datasets", "Tokenizers"],
    "Stars": ["36542", "4512", "3934"],
    "Contributors": ["651", "77", "34"],
    "Programming language": ["Python", "Python", "Rust, Python and NodeJS"],
}

queries = "What repository has the largest number of stars?"
table = pd.DataFrame.from_dict(data)

output = tqa_pipeline(table, queries)
# {'answer': 'Transformers', 'coordinates': [(0, 0)], 'cells': ['Transformers']}

If you want to use a CSV file, you can do that too; here’s the previous example converted to CSV and saved in ~/pipeline.csv:

Repository,Stars,Contributors,Programming language
Transformers,36542,651,Python
Datasets,4512,77,Python
Tokenizers,3934,34,"Rust, Python and NodeJS"

Here’s how I would do it (note the type conversion):

from transformers import pipeline
import pandas as pd

tqa_pipeline = pipeline("table-question-answering")

queries = "What repository has the largest number of stars?"
# Convert everything to a string, as the tokenizer can only handle strings
table = pd.read_csv("~/pipeline.csv").astype(str)

output = tqa_pipeline(table, queries)
# {'answer': 'Transformers', 'coordinates': [(0, 0)], 'cells': ['Transformers']}
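
To see why the type conversion matters, here is a small sketch that only uses pandas (no model download). It reads the same CSV content from an in-memory string instead of ~/pipeline.csv; the variable names csv_text, raw, stars_is_numeric and all_strings are just for illustration:

```python
import io

import pandas as pd

# Same CSV content as above, held in memory instead of ~/pipeline.csv
csv_text = """Repository,Stars,Contributors,Programming language
Transformers,36542,651,Python
Datasets,4512,77,Python
Tokenizers,3934,34,"Rust, Python and NodeJS"
"""

raw = pd.read_csv(io.StringIO(csv_text))

# read_csv infers an integer dtype for the numeric columns
stars_is_numeric = pd.api.types.is_integer_dtype(raw["Stars"])

# Cast every column to string before handing the table to the pipeline
table = raw.astype(str)

# Check that every cell is now a Python str
all_strings = all(isinstance(value, str) for value in table.to_numpy().ravel())

print(raw.dtypes)
print(stars_is_numeric, all_strings)
```

The first example did not need this step because the dict values were already written as strings ("36542" rather than 36542).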

Hope that helps!

That’s incredibly useful, thanks @lysandre! :pray:

I’m playing with several datasets, and I have to say that getting the right answers is sometimes challenging.

Is there any guidance on how to phrase queries for basic operations (e.g. mean, median, etc.)?

Thanks again!
Charly