COVID-19 Related Question Answering (Closed-Book Question Answering)
In 2020, COVID-19, the disease caused by the coronavirus SARS-CoV-2, took over the world. It touched the lives of many people and caused great hardship for humanity. There are still many open questions regarding COVID-19, and it is often difficult to get the right answers. The aim of this project is to fine-tune models for closed-book question answering. In closed-book QA, we feed the model a question without any context or access to external knowledge and train it to predict the answer. Since the model doesn’t receive any context, the primary way it can learn to answer these questions is from the “knowledge” it obtained during pre-training [1][2].
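As a concrete illustration of the closed-book setup, each training example reduces to a bare (question, answer) string pair with no supporting passage attached. A minimal sketch (the `to_closed_book_example` helper and its `question:` prompt prefix are illustrative, not the project's actual preprocessing):

```python
def to_closed_book_example(question, answer):
    """Format a QA pair for text-to-text training.

    There is deliberately no `context` field: in closed-book QA the model
    must answer purely from knowledge acquired during pre-training.
    """
    return {
        "input": f"question: {question.strip()}",
        "target": answer.strip(),
    }

example = to_closed_book_example("Which virus causes COVID-19?", "SARS-CoV-2")
print(example["input"])   # question: Which virus causes COVID-19?
print(example["target"])  # SARS-CoV-2
```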
The main goals of this project are:
Train a model for question answering regarding COVID-19
Release the top-performing models for further research and enhancement
Release all of the preprocessing and postprocessing scripts and findings for future research
2. Language
The model will be trained on English-language data; the codebase will be written in Python.
3. Model
Possible candidates are a pretrained T5-large model, or a BERT variant (e.g. BioBERT or BioClinicalBERT) with a sequence-generation head.
4. Datasets
The following datasets will be used for fine-tuning the model. Note that the last dataset is optional, and the model is evaluated only on Covid-QA.
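The post does not spell out the evaluation metric, but for Covid-QA the usual choice is SQuAD-style Exact Match and token-level F1 against the reference answer. A self-contained sketch of that computation (the normalization rules mirror the SQuAD evaluation script; treat this as an assumption, not the project's confirmed metric):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-overlap F1 between normalized prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The SARS-CoV-2 virus", "SARS-CoV-2 virus"))   # 1.0
print(round(f1_score("SARS-CoV-2", "the SARS-CoV-2 virus"), 3))  # 0.667
```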
Hi @patrickvonplaten, I am a data science lead at Bayer AG. I will spread the word and see whether anyone would be interested in joining this activity. How many supporters does this need to go forward?
I’m curious if training a language model first on CORD-19 before doing fine-tuning would help.
Also, since the COVID QA datasets are relatively small, would it be worthwhile to train on a generic QA dataset (e.g. SQuAD) before training on COVID QA? Or is there a way to mix the datasets to make a more robust model (e.g. for every 5 samples of COVID QA, throw in 1 sample of SQuAD)?
I know relatively nothing about fine-tuning QA models – maybe this approach is already well established as being intractable.
@nbroad great points, Nicolas! I am not sure if we will have the bandwidth to do pretraining (or intermediate training), but CORD-19 sounds quite nice! For fine-tuning, we will definitely need a mixing approach, as explained in this Colab. I think, to save time, we could very much follow that approach and mix in Covid QA, CDC QA, SQuAD and TriviaQA. We could use SeqIO for this mixing. Would you like to be a part of this project?
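The mixing idea can be sketched without SeqIO: sample from each dataset with probability proportional to a per-dataset rate, so the small in-domain set (Covid QA) still dominates each batch over much larger generic sets (SQuAD). A minimal sketch — the 5:1 rate below just echoes the suggestion in this thread, not a tuned value:

```python
import random

def mix_datasets(datasets, rates, n_examples, seed=0):
    """Draw `n_examples` by first sampling a dataset name with probability
    proportional to its rate, then a uniform example from that dataset."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [rates[name] for name in names]
    mixed = []
    for _ in range(n_examples):
        name = rng.choices(names, weights=weights)[0]
        mixed.append((name, rng.choice(datasets[name])))
    return mixed

# Toy stand-ins for the real datasets.
datasets = {
    "covid_qa": [f"covid_example_{i}" for i in range(20)],
    "squad": [f"squad_example_{i}" for i in range(1000)],
}
# Roughly 5 Covid-QA samples for every SQuAD sample.
mixed = mix_datasets(datasets, {"covid_qa": 5, "squad": 1}, n_examples=12)
print(len(mixed))  # 12
```

Note that sampling by rate (rather than concatenating the datasets) keeps SQuAD's sheer size from drowning out the in-domain examples.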
This sounds very interesting! I’ve been working on a similar task, question generation using T5 and BERT, and this would probably be an ideal extension of it. I have no prior experience with FLAX/JAX, so it would also be great to learn about them. I’m interested in joining and contributing if there’s still any space. Looking forward to it!
I’m a Master’s student and live in Canada. PT/PDT time zone.
I’m a Master’s student from India.
A very meaningful project. I have never worked on the task of question generation, so there will be a lot for me to learn. I am interested in working on this project.
Hey all
I am a beginner with Transformers and FLAX and want to get into them via a project. Basically, I would like to work on getting a minimal training pipeline built and fine-tune a model using HuggingFace’s awesome resources.
Most of my background is CV-related, in PyTorch, but I do like this idea and would be happy to get involved. Please let me know if you think I can help.
Hi @patrickvonplaten, I have worked on the CORD-19 dataset as part of a Kaggle challenge and am somewhat acquainted with COVID-19 knowledge sources, so could you add me to the group/community? Thanks in advance.
The title says “adverse event detection” but based on your approach it looks like you want to try “closed book question answering”.
If “adverse event detection” is all you need, then you might want to train a simple encoder-only model, such as PubMedBERT released by Microsoft, on an entity-recognition task.
But if you want to identify answers to other questions, including adverse events, using a QA approach, you might have to first pretrain a seq2seq model on the biomedical domain (one with a broad vocabulary) and later fine-tune it for the QA task.
I think closed-book QA is already a challenging task, and if we use a model that is not pretrained on the biomedical domain, we might not get the desired results.
Still, there is no harm in trying; I am just sharing details based on my experience. Ignore if not useful.
Cheers !!
Thanks for your comments! ADE was actually my first intention, but I later decided to go towards closed-book QA; unfortunately, I was not able to update the title. We are aware of the steps.
We are hoping to first pretrain T5 on some medical-domain corpora and then fine-tune on the available mixed QA datasets. Feel free to join the Discord channel to learn more.
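The pretraining step mentioned here (T5 on medical-domain corpora) would use T5’s span-corruption objective: contiguous spans are replaced by sentinel tokens in the input, and the target reconstructs the dropped spans. A word-level toy sketch — real T5 pretraining operates on SentencePiece subword ids, and span positions are sampled randomly rather than given explicitly:

```python
def span_corrupt(tokens, spans):
    """Apply T5-style span corruption.

    `spans` is a sorted list of non-overlapping (start, end) index pairs
    marking the spans to drop. Returns (input_tokens, target_tokens):
    each dropped span is replaced by a sentinel in the input and
    reproduced after the same sentinel in the target.
    """
    inputs, targets = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[prev:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        prev = end
    inputs.extend(tokens[prev:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets

tokens = "the virus spreads through respiratory droplets".split()
inp, tgt = span_corrupt(tokens, [(1, 2), (4, 5)])
print(" ".join(inp))  # the <extra_id_0> spreads through <extra_id_1> droplets
print(" ".join(tgt))  # <extra_id_0> virus <extra_id_1> respiratory <extra_id_2>
```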