Covid19 adverse event detection

Covid19 Related Question Answering (Closed book question answering)

In 2020, COVID-19, which is caused by a coronavirus called SARS-CoV-2, took over the world. It touched the lives of many people and caused a lot of hardship. There are still many open questions about COVID-19, and it is often difficult to get the right answers. The aim of this project is to finetune models for closed-book question answering. In closed-book QA, we feed the model a question without any context or access to external knowledge and train it to predict the answer. Since the model doesn’t receive any context, the primary way it can learn to answer these questions is from the “knowledge” it obtained during pre-training [1] [2].
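To make the closed-book setup concrete, here is a minimal sketch of how a QA record could be turned into a text-to-text training pair. The record layout, the `to_closed_book_example` helper, and the `question: ` prefix are assumptions made for illustration; the project does not prescribe them. The point is only that the context is dropped, so the model sees nothing but the question.

```python
def to_closed_book_example(record):
    """Turn a QA record into an (inputs, targets) pair without any context."""
    # Collapse stray whitespace so the text is clean for tokenization.
    question = " ".join(record["question"].split())
    answer = " ".join(record["answer"].split())
    # The "question: " prefix is an assumed convention, not a project requirement.
    # Any "context" field in the record is deliberately ignored.
    return {"inputs": f"question: {question}", "targets": answer}


# Hypothetical record; the fact itself comes from the description above.
record = {
    "question": "Which virus causes COVID-19?",
    "context": "COVID-19 is caused by a coronavirus called SARS-CoV-2.",
    "answer": "SARS-CoV-2",
}

example = to_closed_book_example(record)
print(example["inputs"])
print(example["targets"])
```

In an open-book setup the context string would be appended to the inputs; here it is discarded on purpose.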

The main goals of this project are:

  1. Train a model for question answering about COVID-19
  2. Release the top performing models for further research and enhancement
  3. Release all of the preprocessing and postprocessing scripts and findings for future research.

2. Language

The model will be trained on English text, and the code will be written in Python.

3. Model

Candidate models are a pretrained T5-large or BERT variants (e.g. BioBERT or BioClinicalBERT) with a sequence generation head.

4. Datasets

The following datasets will be used for finetuning the model. Note that the last dataset is optional and that the model is evaluated only on Covid-QA.

  1. Covid-QA
  2. CDC-QA
  3. Optional - Trivia-QA

5. Training scripts

We can make use of:

  1. For preprocessing and mixing datasets
  2. For T5 training

8. (Optional) Reads

The following links can be useful to better understand the project and what has previously been done.

That’s a super nice and detailed description! I’d really like this project to take place :slight_smile: Are there other ways we can promote this project? :slight_smile:

Hi @patrickvonplaten, I am a data science lead at Bayer AG. I will spread the word and see whether anyone would be interested in joining this activity :slight_smile: How many supporters does this need to go forward?

Hi @hooman650, the project is a good idea and I would very much like to participate in the development of this project.

This is a great project! Happy to join if I can be of help!

I’m curious if training a language model first on CORD-19 before doing fine-tuning would help.

Also since the COVID QA datasets are relatively small, would it be worthwhile to train on a generic QA dataset (e.g. SQuAD) before training on COVID QA? Or is there a way to mix the datasets to make a more robust model (i.e. for every 5 samples in COVID QA, throw in 1 sample of SQuAD)?
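As a rough illustration of the "for every 5 samples in COVID QA, throw in 1 sample of SQuAD" idea, here is a small sketch in plain Python. The `interleave` helper and the sample names are hypothetical, and a real pipeline would likely shuffle and batch as well.

```python
from itertools import cycle


def interleave(primary, auxiliary, primary_per_aux=5):
    """Yield primary items, adding one auxiliary item after every
    `primary_per_aux` primary items."""
    auxiliary = list(auxiliary)
    # Cycle through the auxiliary samples so they are reused if they run out.
    aux_iter = cycle(auxiliary) if auxiliary else None
    for count, item in enumerate(primary, start=1):
        yield item
        if aux_iter is not None and count % primary_per_aux == 0:
            yield next(aux_iter)


# Placeholder sample names, standing in for real QA examples.
covid_qa = [f"covid-{i}" for i in range(10)]
squad = ["squad-0", "squad-1"]

mixed = list(interleave(covid_qa, squad))
print(mixed)
```

With 10 COVID QA samples and a 5:1 ratio, two SQuAD samples end up in the mixed stream, one after each block of five.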

I know next to nothing about fine-tuning QA models, so maybe this approach is already well established as being intractable.

Awesome finalizing this :slight_smile:

@nbroad great points, Nicolas! I am not sure we will have the bandwidth to do pretraining (or intermediate training), but CORD-19 sounds quite nice! For finetuning, we will definitely need a mixing approach as explained in this Colab. I think to save time we could follow that approach and mix Covid QA, CDC QA, SQuAD and TriviaQA. We could use SeqIO for this mixing. Would you like to be a part of this project?
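To give a feel for what rate-based mixing means, here is a plain-Python sketch that does not use SeqIO itself. It computes examples-proportional mixing rates with a size cap, which is one of the strategies described in the T5 paper, and then samples which dataset each training example is drawn from. The dataset sizes, the cap, and both helper functions are placeholders for illustration, not the real numbers or the project's actual setup.

```python
import random


def mixing_rates(sizes, limit):
    """Examples-proportional rates, with every dataset size capped at `limit`
    so that large datasets do not drown out the small ones."""
    capped = {name: min(size, limit) for name, size in sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}


def sample_sources(rates, k, seed=0):
    """Pick the source dataset for each of `k` training examples."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    names = list(rates)
    return rng.choices(names, weights=[rates[n] for n in names], k=k)


# Placeholder sizes for illustration only, NOT the real dataset sizes.
sizes = {"covid_qa": 2_000, "cdc_qa": 500, "squad": 80_000, "trivia_qa": 90_000}

rates = mixing_rates(sizes, limit=5_000)
sources = sample_sources(rates, k=20)
print(rates)
print(sources)
```

A lower cap pushes the mixture toward the small COVID datasets, while a higher cap lets the generic QA datasets dominate.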

This sounds very interesting! I’ve been working on a similar task of Question Generation using T5 and BERT, and this would probably be an ideal extension to that. I’ve no prior experience with FLAX/JAX, so it would also be great to learn about them. I’m interested in joining and contributing if there’s still space. Looking forward to it :slight_smile:

I’m a Master’s student and live in Canada. PT/PDT time zone.

Hello @hooman650,
I am interested to be a part of this wonderful project…

added you @Shravanthi and @srisweet :slight_smile:

I’m a Master’s student from India.
Very meaningful project. I have never worked on the task of Question Generation. There will be a lot to learn for me. I am interested in working on this project.

I just created a channel for this project on Discord. Please feel free to join; we will be discussing the project logistics and planning there.

Added you @Ankit-Kumar-Saini :slight_smile:

Hey all
I am a beginner with Transformers and FLAX :sweat_smile: and want to get into Transformers via a project. Basically I would like to be able to work on getting a minimal training pipeline built and finetune a model using HuggingFace’s awesome resources

Most of my background is CV-related work in PyTorch, but I do like this idea and am happy to get involved. Please let me know if you think I can help.

Hi @patrickvonplaten, I have worked on the CORD-19 dataset as part of a Kaggle challenge and am fairly well acquainted with COVID-19 knowledge sources. Could you add me to the group/community? Thanks in advance.

@patrickvonplaten our project still does not have access to TPU VMs. I sent you my Gmail details and IDs via private message in Slack.

Hi,

The title says “adverse event detection” but based on your approach it looks like you want to try “closed book question answering”.

If “adverse event detection” is all you need, then you might want to train a simple encoder-only model like pubmed_bert, released by Microsoft, on an entity recognition task.

But if you want to identify answers to other questions, including “adverse events”, using a QA approach, you might first have to pretrain a seq2seq model with a broad vocabulary on the biomedical domain and later finetune it for the QA task.

I think closed-book QA is already a challenging task, and if we use a model that is not pretrained on the biomedical domain, we might not get the desired results.

Still there is no harm in trying, just sharing details based on my experience. Ignore if not useful :wink:
Cheers !!

@patrickvonplaten can I still be added to this project?

Thanks for your comments! Actually, ADE was my first intention, but I later decided to go with closed-book QA. Unfortunately, I was not able to update the title! We are aware of the steps.

We are hoping to first pretrain T5 on some medical domain corpora and then finetune on available mixed QA datasets. Feel free to join the discord channel to learn more.
