Pretrain and Fine Tune Byte-level model for multilingual extractive QA (Like ByT5)

sidicity · June 24, 2021, 4:37am

I am really fascinated by the byte-level models that have been proposed recently, such as ByT5 and Canine, which get rid of tokenizers. I have tried probing the ByT5 model to check whether it learns morphological and syntactic features.
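For anyone new to byte-level models, here is a minimal sketch of what "no tokenizer" means in practice. It assumes the ID layout usually described for ByT5 (three reserved IDs for pad, eos and unk, followed by the 256 byte values); the function names are just illustrative and this is not the library's actual implementation.

```python
from typing import List

# Assumed layout: 0 = pad, 1 = eos, 2 = unk, then one ID per byte value.
NUM_SPECIAL_TOKENS = 3
EOS_ID = 1


def encode_bytes(text: str) -> List[int]:
    """Map text to IDs: one ID per UTF-8 byte, shifted past the special IDs."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping the reserved special IDs."""
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if i >= NUM_SPECIAL_TOKENS)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Non-Latin scripts take several bytes per character, so sequences get long.
    for sample in ["hello", "नमस्ते"]:
        ids = encode_bytes(sample)
        print(sample, len(sample), "chars ->", len(ids), "ids")
```

The vocabulary is fixed and language-independent, but the sequences are noticeably longer for scripts whose characters take multiple UTF-8 bytes.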

Project Proposal: I intend to pre-train the ByT5 model from scratch on a few languages (it is currently trained on the mC4 corpus, which covers 100+ languages) and then fine-tune the model for zero-shot extractive QA. More information can be found here

Goal: At the end, I would like to have insights into how the ByT5 model performs when random synthetic noise is added to the data (see the last part of the experimental section of the paper for more information).
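To make the noise idea concrete, here is a rough sketch of three character-level corruptions (drop, repeat, random case), loosely modeled on the kinds of noise discussed in the paper. The function names, probabilities and defaults are placeholders of mine, not the paper's exact settings.

```python
import random
from typing import Optional


def drop_chars(text: str, p: float = 0.1, rng: Optional[random.Random] = None) -> str:
    """Drop each character independently with probability p."""
    rng = rng or random.Random(0)
    return "".join(ch for ch in text if rng.random() >= p)


def repeat_chars(text: str, p: float = 0.2, max_repeats: int = 3,
                 rng: Optional[random.Random] = None) -> str:
    """With probability p, follow a character with 1..max_repeats extra copies."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < p:
            out.append(ch * rng.randint(1, max_repeats))
    return "".join(out)


def random_case(text: str, rng: Optional[random.Random] = None) -> str:
    """Flip each character to upper or lower case with equal probability."""
    rng = rng or random.Random(0)
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in text)


if __name__ == "__main__":
    question = "Which city is the capital of India?"
    for noise in (drop_chars, repeat_chars, random_case):
        print(noise.__name__, "->", noise(question, rng=random.Random(42)))
```

One thing to watch for with extractive QA: dropping or repeating characters in the context shifts the answer offsets, so it may be simpler to corrupt only the question, or to recompute the answer spans after noising.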

Languages: As this is a multilingual project, I am considering a set of languages that belong to one family (like the Indian languages or the European languages). The dataset would be a subset of mC4 (this can change if we get more data).

Dataset for Multilingual QA: MLQA or X

Scripts: The fine-tuning script is available in the Transformers library; the pretraining script is available in the official repo of the model.

Challenges: Getting the pretraining script running with JAX may be a challenge, though I am not sure how challenging it would be.

bhavnicksm · June 24, 2021, 12:07pm

Hey there, @sidicity,

The project idea sounds interesting and I have been wanting to work on training Multi-lingual models myself.

Count me in!

chewkokwah · June 27, 2021, 5:28am

Count me in.

taisazero · June 28, 2021, 3:57am

I’m interested! Hope I can also contribute a bit.

ceyda · June 28, 2021, 5:30am

The effect of tokenization on extractive QA is one of my past research interests. I would be very interested in seeing the results from this experiment, even though I think I will participate in another topic for this event.

junhsss · June 28, 2021, 7:49am

@sidicity Love the idea. As a highly agglutinative language user, dealing with current count-based tokenizers is troublesome and doesn’t even seem right. I would love to join this project.

sidicity · June 28, 2021, 9:55am

Transfer learning on the encoder side has been explored to a great extent in various papers, but I am not sure whether many papers do transfer learning on the decoder side, especially when the decoder is generative. For a sequence labeling problem, labels can be transferred from one language to another, but with a generative decoder that seems like a difficult problem. Any pointers here, @admins?

patrickvonplaten · June 28, 2021, 4:59pm

Really like to see this project! Think most of the code exists (the T5-like pretraining code will be merged today/tomorrow for JAX, see: https://github.com/huggingface/transformers/pull/12355).

It will be important to decide on which languages exactly ByT5 should be pretrained on! Think it’s a good idea to make sure that the languages are related/close to each other.

Putting everybody that is interested in an official project for now

patrickvonplaten · June 28, 2021, 5:04pm

So, maybe it makes sense to limit the project to:

Pretrain ByT5-base on a family of languages (to be defined)
Fine-tune the pretrained model on extractive QA
Evaluate on corrupted input
(maybe) compare to officially released multi-lingual checkpoints
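For the "evaluate on corrupted input" step, a rough sketch of how the drop in QA quality could be measured is below. It uses a simplified SQuAD-style exact match and token-level F1 (whitespace tokens, lowercasing only), which is an assumption on my side; the official MLQA evaluation applies language-specific normalization that is not reproduced here. `robustness_gap` is a hypothetical helper name.

```python
from collections import Counter
from typing import List


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 over whitespace tokens (simplified normalization)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def robustness_gap(clean_preds: List[str], noisy_preds: List[str],
                   golds: List[str]) -> float:
    """Mean F1 on clean inputs minus mean F1 on corrupted inputs."""
    clean = sum(token_f1(p, g) for p, g in zip(clean_preds, golds)) / len(golds)
    noisy = sum(token_f1(p, g) for p, g in zip(noisy_preds, golds)) / len(golds)
    return clean - noisy


if __name__ == "__main__":
    golds = ["new delhi", "1947"]
    clean = ["new delhi", "1947"]   # predictions on clean questions
    noisy = ["delhi", "1947"]       # predictions on corrupted questions
    print("F1 gap:", robustness_gap(clean, noisy, golds))
```

A smaller gap would suggest the model is more robust to the noise; comparing the gap against a subword-based checkpoint would need the same metric on both sides.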

Vaibhavbrkn · June 28, 2021, 6:33pm

Would love to contribute to this project.

ibraheemmoosa · June 30, 2021, 2:48pm

I’m also interested in this one. I have trained multilingual models for Indian languages, but we transliterated to a common script and then used SentencePiece for tokenization. It would be great to use a model that can automatically figure out how to tokenize.

patrickvonplaten · July 1, 2021, 10:25am

Great, added both of you!

patrickvonplaten · July 2, 2021, 12:26am

Did you guys set up a discord channel yet?

Giving you guys directly two TPUs tomorrow! Split the team randomly into two in the official google sheet, but this shouldn’t change anything - just that you have access to 2 TPU v3-8s

Might make organization a bit easier to split work on two VMs!

chewkokwah · July 2, 2021, 9:53pm

Yes, we have set up a Discord channel, Flax-HuggingFace-Community-Week. On the TPU side, we only received 1 TPU name for our team (which I believe corresponds to only 1 TPU v3-8). Is the second TPU coming soon, or do we need to get a second person to put their name in the second slot of the TPU request sign-up form?
