Pretrain and fine-tune a byte-level model (like ByT5) for multilingual extractive QA

I am really fascinated by the byte-level models that have been proposed recently, such as ByT5 and CANINE, which get rid of tokenizers. I have tried probing the ByT5 model to check whether it learns morphological and syntactic features.
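
To make "getting rid of tokenizers" concrete, here is a minimal sketch of byte-level encoding in plain Python. It assumes the ByT5 convention that ids 0, 1 and 2 are reserved for pad, EOS and unk, so every UTF-8 byte is shifted by 3; the `encode` and `decode` helpers are hypothetical names for illustration, not the library API.

```python
from typing import List

# Assumption: ids 0 (pad), 1 (EOS), 2 (unk) are reserved, so bytes start at id 3.
NUM_SPECIAL = 3
EOS_ID = 1


def encode(text: str) -> List[int]:
    """Map text to ids: one id per UTF-8 byte, followed by EOS."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: List[int]) -> str:
    """Invert encode(), skipping special ids and anything outside the byte range."""
    raw = bytes(i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < NUM_SPECIAL + 256)
    return raw.decode("utf-8", errors="ignore")


# Non-Latin scripts cost more ids per character: each Devanagari
# code point takes 3 bytes in UTF-8, while ASCII letters take 1.
for word in ["hello", "हिंदी"]:
    ids = encode(word)
    print(word, len(word), "chars ->", len(ids), "ids")
```

One practical consequence is that sequences get noticeably longer for scripts whose characters need several bytes, which matters when choosing the maximum input length for QA contexts.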

Project Proposal: I intend to pre-train the ByT5 model from scratch on a few languages (it is currently trained on the mC4 corpus, which has 100+ languages) and then fine-tune the model for zero-shot extractive QA. More information can be found here

Goal: At the end, I would like to have insights into how the ByT5 model performs when random synthetic noise is added to the data (see the last part of the experimental section of the paper for more information).
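
As a rough illustration of what "random synthetic noise" could look like, here is a small sketch with three character-level corruptions (dropping, repeating, and random casing). These are only loosely in the spirit of the noising described in the paper; the function names and the probabilities are my own assumptions.

```python
import random


def drop_chars(text: str, p: float, rng: random.Random) -> str:
    """Drop each character independently with probability p."""
    return "".join(ch for ch in text if rng.random() >= p)


def repeat_chars(text: str, p: float, rng: random.Random) -> str:
    """Duplicate each character independently with probability p."""
    return "".join(ch * 2 if rng.random() < p else ch for ch in text)


def random_case(text: str, rng: random.Random) -> str:
    """Set each character to upper or lower case with equal probability."""
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in text)


rng = random.Random(0)  # fixed seed so corrupted eval sets are reproducible
question = "Where was the treaty signed?"
print(drop_chars(question, 0.1, rng))
print(repeat_chars(question, 0.2, rng))
print(random_case(question, rng))
```

For extractive QA, corrupting the context shifts the character offsets of the gold answer, so it is probably simplest to start by corrupting only the question, or to recompute answer spans after noising.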

Languages: As this is a multilingual project, I am considering a set of languages that belong to one family (like the Indian languages or European languages), and the dataset would be a subset of the mC4 dataset (this can change if we get more data).

Dataset for Multilingual QA: MLQA or X

Scripts: The fine-tuning script is available in the Transformers library; the pretraining script is available in the official repo of the model.

Challenges: Getting the pretraining script running with JAX may be a challenge, though I am not sure how challenging it would be.


Hey there, @sidicity,

The project idea sounds :sparkles: interesting :sparkles: and I have been wanting to work on training Multi-lingual models myself.

Count me in!

Count me in.

@sidicity Hi! I’ve personally been working on byte-level language models based on the ByT5 paper. I think it would be great if I could contribute to this project!

I’m interested! Hope I can also contribute a bit.

The effect of tokenization on extractive QA is one of my past research interests, so I would be very interested in seeing the results from this experiment, even though I think I will participate in another topic for this event.


@sidicity Love the idea. As a highly agglutinative language user, dealing with current count-based tokenizers is troublesome and doesn’t even seem right. I would love to join this project. :+1:t2:

@sidicity, this looks interesting. One thing I was wondering is whether transfer learning is possible with tokenizer-free models for translation. Say the base model is trained for English to German; it would be interesting to see whether we can use transfer learning to translate from English to Hindi, etc. Looking forward to more thoughts and ideas on this.

Transfer learning on the encoder side has been explored to a great extent in various papers, but I am not sure whether many papers do transfer learning on the decoder side, especially when the decoder is generative. With a sequence labeling problem, labels can be transferred from one language to another, but with a generative decoder that seems like a difficult problem. Any pointers here, @admins?

Would really like to see this project! I think most of the code exists (the T5-like pretraining code for JAX will be merged today/tomorrow, see:

It will be important to decide exactly which languages ByT5 should be pretrained on! I think it’s a good idea to make sure that the languages are related/close to each other.

Putting everybody that is interested in an official project for now :slight_smile:


So, maybe it makes sense to limit the project to:

  1. Pretrain ByT5-base on a family of languages (to be defined)
  2. Fine-tune the pretrained model on extractive QA
  3. Evaluate on corrupted input
  4. (maybe) compare to officially released multi-lingual checkpoints
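
For step 3, the comparison between clean and corrupted input could be scored with SQuAD-style exact match and token-level F1. Below is a simplified sketch: it lowercases, strips ASCII punctuation, and splits on whitespace, which is an assumption that will not hold for every language (the official MLQA evaluation uses language-specific normalization). The function names are hypothetical.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris.", "paris"), f1("in Paris", "Paris"))
```

Running the same metric over predictions for clean and noised questions gives a direct measure of how much each type of corruption hurts.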

Would love to contribute in this project.


I’m also interested in this one. I have trained multilingual models for Indian languages, but we transliterated everything to a common script and then used SentencePiece for tokenization. It would be great to use a model that can automatically figure out how to tokenize.

Great, added both of you :slight_smile:

Did you guys set up a discord channel yet?

Giving you guys two TPUs directly tomorrow! I split the team randomly into two in the official Google sheet, but this shouldn’t change anything; it just means you have access to 2 TPU v3-8s :slight_smile:

Might make organization a bit easier to split work on two VMs!

Yes, we have set up a Discord channel, Flax-HuggingFace-Community-Week. On the TPU side, we only received 1 TPU name for our team, which I believe corresponds to only 1 TPU v3-8. Is the second TPU coming soon, or do we need to get a second person to put their name in the second slot of the TPU request sign-up form?