I am really fascinated by byte-level models that have been proposed recently, such as ByT5 and CANINE, which get rid of tokenizers. I have tried probing the ByT5 model to check whether it learns morphological and syntactic features.
Project Proposal: I intend to pre-train the ByT5 model from scratch for a few languages (currently it is trained on the mC4 corpus, which covers 100+ languages) and then fine-tune the model for zero-shot extractive QA. More information can be found here
Goal: At the end, I would like to have insights into how the ByT5 model performs when random synthetic noise is added to the data (see the last part of the experimental section of the paper for more information).
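To make the noise experiment concrete, here is a minimal sketch of byte-level corruption of the kind described (the exact noise operations and rate in the paper may differ; the `add_byte_noise` helper and its drop/duplicate/replace scheme are my own illustration):

```python
import random

def add_byte_noise(text: str, noise_rate: float = 0.05, seed: int = 0) -> str:
    """Corrupt a string at the byte level: each UTF-8 byte is independently
    dropped, duplicated, or replaced with probability `noise_rate`."""
    rng = random.Random(seed)
    noisy = bytearray()
    for b in text.encode("utf-8"):
        if rng.random() < noise_rate:
            op = rng.choice(["drop", "dup", "replace"])
            if op == "drop":
                continue           # byte deleted
            elif op == "dup":
                noisy += bytes([b, b])   # byte doubled
            else:
                noisy.append(rng.randrange(256))  # random byte substituted
        else:
            noisy.append(b)
    # Corruption can produce invalid UTF-8, so decode leniently
    return noisy.decode("utf-8", errors="replace")

print(add_byte_noise("extractive question answering", noise_rate=0.1))
```

A subword tokenizer would map such corrupted words to rare or unknown pieces, while a byte-level model sees only locally perturbed inputs, which is exactly the robustness contrast the experiment probes.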
Languages: As this is a multilingual project, I am considering a set of languages that belong to one family (like the Indian languages or the European languages), and the dataset used would be part of the mC4 dataset (this can be changed if we get more data).
The effect of tokenization on extractive QA is one of my past research interests, so I would be very interested in seeing the results from this experiment, even though I think I will participate in another topic for this event.
@sidicity Love the idea. As a speaker of a highly agglutinative language, dealing with current count-based tokenizers is troublesome and doesn't even seem right. I would love to join this project.
Transfer learning on the encoder side has been explored to a great extent in various papers, but I am not sure there are many papers that do transfer learning on the decoder side, especially when the decoder is generative. If we had a sequence labeling problem, then transferring labels from one language to another is possible, but in the case of a generative decoder that seems like a difficult problem. Any pointers here, @admins?
It will be important to decide exactly which languages ByT5 should be pretrained on! I think it's a good idea to make sure that the languages are related/close to each other.
Putting everybody who is interested into an official project for now
I'm also interested in this one. I have trained multilingual models for Indian languages, but we transliterated to a common script and then used SentencePiece for tokenization. It would be great to use a model that can automatically figure out how to tokenize.
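For what it's worth, a byte-level model sidesteps the common-script transliteration step entirely, since every script reduces to the same 256 byte values. A minimal sketch of ByT5-style byte "tokenization" (the offset of 3 reserved ids mirrors ByT5's pad/eos/unk convention; these helper names are my own, not the library's API):

```python
def byte_tokenize(text: str, offset: int = 3) -> list[int]:
    """Encode text as UTF-8 bytes shifted by a small offset reserved
    for special tokens (pad/eos/unk), in the spirit of ByT5."""
    return [b + offset for b in text.encode("utf-8")]

def byte_detokenize(ids: list[int], offset: int = 3) -> str:
    """Invert byte_tokenize: shift ids back and decode as UTF-8."""
    return bytes(i - offset for i in ids).decode("utf-8", errors="replace")

# Works for any script without a learned vocabulary:
ids = byte_tokenize("नमस्ते")  # Devanagari, multi-byte UTF-8
assert byte_detokenize(ids) == "नमस्ते"
```

No vocabulary is learned at all, so Indian languages in different scripts can share one model without transliteration, at the cost of longer sequences (one id per byte rather than per subword).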
Giving you guys two TPUs directly tomorrow! Split the team randomly into two in the official Google sheet, but this shouldn't change anything; you will just have access to 2 TPU v3-8s.
Might make organization a bit easier to split work on two VMs!
Yes, we had set up a Discord channel, Flax-HuggingFace-Community-Week. On the TPU side, we only received 1 TPU name for our team (which I believe should correspond to only 1 TPU v3-8). Is the second TPU coming soon, or do we need a second person to put their name in the second slot of the TPU request sign-up form?