Pre-train a language model for non-autoregressive generation at scale.
Recent pre-trained language models such as BERT and ELECTRA improve performance on a wide range of natural language tasks. These approaches focus on scaling up to massive training data and on making pre-training efficient. However, the meaning of their training objectives is not well understood, which makes it difficult to take the next step beyond them. We therefore need a different way to build a pre-trained language model with better interpretability.
The ability to generate language can improve a model's interpretability, since we can inspect what it generates. Language models that generate text auto-regressively, such as XLNet, have been proposed, but they emit tokens one at a time, which makes decoding slow.
In this project, we implement a language model that is (1) theoretically motivated, (2) trained in a highly parallelized fashion, and (3) equipped with a fast decoding scheme. We follow a recent paper and bring its technique to the pre-training stage. We believe this model can give the research community a better understanding of pre-training while efficiently utilizing state-of-the-art accelerators such as TPUv3.
Transformer layers combined with a conditional random field (CRF). The detailed architecture is described in this paper.
We use the model architecture provided by huggingface; the additional conditional random field layer will be implemented as part of this project.
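As a rough illustration, the sketch below wraps a huggingface encoder with a linear-chain CRF head. The class name `TransformerCRF`, the low-rank transition factorization, and all hyperparameters are our assumptions, not details taken from the paper. Note also that the exact forward algorithm shown here materializes a dense V x V transition tensor, which is only feasible for a small vocabulary; a real implementation would need the approximations the paper describes.

```python
# Sketch only: a Transformer encoder with a linear-chain CRF head.
# Assumes PyTorch and the huggingface `transformers` library.
import torch
import torch.nn as nn
from transformers import AutoModel

class TransformerCRF(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", vocab_size=30522, rank=32):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.emissions = nn.Linear(hidden, vocab_size)  # per-position token scores
        # A full V x V transition matrix is too large for a 30k vocabulary,
        # so we factorize it into two low-rank embeddings (an assumption here).
        self.trans_left = nn.Parameter(torch.randn(vocab_size, rank) * 0.02)
        self.trans_right = nn.Parameter(torch.randn(vocab_size, rank) * 0.02)

    def transition(self):
        # (V, V) pairwise token-transition scores from the low-rank factors.
        return self.trans_left @ self.trans_right.t()

    def log_likelihood(self, input_ids, attention_mask, target_ids):
        # Padding-mask handling inside the CRF is omitted for brevity.
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emit = self.emissions(h)                       # (B, T, V)
        trans = self.transition()                      # (V, V)
        B, T, V = emit.shape
        # Score of the gold sequence: emission scores plus pairwise transitions.
        gold_emit = emit.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1).sum(1)
        gold_trans = trans[target_ids[:, :-1], target_ids[:, 1:]].sum(1)
        # Partition function via the forward algorithm in log space. The
        # (B, V, V) intermediate below is exact but only tractable for small V.
        alpha = emit[:, 0]                             # (B, V)
        for t in range(1, T):
            alpha = torch.logsumexp(alpha.unsqueeze(2) + trans.unsqueeze(0), dim=1)
            alpha = alpha + emit[:, t]
        log_z = torch.logsumexp(alpha, dim=1)
        return gold_emit + gold_trans - log_z          # (B,) log-likelihoods
```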
The same dataset used to train BERT; we expect to use the Wikipedia corpus.
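A minimal sketch of how the corpus could be prepared with the huggingface `datasets` library; the dump version `20220301.en` and the fixed 512-token padding (convenient for the static tensor shapes TPUs prefer) are our choices, not requirements from the paper.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
wiki = load_dataset("wikipedia", "20220301.en", split="train")

def tokenize(batch):
    # Fixed-length padding keeps tensor shapes static across batches.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

wiki = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)
```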
Training scripts will be created as part of this project.
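A hypothetical skeleton of such a script for a single TPU core via torch_xla, reusing the `TransformerCRF` model and tokenized `wiki` dataset sketched above; multi-core spawning, checkpointing, learning-rate scheduling, and BERT-style input corruption are all omitted.

```python
import torch
import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader

device = xm.xla_device()                       # single TPU core
model = TransformerCRF().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loader = DataLoader(wiki.with_format("torch"), batch_size=16, shuffle=True)
for batch in loader:
    input_ids = batch["input_ids"].to(device)
    mask = batch["attention_mask"].to(device)
    # Real pre-training would corrupt the inputs (BERT-style masking) and
    # score the originals; we pass the originals on both sides only to keep
    # this skeleton short.
    loss = -model.log_likelihood(input_ids, mask, input_ids).mean()
    optimizer.zero_grad()
    loss.backward()
    xm.optimizer_step(optimizer)               # all-reduce gradients, then step
```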
A new pre-trained architecture that generates long text quickly via non-autoregressive decoding.
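To make that claim concrete, here is a hedged sketch of the decoding step: a single encoder pass produces emission scores for every position at once, and Viterbi then recovers the best token sequence under the CRF in one shot rather than token by token. As with the model sketch above, the dense V x V transitions limit this exact version to small vocabularies.

```python
import torch

@torch.no_grad()
def viterbi_decode(model, input_ids, attention_mask):
    # One parallel encoder pass scores every position simultaneously.
    h = model.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
    emit = model.emissions(h)                           # (B, T, V)
    trans = model.transition()                          # (V, V)
    B, T, V = emit.shape
    score, backptrs = emit[:, 0], []
    for t in range(1, T):
        cand = score.unsqueeze(2) + trans.unsqueeze(0)  # (B, V_prev, V_cur)
        score, idx = cand.max(dim=1)                    # best predecessor per token
        score = score + emit[:, t]
        backptrs.append(idx)
    best = score.argmax(dim=1)                          # (B,) final tokens
    tokens = [best]
    for idx in reversed(backptrs):                      # trace the path backwards
        best = idx.gather(1, best.unsqueeze(1)).squeeze(1)
        tokens.append(best)
    return torch.stack(tokens[::-1], dim=1)             # (B, T) token ids
```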