Padding for T5-flax pre-training on protein sequences

Hi everyone,

Sorry if this is potentially the wrong forum (I am not proposing a Flax project but rather have questions about an ongoing effort), but this was the only Flax/HF-related forum I found.
I am currently experimenting with a Flax version of T5 pre-trained on protein sequences using MLM.
For this, I have already adapted the great example provided here and successfully trained for a few steps on a protein sequence database using a domain-specific tokenizer (for simplicity, one amino acid is one token).
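In case it helps, this is roughly what such a character-level tokenizer can look like; the vocabulary, special tokens, and variable names below are only illustrative, not exactly what I use:

```python
from tokenizers import Regex, Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# 20 standard amino acids, one character = one token
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
# T5-style specials: pad, eos, unk plus sentinel tokens for span corruption
SPECIALS = ["<pad>", "</s>", "<unk>"] + [f"<extra_id_{i}>" for i in range(100)]

vocab = {tok: idx for idx, tok in enumerate(SPECIALS + AMINO_ACIDS)}
backend = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))
# split the raw sequence into single characters so each residue maps to one id
backend.pre_tokenizer = pre_tokenizers.Split(Regex("."), behavior="isolated")

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
    additional_special_tokens=[f"<extra_id_{i}>" for i in range(100)],
)

print(tokenizer("MKTAYIAKQR").input_ids)
```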
Now I am trying to adjust the padding, because in its current version the script accumulates multiple sentences by concatenating them. This makes sense in NLP when the sentences come from the same document, but since protein sequences are usually unrelated, concatenating them will most likely cause problems (proteins split across multiple chunks/batches, etc.).
Instead of re-inventing the wheel, I wondered whether someone has already implemented something like fixed-length padding for the t5_mlm_flax pre-training script (from what I understand, the static shapes required by JAX/Flax rule out dynamic batching, but ideally I would simply pad all shorter sequences to max_length instead of concatenating multiple proteins).
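To make the idea concrete, here is a minimal sketch of how I imagine replacing the concatenation/group_texts step, assuming the variable names from run_t5_mlm_flax.py (expanded_inputs_length, text_column_name, datasets); treat it as a rough idea rather than tested code:

```python
def tokenize_and_pad(examples):
    # One protein per example: truncate long sequences and pad short ones
    # to a fixed length, instead of concatenating several proteins.
    return tokenizer(
        examples[text_column_name],
        max_length=expanded_inputs_length,  # length the span-corruption collator expects
        padding="max_length",
        truncation=True,
        return_attention_mask=False,
    )

tokenized_datasets = datasets.map(
    tokenize_and_pad,
    batched=True,
    remove_columns=[text_column_name],
)
```

One open question with this: the span-corruption collator draws noise spans over the whole (now partly padded) sequence, so the masking would presumably have to be restricted to the non-padding positions, or the padding handled inside the collator.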
While looking into this, I also wondered whether there are examples that implement Google’s UL2 pre-training strategy for the t5_mlm_flax pre-training script (from my naive understanding, it would “only” require adjusting the random_spans_noise_mask).
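For the UL2 part, what I have in mind is something like the sketch below: sample one denoiser configuration per call and build the span mask from it. It assumes a standalone variant of random_spans_noise_mask that takes noise_density and mean_noise_span_length as arguments (in the script these are attributes of the data collator), and the configurations are only illustrative values in the spirit of the UL2 paper:

```python
import numpy as np

# Illustrative UL2-style mixture of denoisers (values not verified against the paper).
UL2_DENOISERS = [
    {"noise_density": 0.15, "mean_noise_span_length": 3.0},   # R-denoiser: regular T5-style corruption
    {"noise_density": 0.15, "mean_noise_span_length": 32.0},  # X-denoiser: long noise spans
    {"noise_density": 0.50, "mean_noise_span_length": 3.0},   # X-denoiser: heavy corruption
]

def ul2_noise_mask(length: int, rng: np.random.Generator):
    """Pick one denoiser configuration at random and build its span mask."""
    cfg = UL2_DENOISERS[rng.integers(len(UL2_DENOISERS))]
    return random_spans_noise_mask(length, **cfg)
```

Two things I am unsure about here: the S-denoiser (prefix-LM style) probably needs more than a mask change, and since the script precomputes fixed input/target lengths from a single noise_density, mixing densities would also require rethinking the (static) sequence lengths.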
As I will first pre-train with MLM on protein sequences and later fine-tune on a translation task, I also wondered whether there are any existing examples of Flax-based translation fine-tuning.

I have already searched a bit but did not find anything that exactly fits what I was looking for, so I just wondered whether I overlooked something. If not: totally fine, and I’ll give it a shot.