Hi everyone,
sorry if this ends up in the wrong forum (I am not proposing a Flax project but rather have questions about an ongoing effort), but this was the only Flax-/HF-related forum I found.
I am currently experimenting with pre-training a Flax version of T5 on protein sequences using MLM.
For this, I already adjusted the great example provided here and successfully trained for a few steps on a protein sequence database using a domain-specific tokenizer (for simplicity: one amino acid is one token).
Now I am trying to adjust the padding, because in its current version the script builds training examples by concatenating multiple sentences. This makes sense for NLP when the sentences come from the same document, but since protein sequences are usually unrelated, concatenating them will most likely cause problems (proteins split across multiple training examples, unrelated proteins fused into one input, etc.).
Instead of re-inventing the wheel, I wondered whether someone has already implemented something like fixed-length padding for the t5_mlm_flax pre-training script. From what I understand, JAX/Flax's static-shape requirement rules out dynamic batching, so ideally I would simply pad every shorter sequence to max_length instead of concatenating multiple proteins; a rough sketch of what I mean is below.
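To make this concrete, here is an untested sketch of what I would replace the `group_texts` concatenation step with (the function name is mine, `expanded_inputs_length` refers to the fixed length the script computes if I read it correctly, and I would still need to make sure the data collator handles noise spans that land on padding):

```python
# Sketch only: pad or truncate each protein to one fixed length instead of
# concatenating multiple proteins into one example.
def pad_to_fixed_length(examples, tokenizer, target_length):
    padded = []
    for ids in examples["input_ids"]:
        ids = ids[:target_length]  # truncate proteins that are too long
        ids = ids + [tokenizer.pad_token_id] * (target_length - len(ids))
        padded.append(ids)
    return {"input_ids": padded}

# Used in place of the group_texts mapping, e.g.:
# tokenized_datasets = tokenized_datasets.map(
#     lambda ex: pad_to_fixed_length(ex, tokenizer, expanded_inputs_length),
#     batched=True,
# )
```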
While looking into this, I also wondered whether there are examples that implement Google's UL2 pre-training strategy for the t5_mlm_flax pre-training script (from my naive understanding, it would "only" require adjusting random_spans_noise_mask).
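To make sure I am not over-simplifying: this is roughly the kind of change I have in mind (just a sketch; the denoiser configs are my approximate reading of the UL2 paper, and the prefix tokens and uniform sampling are assumptions on my side):

```python
import numpy as np

# Rough sketch of a UL2-style mixture of denoisers. Instead of the collator's fixed
# noise_density / mean_noise_span_length, a denoiser config would be sampled per
# example and used when building the span-corruption mask. The concrete values are
# only my reading of the UL2 paper, not taken from any existing script.
UL2_DENOISERS = [
    {"prefix": "[R]", "noise_density": 0.15, "mean_noise_span_length": 3.0},   # regular span corruption
    {"prefix": "[X]", "noise_density": 0.50, "mean_noise_span_length": 3.0},   # extreme: dense corruption
    {"prefix": "[X]", "noise_density": 0.15, "mean_noise_span_length": 32.0},  # extreme: long spans
    # The S-denoiser (sequential / prefix-LM) needs more than a different mask,
    # so I left it out here.
]

def sample_denoiser(rng: np.random.Generator) -> dict:
    """Pick one denoiser config uniformly (the paper uses specific mixture weights)."""
    return UL2_DENOISERS[rng.integers(len(UL2_DENOISERS))]
```

The collator would then use the sampled noise_density / mean_noise_span_length per example and prepend the corresponding prefix token (which would also have to be added to the tokenizer).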
As I will first pre-train with MLM on protein sequences and later fine-tune on a translation task, I also wondered whether there are any existing examples of Flax-based translation fine-tuning.
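In case it helps to see what I mean by the translation task, the data preparation I have in mind would look roughly like this (sketch only; the column names, checkpoint path, and lengths are placeholders, and the text_target argument assumes a reasonably recent transformers version):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint path for the tokenizer trained on protein sequences.
tokenizer = AutoTokenizer.from_pretrained("path/to/my-protein-t5-tokenizer")

def preprocess(examples, max_source_length=512, max_target_length=512):
    # "source" / "target" are placeholder column names for the translation pairs.
    model_inputs = tokenizer(
        examples["source"],
        max_length=max_source_length,
        padding="max_length",
        truncation=True,
    )
    labels = tokenizer(
        text_target=examples["target"],
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```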
I already searched a bit but did not find anything that exactly fits what I was looking for, so I just wondered whether I overlooked something. If not: totally fine, and I'll give it a shot myself.