Using training data for more than one task

samkphd31 · January 15, 2021, 2:23pm

Hello, I have a theoretical question regarding training a language model on the same documents across several tasks.

In a setting where the in-domain, specialized data is small (roughly 300MB of text files), I'd like to know whether there is any evidence against using the same data for both continual pre-training and downstream tasks such as text classification.
In that setting, every document in the corpus would first be used for training on the Masked Language Modeling (MLM) task, and would later also be used to fine-tune the model on the text classification task.
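To make the setup concrete, here is a minimal sketch in plain Python of how one document would yield both a masked-language-modeling example and a classification example. Everything in it is an assumption for illustration only: the whitespace tokenization, the `[MASK]` string, the 15% masking rate, the function names, and the `"treatment"` label are hypothetical and much simpler than what a real tokenizer or data collator does.

```python
import random

MASK = "[MASK]"  # assumed mask placeholder, for illustration only


def make_mlm_example(text, mask_prob=0.15, seed=0):
    """Randomly mask whitespace tokens; labels keep the original token
    at masked positions and None elsewhere (simplified MLM)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in text.split():
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def make_classification_example(text, label):
    """Pair the unmodified document with its class label."""
    return {"text": text, "label": label}


# The same document feeds both stages.
doc = "the patient was given a low dose of the drug"
mlm_inputs, mlm_labels = make_mlm_example(doc)                      # stage 1: continual pre-training
clf_example = make_classification_example(doc, label="treatment")  # stage 2: fine-tuning
```

The point of the sketch is only that `doc` appears unchanged on the input side of both stages, which is exactly the reuse I am asking about.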

Another, more general example would be T5, which is remarkably easy to use in a multitask fashion. Just by changing the task prefix at the beginning of the input, I can signal that the task has changed.
Could I have:
[Task1] doc1
and
[Task2] doc1
in the same training dataset? Is there evidence that such a setup leads to negative learning effects?
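A minimal sketch of what I mean, in plain Python with no library calls. The function name `build_multitask_examples`, the document text, and the two toy target functions are hypothetical and only there to show the data layout; real targets would come from actual annotations.

```python
def build_multitask_examples(docs, tasks):
    """Create one text-to-text example per (task prefix, document) pair,
    so every document appears once for each task."""
    examples = []
    for doc_id, text in docs.items():
        for prefix, target_fn in tasks.items():
            examples.append({
                "doc_id": doc_id,
                "input": f"{prefix} {text}",
                "target": target_fn(text),
            })
    return examples


docs = {"doc1": "Interest rates rose sharply this quarter."}

# Hypothetical targets: a fixed class label and a toy "first word" task.
tasks = {
    "[Task1]": lambda text: "finance",
    "[Task2]": lambda text: text.split()[0],
}

examples = build_multitask_examples(docs, tasks)
for ex in examples:
    print(ex["input"], "->", ex["target"])
```

Here `doc1` ends up in the training set twice, once under each prefix, which is the situation the question is about.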

P.S.: This is my first topic on the forum, so please do not hesitate to tell me if my question lacks clarity or is inappropriate in any way.
