Hello,
First post, motivated beginner.
Hugging Face is a very handy website, isn’t it? I love many of its features, especially the dataset viewer and query tool. I also like the JSON format.
I believe contextualization is a major problem in ML, at least in my line of work, which is translation services. I was wondering whether “enhanced” datasets could be read and processed by Hugging Face’s built-in tools and, above all, by models such as MarianMT and BERT.
By enhanced, I mean added contextualization. As far as I understand, models do not infer context on their own, so I suppose it wouldn’t make much sense to add parameters other than source and target. Still, my idea was to be able to easily reorganize my datasets into sub-datasets according to their context (and other criteria I haven’t thought of yet).
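For illustration, here’s roughly what I have in mind, as a minimal sketch. The file name pairs.jsonl and the source/target/context field names are just placeholders, not anything standard:

```python
# Minimal sketch: split an "enhanced" dataset into sub-datasets by context.
# Assumes a JSON Lines file with hypothetical source/target/context fields.
from datasets import load_dataset

dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

# One sub-dataset per distinct context value.
sub_datasets = {
    ctx: dataset.filter(lambda ex, c=ctx: ex["context"] == c)
    for ctx in sorted(set(dataset["context"]))
}
```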
If this isn’t possible, any other idea is welcome. Thanks to the community!
Hello!
I don’t know much about passing contextualized data directly to the model, but as you suspect, it’s probably not very effective.
However, I think the enhanced and modified dataset itself can still be used to train the model. The important thing is that it is in Dataset format just before it is passed to the model’s Trainer.
One way is to split the dataset into simple, independent subsets based on the contextual information. Another is to implement a dedicated DataCollator for the Trainer and use the dataset as is, context included: the collator converts the context into some format the model can understand just before each batch is passed to the model. Here is a sketch of the second approach.
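This is only a rough sketch under some assumptions: a seq2seq model like MarianMT, hypothetical source/target/context fields, and prepending the context as a plain-text prefix, which is just one possible convention:

```python
# Sketch of a dedicated DataCollator that folds the context into the
# source text right before tokenization. Field names are hypothetical.
from dataclasses import dataclass
from transformers import PreTrainedTokenizerBase

@dataclass
class ContextDataCollator:
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # Prepend the context as a plain-text prefix to each source sentence.
        sources = [f'{f["context"]}: {f["source"]}' for f in features]
        targets = [f["target"] for f in features]
        return self.tokenizer(
            sources,
            text_target=targets,
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
```

You would then pass it to the Trainer as data_collator=ContextDataCollator(tokenizer).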
Hi, and thanks for the valuable input,
So a good solution would be to keep an uncontextualized general dataset and build contextualized datasets by running a context-classification model such as bart-large-mnli over it in a pipeline. I’d rather choose this solution than the DataCollator for the time being, because my models are untrained and my datasets are incomplete.
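Here’s a rough sketch of that pipelining idea, assuming the dataset loaded earlier and a handful of candidate context labels (the labels below are placeholders):

```python
# Sketch: tag each example with a context label via zero-shot classification;
# the tagged dataset can then be split into contextualized sub-datasets.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification", model="facebook/bart-large-mnli"
)
candidate_contexts = ["legal", "medical", "technical", "casual"]  # placeholders

def add_context(example):
    result = classifier(example["source"], candidate_labels=candidate_contexts)
    example["context"] = result["labels"][0]  # highest-scoring label
    return example

contextualized = dataset.map(add_context)
```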
Any idea which model would be the best fit? I used bart-large-mnli, but the first (untrained) tests were not satisfying; DeBERTa seems a good fit, though. I want the broadest possible range of contexts, and I am not interested in other linguistic aspects such as style or tone for now. I believe those can be contextualized to a certain extent anyway.