Hello,
First post, motivated beginner.
Hugging Face is a very handy website, isn’t it? I love many of its features, especially the dataset viewer and query tool. I also like the JSON format.
I believe contextualization is a major problem in ML, at least in my line of work, which is translation services. I was wondering whether “enhanced” datasets could be read and processed by Hugging Face’s built-in tools and, above all, by models such as MarianMT and BERT.
By enhanced, I mean added contextualization. As far as I understand, models do not infer context on their own, so I suppose it wouldn’t make much sense to add parameters other than source and target. Still, my idea was to be able to easily reorganize my datasets into sub-datasets according to their context (and other criteria I haven’t thought of yet).
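For illustration, here’s roughly what I have in mind, as a minimal sketch. The file name pairs.jsonl and the source/target/context field names are just placeholders, not anything standard:

```python
# Minimal sketch: split an "enhanced" dataset into sub-datasets by context.
# Assumes a JSON Lines file with hypothetical source/target/context fields.
from datasets import load_dataset

dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

# One sub-dataset per distinct context value.
sub_datasets = {
    ctx: dataset.filter(lambda ex, c=ctx: ex["context"] == c)
    for ctx in sorted(set(dataset["context"]))
}
```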
If this isn’t possible, any other idea is welcome. Thanks to the community!
Hello!
I don’t know much about passing contextualized data directly to the model, but as you suspect, it’s probably not very effective.
However, I think the enhanced and modified dataset itself can still be used to train the model. The important thing is that it is in Dataset format just before it is passed to the model’s Trainer.
One way is to split the dataset into simple, independent subsets based on the contextual information. Another is to implement a dedicated DataCollator for the Trainer and use the dataset as is, context included: the collator converts the context into some format the model can understand just before each batch is passed to the model. Here is a sketch of the second approach.
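This is only a rough sketch under some assumptions: a seq2seq model like MarianMT, hypothetical source/target/context fields, and prepending the context as a plain-text prefix, which is just one possible convention:

```python
# Sketch of a dedicated DataCollator that folds the context into the
# source text right before tokenization. Field names are hypothetical.
from dataclasses import dataclass
from transformers import PreTrainedTokenizerBase

@dataclass
class ContextDataCollator:
    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # Prepend the context as a plain-text prefix to each source sentence.
        sources = [f'{f["context"]}: {f["source"]}' for f in features]
        targets = [f["target"] for f in features]
        return self.tokenizer(
            sources,
            text_target=targets,
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
```

You would then pass it to the Trainer as data_collator=ContextDataCollator(tokenizer).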
Hi, and thanks for the valuable input,
So a good solution would be to keep an uncontextualized general dataset and build contextualized datasets by running a context-classification model such as bart-large-mnli over it in a pipeline. I’d rather choose this solution than the DataCollator for the time being, because my models are untrained and my datasets are incomplete.
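Here’s a rough sketch of that pipelining idea, assuming the dataset loaded earlier and a handful of candidate context labels (the labels below are placeholders):

```python
# Sketch: tag each example with a context label via zero-shot classification;
# the tagged dataset can then be split into contextualized sub-datasets.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification", model="facebook/bart-large-mnli"
)
candidate_contexts = ["legal", "medical", "technical", "casual"]  # placeholders

def add_context(example):
    result = classifier(example["source"], candidate_labels=candidate_contexts)
    example["context"] = result["labels"][0]  # highest-scoring label
    return example

contextualized = dataset.map(add_context)
```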
Any idea which model would be the best fit? I used bart-large-mnli, but the first (untrained) tests were not satisfying; DeBERTa seems a good fit, though. I want the broadest possible range of contexts, and I am not interested in other linguistic aspects such as style or tone for now. I believe those can be contextualized to a certain extent anyway.