Hello Hugging Face experts,
I am trying to solve a very specific domain problem. I have a project where I must take input text that contains domain-specific codes and translate it to a particular output.
I am using ABC and FOO just to convey the basic idea here, because I do not want to distract from the problem statement. My actual domain is electrical engineering data, with words like Port, VSS, GND, CHA, etc. I have a PDF document specification that explains these words. Do you think I need to feed this into the fine-tuning flow?
Below are examples of an input text and the output result I am trying to produce from the model (input text → desired output), for example:
! Port[1] = ABCZ4_FOO1
-model-> ABC_c_FOO1
or
! Port[4] = ABC4_FOO1
-model-> ABC_t_FOO1
There are many variations of this input text; in some cases it would be ABC3 instead of ABCZ4, in which case the input/output would be:
! Port[1] = ABC3_FOO1
-model-> ABC_c_FOO1
What I am trying to figure out is the best transformer architecture, base model, and training task. Should I use translation, masking (masked language modeling), or something else?
You can see that the input consists of individual character tokens, and the sequence of characters provides additional context. My instinct is that a transformer makes sense here because "attention" over the specific character tokens is needed to translate them to an output.
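To make that intuition concrete, here is a minimal sketch of how I am picturing the data as a character-level seq2seq (translation-style) task. The byte-level tokenization below is just a stand-in for what a byte-level model such as ByT5 would see; the model choice is only my current guess, not something I have validated:

```python
def byte_tokens(text: str) -> list[int]:
    """Byte-level 'tokenizer': every byte of the string becomes one token,
    so the model can attend over individual characters, which is the
    property I think this task needs."""
    return list(text.encode("utf-8"))

# One of the input/output pairs from above, framed as seq2seq.
src = "! Port[1] = ABCZ4_FOO1"
tgt = "ABC_c_FOO1"

src_ids = byte_tokens(src)
tgt_ids = byte_tokens(tgt)
print(len(src_ids), len(tgt_ids))  # prints: 22 10
```

Each training example would then just be the (src_ids, tgt_ids) pair fed to an encoder-decoder model, which is why translation-style fine-tuning feels like the natural framing to me.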
Here is a snapshot of a completed input/output table. I am going to use this data to train the model. There are a ton of possible variations of the input tokens combined with different surrounding text. BTW, I have about 700 training input/output samples. Do you think this would be enough, or would I need to generate more?
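If 700 samples turn out to be too few, one option I am considering is generating synthetic variations. Below is a hypothetical generator; note that the PORT_SUFFIX mapping (1 → "c", 4 → "t") is inferred purely from the two examples above as an illustration, and the real rule would come from my spec:

```python
import random

# Hypothetical mapping from port index to output suffix. This is only
# inferred from the two examples above (Port[1] -> _c, Port[4] -> _t);
# the real rule lives in my PDF spec.
PORT_SUFFIX = {1: "c", 4: "t"}

def make_pair(port: int, variant: str, foo_idx: int) -> tuple[str, str]:
    """Build one synthetic (input, output) training pair."""
    src = f"! Port[{port}] = {variant}_FOO{foo_idx}"
    tgt = f"ABC_{PORT_SUFFIX[port]}_FOO{foo_idx}"
    return src, tgt

def generate(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate n synthetic pairs over the code variants I have observed."""
    rng = random.Random(seed)
    variants = ["ABCZ4", "ABC4", "ABC3"]  # observed input-code variants
    return [
        make_pair(rng.choice(list(PORT_SUFFIX)),
                  rng.choice(variants),
                  rng.randint(1, 9))
        for _ in range(n)
    ]

for src, tgt in generate(3):
    print(src, "->", tgt)
```

Even if the exact rule differs, this kind of template-based generation would let me scale well beyond 700 pairs while keeping the labels consistent.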
Thanks!