Hello Hugging Face experts,
I am trying to solve a very specific domain problem. I have a project where I must take input text that contains domain-specific codes and translate it to a particular output.
I am using ABC and FOO just to convey the basic idea here, because I do not want to distract from the problem statement. My actual domain is electrical engineering data, with words like Port, VSS, GND, CHA, etc. I have a PDF document specification that explains these words. Do you think I need to feed this into the fine-tuning flow?
Below are examples of an input text and the output result I am trying to produce from the model (input text → desired output), for example:
! Port[1] = ABCZ4_FOO1
-model-> ABC_c_FOO1
or
! Port[4] = ABC4_FOO1
-model-> ABC_t_FOO1
There are many variations of this input text; in some cases it would be ABC3 instead of ABCZ4, in which case the input/output would be:
! Port[1] = ABC3_FOO1
-model-> ABC_c_FOO1
What I am trying to figure out is the best transformer architecture, base model, and training task. Should I use translation, masking (masked language modeling), or something else?
You can see that the input consists of individual character tokens, and the sequence of characters provides additional context. My instinct is that a transformer makes sense here because "attention" over the specific character tokens is needed to translate them to an output.
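To make that intuition concrete, here is a minimal sketch of how I am picturing the data as a character-level seq2seq (translation-style) task. The byte-level tokenization below is just a stand-in for what a byte-level model such as ByT5 would see; the model choice is only my current guess, not something I have validated:

```python
def byte_tokens(text: str) -> list[int]:
    """Byte-level 'tokenizer': every byte of the string becomes one token,
    so the model can attend over individual characters, which is the
    property I think this task needs."""
    return list(text.encode("utf-8"))

# One of the input/output pairs from above, framed as seq2seq.
src = "! Port[1] = ABCZ4_FOO1"
tgt = "ABC_c_FOO1"

src_ids = byte_tokens(src)
tgt_ids = byte_tokens(tgt)
print(len(src_ids), len(tgt_ids))  # prints: 22 10
```

Each training example would then just be the (src_ids, tgt_ids) pair fed to an encoder-decoder model, which is why translation-style fine-tuning feels like the natural framing to me.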
Here is a snapshot of a completed input/output table. I am going to use this data to train the model. There are a ton of possible variations of the input tokens combined with different surrounding text. BTW, I have about 700 training input/output samples. Do you think this would be enough, or would I need to generate more?
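If 700 samples turn out to be too few, one option I am considering is generating synthetic variations. Below is a hypothetical generator; note that the PORT_SUFFIX mapping (1 → "c", 4 → "t") is inferred purely from the two examples above as an illustration, and the real rule would come from my spec:

```python
import random

# Hypothetical mapping from port index to output suffix. This is only
# inferred from the two examples above (Port[1] -> _c, Port[4] -> _t);
# the real rule lives in my PDF spec.
PORT_SUFFIX = {1: "c", 4: "t"}

def make_pair(port: int, variant: str, foo_idx: int) -> tuple[str, str]:
    """Build one synthetic (input, output) training pair."""
    src = f"! Port[{port}] = {variant}_FOO{foo_idx}"
    tgt = f"ABC_{PORT_SUFFIX[port]}_FOO{foo_idx}"
    return src, tgt

def generate(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate n synthetic pairs over the code variants I have observed."""
    rng = random.Random(seed)
    variants = ["ABCZ4", "ABC4", "ABC3"]  # observed input-code variants
    return [
        make_pair(rng.choice(list(PORT_SUFFIX)),
                  rng.choice(variants),
                  rng.randint(1, 9))
        for _ in range(n)
    ]

for src, tgt in generate(3):
    print(src, "->", tgt)
```

Even if the exact rule differs, this kind of template-based generation would let me scale well beyond 700 pairs while keeping the labels consistent.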
Thanks!