Seq2seq predictions are decent, but come out letter by letter instead of as words

I'm running an inference snippet based on a seq2seq transformer model from the Hugging Face hub.

The code is essentially the same as this (aside from the custom dataset), which means the pre- and post-processing steps are identical.

The tokenizer is the model's (first link above), and the model is just the original fine-tuned on my custom dataset (before running the predictions, of course).
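
For context, a minimal sketch of the inference setup, assuming the standard transformers API; the checkpoint names below are placeholders for the original model and my fine-tuned copy, not the actual ones from the links:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Tokenizer from the original checkpoint (placeholder name),
# weights from the copy fine-tuned on the custom dataset (placeholder path)
tokenizer = AutoTokenizer.from_pretrained("original/seq2seq-model")
model = AutoModelForSeq2SeqLM.from_pretrained("./my-finetuned-model")

inputs = tokenizer("some input text", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(output_ids))
```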

What is perplexing: on running predictions I get the two outputs below, which are actually not bad if we remove the special tokens (<...>) and join the characters (see the sketch after the outputs)… e.g. xss injection malicious is a pretty good keyphrase. Question: why the letter-by-letter output and the interspersed special tokens? I'm missing something fundamental here.

["<s><s><s>x", "s", "s<category>,", "i", "n", "j", "e", "c", "t", " ", "m", "a", "l", "i<category>c", "i<header>o", "u", "s<infill> ", "c<infill>o", "d", "e<header> ", "i<infill>n", "t<category>o", " <category>l", "o", "n<category>g", "e<category>r", " <infill>s", "u<category>p", "p", "o<category>rt", "e<infill>d", " <seealso>", "s<header>", "s<seealso> ", "f", "i<seealso>", "l<infill>e", " ", "v", "a<infill>", "m<infill>", ""]

["<s><s><s>x", "s", "s<category>,", "c", "r", "o", "s<infill>s", " ", "s<header>i", "t", "e", " <category>s", "c<category>r", "i", "p", "t<header>i<category>n", "g", ",", "i<category>m", "pr", "r<category>o", "p<category>e", "r<infill> ", "u", "s<seealso>e", "d", " <infill>i", "n", "p<infill>u", "t<category> ", "v", "a", "l", "i<infill>d", "a<infill>t", "i<header>", "n<category>", "a<seealso>", "m", "e<category>m<category>", "o<present>"]
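
To show what I mean by stripping the special tokens and joining, a quick sketch in plain Python over the first few tokens of the first output:

```python
import re

# First tokens of the first prediction above
tokens = ["<s><s><s>x", "s", "s<category>,", "i", "n", "j", "e", "c", "t",
          " ", "m", "a", "l", "i<category>c", "i<header>o", "u", "s<infill> "]

# Drop anything that looks like a special token, then concatenate the fragments
cleaned = "".join(re.sub(r"<[^>]*>", "", t) for t in tokens)
print(cleaned)  # -> "xss,inject malicious "
```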

There seems to be something skewed with the outputs (logits) of the model fine-tuned on my data, before the inference (.predict()) step; they already come out like this. I have to go check the .train() step and the fine-tuned model it produces.

So it turns out the custom dataset I used was not properly "huggified" and was not valid JSON; on processing, the strings had spaces added between the letters, which produced a model trained on 'spaced-out' words.
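
For anyone hitting the same thing: the failure mode amounts to treating a string as a sequence of tokens during preprocessing, so each character becomes its own "word" in the training file. A sketch of the bug and a sanity check worth running on a few dataset samples before training (the join here is illustrative, not my actual pipeline code):

```python
def looks_spaced_out(sample: str) -> bool:
    """Heuristic: True if most whitespace-separated 'words' are single characters."""
    words = sample.split()
    return bool(words) and sum(len(w) == 1 for w in words) / len(words) > 0.5

text = "xss injection malicious"
broken = " ".join(text)  # iterating a string yields characters: "x s s   i n j ..."

print(looks_spaced_out(text))    # False -> looks like normal words
print(looks_spaced_out(broken))  # True  -> letters got spaced out
```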