I'm running an inference snippet based on a seq2seq transformer model from the Hugging Face hub.
The code is essentially the same as this (aside from the custom db).
This means the pre- and post-processing steps in the code are exactly the same.
The tokenizer is the model's (first link above), and the model is the original, fine-tuned on my custom db (before running the predictions, of course).
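
Roughly, the setup looks like this (a minimal sketch; the checkpoint names are placeholders for my actual base model and fine-tuned weights):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Tokenizer comes from the original base model (first link above);
# the weights are my checkpoint fine-tuned on the custom db.
tokenizer = AutoTokenizer.from_pretrained("base-model-name")              # placeholder
model = AutoModelForSeq2SeqLM.from_pretrained("my-finetuned-checkpoint")  # placeholder
```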
What is perplexing: on running predictions I get the output below, which is not bad if we remove those special tokens (<…>) and join the pieces… e.g. "xss injection malicious" is a pretty good keyphrase. Question: why the letter-by-letter output and the interleaved special tokens? I'm missing something fundamental here.
["<s><s><s>x", "s", "s<category>,", "i", "n", "j", "e", "c", "t", " ", "m", "a", "l", "i<category>c", "i<header>o", "u", "s<infill> ", "c<infill>o", "d", "e<header> ", "i<infill>n", "t<category>o", " <category>l", "o", "n<category>g", "e<category>r", " <infill>s", "u<category>p", "p", "o<category>rt", "e<infill>d", " <seealso>", "s<header>", "s<seealso> ", "f", "i<seealso>", "l<infill>e", " ", "v", "a<infill>", "m<infill>", ""]
["<s><s><s>x", "s", "s<category>,", "c", "r", "o", "s<infill>s", " ", "s<header>i", "t", "e", " <category>s", "c<category>r", "i", "p", "t<header>i<category>n", "g", ",", "i<category>m", "pr", "r<category>o", "p<category>e", "r<infill> ", "u", "s<seealso>e", "d", " <infill>i", "n", "p<infill>u", "t<category> ", "v", "a", "l", "i<infill>d", "a<infill>t", "i<header>", "n<category>", "a<seealso>", "m", "e<category>m<category>", "o<present>"]