Hi there, I’m trying to use AutoTrain in Spaces to fine-tune a model. The model is the 1B from UnfilteredAI, and it seems it is using zephyr as its chat template. I created my dataset in the form:
```json
{"chat": [{"role": "user", "content": "the user text"}, {"role": "assistant", "content": "the assistant text"}]}
{"chat": [{"role": "user", "content": "the user text"}, {"role": "assistant", "content": "the assistant text"}]}
…
```
So it is a JSONL file. I mapped the text column to `chat` in the web interface and set the chat template to `tokenizer`.
I ran my training and tried to use the result locally with a pipeline and with Gradio. The outcome is always the same: the assistant response starts with `<|im_start|>Assistant\n`, and sometimes I even receive a continuation of the conversation with an automatically generated `<|im_start|>User\n`.
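
For context, this is more or less how I test locally; `./my-finetuned-1b` is a placeholder for my trained checkpoint, and I’m relying on the fact that recent transformers text-generation pipelines accept a list of chat messages and apply the tokenizer’s chat template themselves:

```python
from transformers import pipeline

# "./my-finetuned-1b" is a placeholder for the fine-tuned checkpoint.
pipe = pipeline("text-generation", model="./my-finetuned-1b")

messages = [{"role": "user", "content": "the user text"}]
out = pipe(messages, max_new_tokens=128)

# With chat input, generated_text holds the whole conversation; the last
# message is the new assistant reply, and its content comes back starting
# with "<|im_start|>Assistant\n" instead of plain text.
print(out[0]["generated_text"][-1]["content"])
```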
Before you tell me that I need to postprocess the answer: if I use the original UnfilteredAI 1B model, I do not have this issue.
Weirdly enough, I used a second dataset and I do not have the same issue.
Now, I generated both datasets myself; they are in the same format, with different lengths and different texts, but the format is the same, generated by the same code. I simply took some texts from some epub files and split them into user/assistant conversations.
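
For what it’s worth, the splitting code is essentially this sketch (the epub parsing itself is omitted; `paragraphs` stands in for the list of text chunks extracted from the book):

```python
import json

def write_chat_jsonl(paragraphs, out_path):
    """Pair consecutive text chunks into user/assistant turns, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for user_text, assistant_text in zip(paragraphs[::2], paragraphs[1::2]):
            record = {"chat": [
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```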
Small update: there must be something wrong with the way the chat template is handled in tokenizer_config.json. The model’s chat format is ChatML, but I get the issue if I use `tokenizer` as the chat template. If I use `chatml` instead, then inference works. How is it possible, though, that the chat template is applied one way by the tokenizer at training time and another way at inference time?
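
For anyone who wants to reproduce the check: this dumps the template that AutoTrain saved into tokenizer_config.json and what it renders for one of my training rows (the checkpoint path is a placeholder):

```python
from transformers import AutoTokenizer

# "./my-finetuned-1b" is a placeholder for the trained checkpoint.
tok = AutoTokenizer.from_pretrained("./my-finetuned-1b")

messages = [
    {"role": "user", "content": "the user text"},
    {"role": "assistant", "content": "the assistant text"},
]

# The Jinja template stored in tokenizer_config.json.
print(repr(tok.chat_template))

# What that template actually renders for a two-turn conversation.
print(tok.apply_chat_template(messages, tokenize=False))
```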