Encode tokens without spaces between them

ron5569 · May 9, 2024, 11:03am

I’m working with an LLM that generates files in unified diff format. However, in some cases, the LLM generates invalid output due to spaces between tokens.

For example:

--- EventsLiteTestKit.scala 2024-05-31 07:00:00
+++ EventsLiteTestKit.scala 2024-05-31 07:00:01
@@ -39,10 +39,10 @@

 import java.util.UUID
 import scala.collection.concurrent.TrieMap
- import scala.concurrent.Future
+ import scala.concurrent.{Future, _}
 import scala.concurrent.Future.{failed, successful}

 class EventsLiteTestKit {

This is not a valid patch file because of the space between the ‘-’ character and the word ‘import’.
Any ideas on how to force the model to generate output with no space after the ‘-’ and ‘+’ symbols?
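
If the model cannot be steered directly, one possible workaround is to post-process its output before applying the patch. The sketch below is only a heuristic, not a general fix: the function name `strip_marker_space` is hypothetical, and it assumes that every added or removed line in a hunk carries exactly one spurious space after the marker, as in the example above. Source lines that really do start with a space would lose that space, so it may be worth checking the result against the original file.

```python
def strip_marker_space(diff_text: str) -> str:
    """Drop one space after a leading '+' or '-' on hunk body lines.

    Heuristic sketch: assumes the model put exactly one extra space
    after the marker on every added/removed line.
    """
    fixed = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Hunk header: body lines follow from here on.
            in_hunk = True
        elif line.startswith(("--- ", "+++ ")):
            # Assumed to be a file header, where the space is legitimate.
            pass
        elif in_hunk and line[:2] in ("+ ", "- "):
            # Keep the marker, remove the single space that follows it.
            line = line[0] + line[2:]
        fixed.append(line)
    return "\n".join(fixed) + "\n"


if __name__ == "__main__":
    sample = (
        "--- EventsLiteTestKit.scala\n"
        "+++ EventsLiteTestKit.scala\n"
        "@@ -1,3 +1,3 @@\n"
        " import java.util.UUID\n"
        "- import scala.concurrent.Future\n"
        "+ import scala.concurrent.{Future, _}\n"
    )
    print(strip_marker_space(sample), end="")
```

Context lines keep their single leading space, since that space is the marker itself in unified diff format.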
