How to ensure that the escapes for double quotes (\") inside the 'user content' of training datasets are not removed?

Hi,

I want to avoid a format like \\\"key\\\" (or any other extra escaping) being used during the training of a model, because the answers of the trained model should not contain additional escapes such as \\\"key\\\"; they should contain only \".

That is why I want to ensure that the input data for training is not modified by the load_dataset function from the Hugging Face datasets library.
The escapes for the double quotes (\") inside the user content of the training datasets should not be removed by the datasets library.

  • Here is an example of the input format:
{"messages": [{"role": "system", "content": "my instructions"}, {"role": "user", "content": "my question"}, {"role": "assistant", "content": "```json\n{\"key\":\"mykey\",\"value\":\"myvalue\"}\n```"}]}
  • Using the load_dataset function:
from datasets import load_dataset

train_dataset = load_dataset('json', data_files='my_input_file.json', field='messages', split="train")
  • Printing the result:
for data in train_dataset:
    print(f"\n{data}")

The resulting format in datasets output is:

{'messages': [{'role': 'system', 'content': 'my instructions'}, {'role': 'user', 'content': 'my question'}, {'role': 'assistant', 'content': '```json\n{"key": "mykey", "value": "myvalue"}\n```'}]}
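
For reference, the same effect can be reproduced with Python's standard json module alone, without the datasets library. This is only a minimal sketch: it assumes that load_dataset decodes JSON strings the same way json.loads does (I have not checked this in the library source), and the variable names are made up for illustration.

```python
import json

# One message as it could appear in the JSON file (raw string, so the
# backslashes below are exactly the characters stored on disk).
raw = r'{"role": "assistant", "content": "```json\n{\"key\":\"mykey\",\"value\":\"myvalue\"}\n```"}'

message = json.loads(raw)
content = message["content"]

# In the file, \" is JSON syntax for a double quote inside a string.
# After decoding, the string holds plain double quotes and no backslashes.
print(repr(content))
print("backslash in decoded content:", "\\" in content)

# Serializing the decoded message again puts the escapes back.
print(json.dumps(message))
print("round trip equals original text:", json.dumps(message) == raw)
```

If that assumption holds, the escapes are not lost on the way back out: they reappear as soon as the data is written as JSON again.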

But I would like to ensure that the escapes for the double quotes (\") inside the user content are not removed by the datasets library.

  • I want to have this format:
'```json\n{\"key\": \"mykey\", \"value\": \"myvalue\"}\n```'
  • and not this:
'```json\n{"key": "mykey", "value": "myvalue"}\n```'
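
To make the difference between the two formats concrete, here is a small sketch using only the standard json module. It assumes the goal is a string that literally contains backslash characters. `add_literal_escapes` is a hypothetical helper written for this post, not a function of the datasets library.

```python
import json

def add_literal_escapes(text: str) -> str:
    """Hypothetical helper: put a literal backslash before every double quote."""
    return text.replace('"', '\\"')

# The content as it looks after loading (plain double quotes).
decoded = '```json\n{"key": "mykey", "value": "myvalue"}\n```'

escaped = add_literal_escapes(decoded)

# The text now really contains backslash + quote as two characters.
print(escaped)

# Written back to JSON, each backslash and each quote is escaped again,
# which produces the \\\" form.
print(json.dumps(escaped))
```

So forcing literal backslashes into the content seems to lead straight to the \\\"key\\\" form once the data is serialized again, which is exactly what I would like to avoid.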

Has anyone been in the same situation and found a solution? Any ideas would be awesome.