Hi,
I want to avoid that an expression like "... \\\"key\\\"" or a similar format is used during the "training" of a model, because the answers of the trained model should not contain any additional escapes such as '\\\"key\\\"'. They should contain only '\"'.
That is why I want to ensure that the load_dataset function from Hugging Face datasets does not modify the input data for the training.
The escapes for the double quotes (") inside the user content of the training datasets should not be removed by the datasets library.
- Here is an example of the input format:
{"messages": [{"role": "system", "content": "my instructions"}, {"role": "user", "content": "my question"}, {"role": "assistant", "content": "```json\n{\"key\":\"mykey\",\"value\":\"myvalue\"}\n```"}]}
- Loading it with the load_dataset function:
from datasets import load_dataset
train_dataset = load_dataset('json', data_files='my_input_file.json', field='messages', split="train")
- Printing the result:
for data in train_dataset:
    print(f"\n{data}")
The resulting output from datasets is:
{'messages': [{'role': 'system', 'content': 'my instructions'}, {'role': 'user', 'content': 'my question'}, {'role': 'assistant', 'content': '```json\n{"key": "mykey", "value": "myvalue"}\n```'}]}
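To check where the escapes disappear, here is a minimal sketch that uses only Python's standard json module, without datasets. The variable names are mine and the content is shortened to the JSON object. As far as I can tell, the backslashes are part of the JSON syntax in the file, so any JSON parser drops them and the value in memory holds only plain double quotes:

```python
import json

# One shortened record as it would appear in the input file.
# The raw string keeps the backslashes exactly as written on disk.
raw_line = r'{"messages": [{"role": "assistant", "content": "{\"key\":\"mykey\",\"value\":\"myvalue\"}"}]}'

record = json.loads(raw_line)
content = record["messages"][0]["content"]

# After parsing, the string holds plain double quotes and no backslashes.
print(content)
has_backslash = "\\" in content
print(has_backslash)
```

So the output above may simply be the decoded string rather than a modification made by load_dataset, but I am not certain that this is the whole story.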
But I would like to ensure that the escapes for the double quotes (") inside the user content are not removed by the datasets library.
- I want to have this format:
'```json\n{\"key\": \"mykey\", \"value\": \"myvalue\"}\n```'
- and not this:
'```json\n{"key": "mykey", "value": "myvalue"}\n```'
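One workaround I can imagine, as a sketch only: re-escape the content myself after loading, using json.dumps from the standard library. The helper name escape_quotes is hypothetical and is not part of the datasets library. I assume it could be applied per record, for example through Dataset.map, but I have not verified that this is a good idea for training:

```python
import json

def escape_quotes(text: str) -> str:
    # json.dumps writes the string as a JSON literal, which escapes
    # double quotes and newlines; [1:-1] strips the surrounding quotes.
    return json.dumps(text, ensure_ascii=False)[1:-1]

content = '{"key": "mykey", "value": "myvalue"}'
escaped = escape_quotes(content)
print(escaped)

# Parsing the escaped text as a JSON string gives back the original content.
round_trip = json.loads('"' + escaped + '"')
```

Note that this also turns real newlines into a literal backslash followed by n, which may or may not be what the training data should contain.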
Has anyone had the same situation and found a solution? Any ideas would be awesome.