How to ensure that the escapes for the double quotes '\"' inside the 'user content' for the training datasets will not be removed?

tnms · April 11, 2024, 12:00pm

Hi,

I want to avoid this expression is used "... \\\"key\\\" or other format is used during the “training” of a model, because the answers of the trained model later shouldn’t contain any additional escapes '\\\"key\\\"' they should just contain only '\"' .

That is the reason why I want to ensure that the input data for the training will not be modified by the function load_dataset from huggingface datasets.
The escapes for the double quotes ‘"’ inside the user content for the training datasets should not be removed by the datasets library.

Here is an example of the input format:

"{\"messages": [{"role": "system", "content": "my instructions"}, {"role": "user", "content": "my question"}, {"role": "assistant", "content": "```json\n{\"key\":\"mykey\",\"value\":\"myvalue\"}"\n```}]}"

Using the load_dataset function:

load_dataset('json', data_files='my_input_file.json', field='messages', split="train")

The result using write the

for data in train_dataset:
    print(f"\n{data}")

The resulting format in datasets output is:

{'messages': [{'role': 'system', 'content': 'my instructions'}, {'role": 'user', 'content': 'my question'}, {'role": 'assistant', 'content': '```json\n{"key": "mykey", "value": "myvalue"}\n```'}]}

But I like to ensure that the escapes for the double quotes ‘"’ inside the user content will not be removed by the datasets library.

I want to have this format.

'```json\n{\"key\": \"mykey\", \"value\": \"myvalue\"}\n```'

and not this:

'```json\n{"key\": "mykey", "value": "myvalue"}\n```'

Any idea, if there is someone who had the same situation and has a solution that would be awesome?

Topic		Replies	Views
UTF-16 for datasets? 🤗Datasets	4	1383	June 21, 2023
Creating QA Data Set for Distilbert 🤗Datasets	1	700	February 17, 2023
Get_dataset_config_names not getting desired output (and DatasetGenerationError) 🤗Datasets	5	92	December 11, 2024
Dataset loses format (/n) Beginners	0	113	April 27, 2024
Problem reading my own dataset 🤗Datasets	0	206	May 26, 2024

How to ensure that the escapes for the double quotes '\"' inside the 'user content' for the training datasets will not be removed?

Related topics