Batch Transform and accents in the json file

pierreguillou · December 13, 2021, 5:47pm

Hi.

I used the notebook lab2_batch_transform.ipynb by @philschmid to launch a batch transform job for inference. My dataset consists of texts in Portuguese (i.e., with accents).

When I download the JSON file that was created and open it with Sublime Text or note, I see that all accented letters were converted as follows:

(...)
{"inputs": "forma\u00e7\u00e3o ...}
(...)

In this example, forma\u00e7\u00e3o is formação.
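
A quick way to check this is sketched below, using only Python's standard json module. The sample line is a hypothetical, shortened version of the record above (the elided part is left out), so treat it as an illustration rather than my actual data:

```python
import json

# One JSON Lines record with accents written as \uXXXX escapes.
# The raw string keeps the backslashes exactly as they appear in the file.
# (Hypothetical sample: a shortened version of the record shown above.)
escaped_line = r'{"inputs": "forma\u00e7\u00e3o"}'

# json.loads turns the escapes back into the real characters.
record = json.loads(escaped_line)
print(record["inputs"])  # formação
```

So the escaped form and the accented form describe the same string once the JSON is parsed.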

What do you think? Can I use my JSON file as is, or do I need to solve this problem in order to get (real) accented letters in my JSON file? Thanks.

philschmid · December 14, 2021, 8:32am

You can encode your json file correctly using

import json

with open('keys.json', encoding='utf-8') as fh:
    data = json.load(fh)

pierreguillou · December 14, 2021, 1:17pm

Thanks @philschmid. I tested your code with encoding='utf-8', but it did not change the content of my JSON file, which still shows strange letters instead of accented letters.

However, I just found the complementary code in this post that solves my problem: ensure_ascii=False as an argument of json.dump().

Here is the code I use now (taken from the notebook lab2_batch_transform.ipynb and modified with the 2 arguments cited above):

import csv
import json

with open(dataset_csv_file, "r+") as infile, open(dataset_jsonl_file, "w+", encoding='utf-8') as outfile:
    reader = csv.DictReader(infile)
    for row in reader:
        # ensure_ascii=False writes accented letters as-is instead of \uXXXX escapes
        json.dump(row, outfile, ensure_ascii=False)
        outfile.write('\n')
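
To illustrate what ensure_ascii changes, here is a minimal sketch with a hypothetical one-row sample (the row dict is an assumption, not data from the notebook). It suggests that both forms are valid JSON and decode to the same data, so the difference is mainly about how readable the file is in a text editor:

```python
import json

# Hypothetical sample row, standing in for one row of the CSV file.
row = {"inputs": "formação"}

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(row)
# With ensure_ascii=False the accented letters are written as-is.
readable = json.dumps(row, ensure_ascii=False)

print(escaped)   # {"inputs": "forma\u00e7\u00e3o"}
print(readable)  # {"inputs": "formação"}

# Both strings parse back to exactly the same Python object.
same_data = json.loads(escaped) == json.loads(readable) == row
print(same_data)  # True
```

When writing with ensure_ascii=False, the output file should be opened with encoding='utf-8' (as in the code above) so that the accented letters are stored correctly.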
