Batch Transform and accents in the json file

Hi.

I used the notebook lab2_batch_transform.ipynb of @philschmid to launch a batch transform job for inference. My dataset is composed of texts in Portuguese (i.e., with accents).

When I download the json file that was created and open it with Sublime Text or a notepad editor, I see that all letters with accents were converted to Unicode escape sequences, as follows:

(...)
{"inputs": "forma\u00e7\u00e3o ...}
(...)

In this example, forma\u00e7\u00e3o is formação.
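A quick way to check this, as a minimal sketch that uses only the standard json module (the two sample lines are made up for illustration and are not taken from the real file): the escaped form and the accented form should parse to the same Python string.

```python
import json

# The escaped form, as it appears in the file (raw string keeps the backslashes).
escaped_line = r'{"inputs": "forma\u00e7\u00e3o"}'

# The same record written with literal accented characters.
literal_line = '{"inputs": "formação"}'

decoded_escaped = json.loads(escaped_line)
decoded_literal = json.loads(literal_line)

print(decoded_escaped["inputs"])
print(decoded_escaped == decoded_literal)
```

If both lines decode to the same dictionary, the difference is only in how the file looks in a text editor, not in the data a JSON parser reads from it.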

What do you think? Can I use my json file as it is, or do I need to solve this problem in order to get (real) letters with accents in my json file? Thanks.

You can read your json file with the correct encoding using

import json

with open('keys.json', encoding='utf-8') as fh:
    data = json.load(fh)

Thanks @philschmid. I tested your code with encoding='utf-8', but it did not change the content of my json file, which still shows escape sequences instead of letters with accents.

However, I just found the complementary code in this post that solves my problem: passing ensure_ascii=False as an argument to json.dump().

Here is the code I use now (taken from the notebook lab2_batch_transform.ipynb and modified with the 2 cited arguments):

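import csv
import json

# dataset_csv_file and dataset_jsonl_file are defined earlier in the notebook
with open(dataset_csv_file, "r+") as infile, open(dataset_jsonl_file, "w+", encoding='utf-8') as outfile:
    reader = csv.DictReader(infile)
    for row in reader:
        # ensure_ascii=False writes accented characters as-is instead of \uXXXX escapes
        json.dump(row, outfile, ensure_ascii=False)
        outfile.write('\n')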

:slight_smile:
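To see the effect of ensure_ascii in isolation, here is a small sketch that writes the same record twice to an in-memory buffer. The sample row and the dump_line helper are hypothetical and only for illustration; they are not part of the notebook.

```python
import io
import json

# Hypothetical sample row, standing in for one row of the CSV file.
row = {"inputs": "formação"}

def dump_line(record, ensure_ascii):
    """Serialize one record with json.dump and return the text that was written."""
    buffer = io.StringIO()
    json.dump(record, buffer, ensure_ascii=ensure_ascii)
    return buffer.getvalue()

escaped = dump_line(row, ensure_ascii=True)   # the default behaviour of json.dump
literal = dump_line(row, ensure_ascii=False)  # keeps non-ASCII characters as-is

print(escaped)  # {"inputs": "forma\u00e7\u00e3o"}
print(literal)  # {"inputs": "formação"}

# Both outputs parse back to the same data.
print(json.loads(escaped) == json.loads(literal))
```

With ensure_ascii=False the output contains non-ASCII characters, so the output file should be opened with encoding='utf-8', as in the code above.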
