Hi.
I used the notebook lab2_batch_transform.ipynb by @philschmid to launch a batch job for inference. My dataset is composed of texts in Portuguese (i.e., with accents).
When I download the generated JSON file and open it with Sublime Text or note, I see that all accented letters were converted as follows:
(...)
{"inputs": "forma\u00e7\u00e3o ...}
(...)
In this example, forma\u00e7\u00e3o is formação.
What do you think? Can I use my JSON file as it is, or do I need to solve this problem in order to get real accented letters in my JSON file? Thanks.
You can handle the encoding of your JSON file correctly using:
import json

with open('keys.json', encoding='utf-8') as fh:
    data = json.load(fh)
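As a rough check of what loading does with the escaped text, here is a minimal sketch using only Python's standard json module. The sample line is the escaped form quoted in the question with its elided remainder left out, and the variable names are only illustrative. It suggests that the \uXXXX sequences are ordinary JSON escapes and decode back to the accented letters when the file is parsed.

```python
import json

# One record in the escaped form quoted in the question (the elided rest is omitted).
# The raw string keeps the backslashes exactly as they appear in the file.
escaped_line = r'{"inputs": "forma\u00e7\u00e3o"}'

# Parsing the JSON turns the \uXXXX escapes back into the real characters.
record = json.loads(escaped_line)
decoded = record["inputs"]

# Compare against the accented word from the question.
print(decoded == "formação")
```

If that holds for the real file as well, the escaped version carries the same data as the readable one; the difference is only in how the text looks in an editor.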
Thanks @philschmid. I tested your code with encoding='utf-8', but it did not change the content of my JSON file, which still showed escape sequences instead of accented letters.
However, I just found the complementary code in this post that solves my problem: ensure_ascii=False as an argument of json.dump().
Here is the code I use now (taken from the notebook lab2_batch_transform.ipynb and modified with the 2 cited arguments):
with open(dataset_csv_file, "r+") as infile, open(dataset_jsonl_file, "w+", encoding='utf-8') as outfile:
    reader = csv.DictReader(infile)
    for row in reader:
        json.dump(row, outfile, ensure_ascii=False)
        outfile.write('\n')
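To see the effect of ensure_ascii in isolation, here is a small sketch that runs the same CSV to JSON Lines loop on in-memory text instead of files. The one-row CSV and the helper csv_to_jsonl are hypothetical stand-ins for the notebook's dataset and loop, not part of the notebook itself.

```python
import csv
import io
import json

# Hypothetical one-row CSV standing in for dataset_csv_file.
csv_text = "inputs\nformação\n"


def csv_to_jsonl(text, ensure_ascii):
    """Convert CSV text to JSON Lines, writing one JSON object per row."""
    out = io.StringIO()
    for row in csv.DictReader(io.StringIO(text)):
        json.dump(row, out, ensure_ascii=ensure_ascii)
        out.write("\n")
    return out.getvalue()


# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = csv_to_jsonl(csv_text, ensure_ascii=True)
# With ensure_ascii=False the accented letters are written as they are.
readable = csv_to_jsonl(csv_text, ensure_ascii=False)

# Both outputs parse back to the same record.
same = json.loads(escaped) == json.loads(readable)
print(same)
```

When writing to a real file, ensure_ascii=False should be paired with an explicit encoding='utf-8' on the output file, as in the code above, so that the accented characters are stored consistently regardless of the platform default.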
