mtmahe
1
Working on a translator, hoping to do fine-tuning with a utf-16 dataset so I can get all the French accents etc.
Datasets load_dataset() doesn't seem to like non-utf-8
Is there a way to specify or does it HAVE to be utf-8?
If it has to be utf-8, any suggestions for special characters?
You can pass encoding="utf-16" to the load_dataset call.
mtmahe
3
"got an unexpected keyword argument 'encoding'"
datasets==2.3.2
my line looks like this
raw_dataset = load_dataset("json", data_files="./train.json", encoding="utf-16")
I've merged a PR that adds this param to the JSON builder.
You can access this feature by installing datasets from main with pip install git+https://github.com/huggingface/datasets.git.
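If installing from main is not an option, one workaround is to re-encode the file as UTF-8 before loading it. UTF-8 can represent every Unicode character, so French accents survive the conversion. Below is a minimal sketch using only the Python standard library; the helper name reencode_to_utf8 and the output path train_utf8.json are made up for illustration and are not part of datasets.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, src_encoding="utf-16"):
    """Rewrite a text file as UTF-8 and return the number of characters copied.

    Hypothetical helper: assumes the whole file fits in memory.
    """
    # Decoding with "utf-16" consumes the byte-order mark if one is present.
    text = Path(src).read_text(encoding=src_encoding)
    # Plain "utf-8" writes no byte-order mark, which JSON parsers expect.
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)


if __name__ == "__main__":
    # Paths are placeholders; adjust them to your own files.
    reencode_to_utf8("./train.json", "./train_utf8.json")
```

The converted file should then load with the default settings, for example load_dataset("json", data_files="./train_utf8.json").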
mtmahe
5
Thanks Mario! You sir are a gentleman and a scholar. 