mtmahe
1
Working on a translator, hoping to do fine-tuning with a utf-16 dataset so I can get all the French accents etc.
Datasets load_dataset() doesn't seem to like non-utf-8
Is there a way to specify or does it HAVE to be utf-8?
If it has to be utf-8, any suggestions for special characters?
You can pass encoding="utf-16" to the load_dataset call.
mtmahe
3
"got an unexpected keyword argument 'encoding'"
datasets==2.3.2
my line looks like this
raw_dataset = load_dataset("json", data_files="./train.json", encoding="utf-16")
I've merged a PR that adds this param to the JSON builder.
You can access this feature by installing datasets from main with pip install git+https://github.com/huggingface/datasets.git
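If you're stuck on a release that doesn't accept the encoding param yet, one possible workaround is to re-encode the file as UTF-8 before loading it. UTF-8 can represent every Unicode character, so the French accents survive the conversion. A minimal sketch using only the standard library; the helper name reencode_to_utf8 and the output filename are made up for illustration and are not part of datasets:

```python
from pathlib import Path


def reencode_to_utf8(src, dst, src_encoding="utf-16"):
    """Hypothetical helper: rewrite a text file as UTF-8.

    Reads the whole file into memory, so it assumes the dataset
    file is small enough for that.
    """
    # Decode with the source encoding (the utf-16 codec consumes the BOM).
    text = Path(src).read_text(encoding=src_encoding)
    # Write the same characters back out as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
    return dst


# Example (paths are assumptions based on the post above):
# reencode_to_utf8("./train.json", "./train_utf8.json")
```

After that, load_dataset("json", data_files="./train_utf8.json") should work without the encoding argument, since the file is now plain UTF-8.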
mtmahe
5
Thanks Mario! You sir are a gentleman and a scholar.