I'm working on a translator and hoping to fine-tune with a UTF-16 dataset so I can keep all the French accents etc.
The Datasets load_dataset() function doesn't seem to like non-UTF-8 files.
Is there a way to specify the encoding, or does it HAVE to be UTF-8?
If it has to be UTF-8, any suggestions for handling special characters?
You can pass encoding="utf-16" to the
I get: "got an unexpected keyword argument 'encoding'"
My line looks like this:
raw_dataset = load_dataset("json", data_files="./train.json", encoding="utf-16")
I’ve merged a PR that adds this param to the JSON builder.
You can access this feature by installing Datasets from source:
pip install git+https://github.com/huggingface/datasets.git
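If installing from source isn't an option, one possible workaround is to re-encode the file to UTF-8 before loading it. UTF-8 can represent every Unicode character, including French accented letters, so nothing is lost in the conversion. The sketch below is only an illustration using the Python standard library: the helper name reencode_to_utf8, the file names, and the sample records are all made up for the example and assume a JSON Lines file.

```python
import json
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Decode a text file as src_encoding and write it back out as UTF-8.

    Hypothetical helper for illustration; not part of the Datasets library.
    """
    with open(src_path, "r", encoding=src_encoding) as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Build a small UTF-16 JSON Lines file to stand in for the real dataset
# (paths and records are assumptions for this demo).
tmp_dir = tempfile.mkdtemp()
utf16_path = os.path.join(tmp_dir, "train_utf16.json")
utf8_path = os.path.join(tmp_dir, "train_utf8.json")

records = [
    {"en": "The boy is tired.", "fr": "Le garçon est fatigué."},
    {"en": "Where is the station?", "fr": "Où est la gare ?"},
]
with open(utf16_path, "w", encoding="utf-16") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Convert, then read the UTF-8 copy back to check the accents survived.
reencode_to_utf8(utf16_path, utf8_path)
with open(utf8_path, "r", encoding="utf-8") as f:
    round_tripped = [json.loads(line) for line in f]

# The UTF-8 copy can then be loaded without any encoding argument, e.g.
# raw_dataset = load_dataset("json", data_files=utf8_path)
```

The file is streamed line by line, so this should also work for datasets too large to hold in memory.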
Thanks Mario! You sir are a gentleman and a scholar.