UTF-16 for datasets?

Working on a translator, hoping to do fine-tuning with a utf-16 dataset so I can get all the French accents etc.

Datasets load_dataset() doesn’t seem to like non-utf-8

Is there a way to specify or does it HAVE to be utf-8?

If it has to be utf-8, any suggestions for special characters?

You can pass encoding="utf-16" to the load_dataset call.


“got an unexpected keyword argument ‘encoding’”


my line looks like this
raw_dataset = load_dataset("json", data_files="./train.json", encoding="utf-16")

I’ve merged a PR that adds this param to the JSON builder.

You can access this feature by installing datasets from main with pip install git+https://github.com/huggingface/datasets.git.
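
On the second question: UTF-8 can encode every Unicode character, French accents included, so if installing from main is not an option, the file can be re-encoded once and then loaded with the defaults. Here is a minimal sketch using only the Python standard library. The helper name reencode_to_utf8 and the file names are made up for illustration and are not part of datasets.

```python
import json
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, src_encoding: str = "utf-16") -> None:
    """Copy a text file from src_encoding to UTF-8, one line at a time."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "train_utf16.json"  # hypothetical file names
        dst = Path(tmp) / "train_utf8.json"

        # Write a small UTF-16 sample containing accented characters.
        record = {"fr": "Où est la bibliothèque ? Ça dépend.", "en": "Where is the library? It depends."}
        src.write_text(json.dumps(record, ensure_ascii=False) + "\n", encoding="utf-16")

        reencode_to_utf8(src, dst)

        # The accents survive the round trip.
        print(json.loads(dst.read_text(encoding="utf-8")))
```

The resulting UTF-8 file should then load with a plain load_dataset("json", data_files=...) call, with no encoding argument needed.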

Thanks Mario! You sir are a gentleman and a scholar. :wink:
