UTF-16 for datasets?

Working on a translator, hoping to do fine-tuning with a utf-16 dataset so I can get all the French accents etc.

Datasets load_dataset() doesn't seem to like non-utf-8

Is there a way to specify or does it HAVE to be utf-8?

If it has to be utf-8, any suggestions for special characters?

You can pass encoding="utf-16" to the load_dataset call.

"got an unexpected keyword argument 'encoding'"

datasets==2.3.2

My line looks like this:
raw_dataset = load_dataset("json", data_files="./train.json", encoding="utf-16")

I've merged a PR that adds this param to the JSON builder.

You can access this feature by installing datasets from main with pip install git+https://github.com/huggingface/datasets.git.
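For anyone who can't install from main, one possible workaround is to re-encode the file to UTF-8 before loading it. UTF-8 can represent every Unicode character, so French accents survive the conversion. Below is a minimal sketch using only the Python standard library; the helper name reencode_to_utf8 and the output path are made up for illustration and are not part of datasets.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, src_encoding="utf-16"):
    """Read a text file in src_encoding and write the same text out as UTF-8.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    # Decoding with "utf-16" consumes the byte-order mark if one is present.
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst
```

After calling reencode_to_utf8("./train.json", "./train_utf8.json"), the converted file should load with the original call minus the encoding argument: load_dataset("json", data_files="./train_utf8.json").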

Thanks Mario! You sir are a gentleman and a scholar. :wink:
