UTF-16 for datasets?

mtmahe · June 19, 2023, 6:13pm

Working on a translator, hoping to do fine-tuning with a utf-16 dataset so I can get all the French accents etc.

The Datasets load_dataset() function doesn't seem to like non-UTF-8 files.

Is there a way to specify or does it HAVE to be utf-8?

If it has to be utf-8, any suggestions for special characters?

mariosasko · June 19, 2023, 7:16pm

You can pass encoding="utf-16" to the load_dataset call.

mtmahe · June 19, 2023, 7:28pm

I get "got an unexpected keyword argument 'encoding'".

datasets==2.3.2

My line looks like this:
raw_dataset = load_dataset("json", data_files="./train.json", encoding="utf-16")

mariosasko · June 21, 2023, 1:34pm

I’ve merged a PR that adds this param to the JSON builder.

You can access this feature by installing datasets from main with pip install git+https://github.com/huggingface/datasets.git.
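
If installing from main isn't an option, one workaround is to re-encode the file to UTF-8 before loading it. UTF-8 can represent every Unicode character, French accents included, so nothing is lost in the conversion. Below is a minimal sketch using only the standard library; the helper name reencode_to_utf8 and the file names are hypothetical, chosen for illustration.

```python
from pathlib import Path


def reencode_to_utf8(src: str, dst: str, src_encoding: str = "utf-16") -> None:
    """Read `src` using `src_encoding` and write the same text to `dst` as UTF-8.

    `reencode_to_utf8` is a hypothetical helper, not part of the datasets library.
    """
    # Decode the whole file; Python's "utf-16" codec consumes a leading
    # byte order mark if the file has one.
    text = Path(src).read_text(encoding=src_encoding)
    # Write the identical text back out, encoded as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")


# Example call (hypothetical paths):
# reencode_to_utf8("./train.json", "./train_utf8.json")
```

The converted file can then be passed to load_dataset("json", data_files=...) without an encoding argument, since UTF-8 is what the loader expects by default.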

mtmahe · June 21, 2023, 2:21pm

Thanks Mario! You sir are a gentleman and a scholar.
