I am loading a text file that contains text in several languages, along with emojis and other special characters.
Sample line in the text file: ज़ागान, पश्चिमी पोलैंड में एक शहर है। यह शहर बोबर नदी के किनारे स्थित है
When I load the file with load_dataset, it raises a Unicode decoding error.
Code I used:
from datasets import load_dataset
dataset = load_dataset('text', data_files="/Folder/path/to/file/file.txt")
Error:
Traceback (most recent call last):
  File "vocab.py", line 4, in <module>
    dataset = load_dataset('text', data_files="/Folder/path/to/file/file.txt")
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/conda/lib/python3.8/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/lib/python3.8/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/datasets/builder.py", line 1133, in _prepare_split
    for key, table in utils.tqdm(
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1173, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/site-packages/datasets/packaged_modules/text/text.py", line 60, in _generate_tables
    batch = f.read(self.config.chunksize)
  File "/opt/conda/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
How can I avoid this error when loading a text file that contains characters from multiple languages?
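To narrow down where the bad bytes are, independent of datasets, it may help to check whether the raw file decodes as UTF-8 at all. The following is only a minimal diagnostic sketch, not a fix: the helper name find_utf8_error is hypothetical, and it assumes the file is small enough to read into memory in one go. The "unexpected end of data" message typically means a multi-byte character is cut off, so the demo reproduces that case with a deliberately truncated temporary file.

```python
import os
import tempfile


def find_utf8_error(path):
    """Return None if the file decodes as UTF-8, else (byte_offset, reason).

    Hypothetical helper for illustration; reads the whole file into memory.
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return err.start, err.reason
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "truncated.txt")
        # Drop the last byte so the final Devanagari character is incomplete.
        with open(sample, "wb") as f:
            f.write("है".encode("utf-8")[:-1])
        print(find_utf8_error(sample))
```

Running the same function on the real file path would show the offset of the first undecodable byte, which can then be inspected in a hex viewer to see whether the file is truncated or was saved in a different encoding.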