Trying to Build Datasets, Random Items Get Added

muellerzr · July 27, 2021, 1:28pm

I don’t think so: when I tried checking for that, I still got 25,000. Or, to put it better, this returns zero:

from fastcore.xtras import open_file

# Count the files in `texts` that contain a newline or carriage return
count = 0
for text in texts:
    t = open_file(text).read()
    if '\n' in t or '\r' in t:
        count += 1

Of course it would be! However, I’m currently trying to write a high-level data API for adaptnlp, so I’m only using IMDB as a situational test case.

Edit: Trying a new way to verify, will update with those results

Aha! @sgugger, thank you! There were some hidden \x85 characters, which are the source of the breakage.

I can work with that now. Thank you!
(If you have recommendations for fixes, I’m all ears; I was just going to take that into account while mapping labels from folder names.)
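
For anyone hitting the same thing, here is a minimal sketch of why a check for '\n' and '\r' comes back as zero while the row count is still off, plus one possible cleanup. It assumes the loader splits text the way Python's `str.splitlines` does, which treats \x85 (and a few other characters) as a line boundary. The names `LINE_BOUNDARIES`, `find_line_boundaries`, and `flatten` are made up for illustration and are not part of fastcore or datasets.

```python
# Every single character that str.splitlines() treats as a line boundary.
# '\n' and '\r' are only two of them; '\x85' (NEXT LINE) is easy to miss.
LINE_BOUNDARIES = "\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"


def find_line_boundaries(text):
    """Return the sorted, distinct line-boundary characters present in `text`."""
    return sorted({ch for ch in text if ch in LINE_BOUNDARIES})


def flatten(text, replacement=" "):
    """Join all the lines of `text` into one line (hypothetical cleanup step)."""
    return replacement.join(text.splitlines())


# A made-up review with a hidden \x85 in the middle.
sample = "Great movie.\x85It was fun."

# The narrow check misses it, the broader one does not.
print("\n" in sample or "\r" in sample)
print(find_line_boundaries(sample))

# After flattening, the text is a single line again.
print(len(flatten(sample).splitlines()))
```

Running `find_line_boundaries` over each file in place of the '\n' / '\r' test should flag the offending reviews, and `flatten` could be applied while reading the files and mapping labels from folder names. Whether a space is the right replacement depends on the data.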
