I don’t think so. When I tried checking for that I still got 25,000. Or, a better way to put it: this returns zero:
from fastcore.xtras import open_file

# Count how many texts contain a newline or carriage return
count = 0
for text in texts:
    with open_file(text) as f:
        t = f.read()
    if '\n' in t or '\r' in t:
        count += 1
Of course it would be! However, I’m currently trying to write a high-level data API for adaptnlp, so I’m only using IMDB as a situational test case.
Edit: Trying a new way to verify, will update with those results
Aha! @sgugger thank you! There were some hidden \x85 characters, which is the source of the breakage.
I can work with that now. Thank you!
(If you have recommendations for fixes, I’m all ears; I was just going to take that into account while mapping labels from folder names.)
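For anyone hitting the same thing, here is a rough sketch of how I might detect and normalize these. The '\n' / '\r' check returns zero because \x85 (the Unicode "next line" character) is neither of those, yet Python's `str.splitlines()` still treats it as a line boundary, along with a few other characters. I'm assuming the breakage comes from something splitting on those boundaries. The names `LINE_BREAK_CHARS`, `has_line_break`, and `normalize_line_breaks` are made up for illustration and are not part of fastcore or adaptnlp.

```python
import re

# Every character that str.splitlines() treats as a line boundary
# (per the Python docs), including the easy-to-miss \x85.
LINE_BREAK_CHARS = "\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"

# One or more consecutive line-boundary characters.
_line_break_re = re.compile(f"[{LINE_BREAK_CHARS}]+")


def has_line_break(text: str) -> bool:
    """Return True if the text contains any character splitlines() would split on."""
    return _line_break_re.search(text) is not None


def normalize_line_breaks(text: str, replacement: str = " ") -> str:
    """Collapse each run of line-boundary characters into `replacement`."""
    return _line_break_re.sub(replacement, text)


sample = "good\x85movie"
print(has_line_break(sample))               # the '\n'/'\r' check alone would miss this
print(repr(normalize_line_breaks(sample)))
```

Whether a space is the right replacement depends on the downstream tokenizer, so treat that default as a guess.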