Trying to Build Datasets, Random Items Get Added

Hi all,

I’m currently trying to load fastai’s version of the IMDB dataset, to learn how to build a Dataset from a folder of .txt files. I’m preparing my data with the following:

# Downloading the dataset
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_files

path = untar_data(URLs.IMDB, dest='./IMDB')

From there I can get all the training .txt files with:

texts = get_files(path/'train', extensions='.txt')
texts = [str(t) for t in texts]

This gives a list of 25,000 text files. However, when I use the load_dataset API to bring this in, my dataset suddenly has 25,682 items! Can anyone help me figure out why? This is an issue because I need to use add_column to add a label, and there’s a mismatch between the number of actual training items and the number Datasets picks up (sketched just below). Here is how I’m building the dataset:

dset = load_dataset('text', data_files={'train':texts})
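
To make the mismatch concrete, this is roughly where it bites (a sketch; labels is a hypothetical list built from each file’s parent folder):

from pathlib import Path

# Hypothetical label list, one entry per file (so 25,000 entries)
labels = [Path(t).parent.name for t in texts]

print(len(texts), dset['train'].num_rows)  # 25,000 files vs 25,682 rows
# dset['train'].add_column('label', labels)  # fails: row count and label count don't match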

TIA!

I think it’s because the “text” loader creates a new sample for each “\n” it sees, so any of your texts that contain one get split into several samples. @lhoestq or @albertvillanova, could you confirm?
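
A quick way to see the behaviour (a minimal sketch with a throwaway file):

from datasets import load_dataset

# One file whose contents include a newline...
with open('sample.txt', 'w') as f:
    f.write('first half of a review\nsecond half of the same review')

# ...becomes two rows, since the "text" loader yields one sample per line
dset = load_dataset('text', data_files={'train': ['sample.txt']})
print(dset['train'].num_rows)  # 2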

PS: it would be easier to just do dsets = load_dataset('imdb') :-p


I don’t think so; when I checked for that I still got 25,000. Or, to put it another way, this count comes back as zero:

from fastcore.xtras import open_file

# Count how many of the training files contain a '\n' or '\r'
count = 0
for text in texts:
    t = open_file(text).read()
    if '\n' in t or '\r' in t: count += 1

Of course it would be! However, I’m currently writing a high-level data API for adaptnlp, so I’m only using IMDB as a situational test case :slight_smile:

Edit: Trying a new way to verify, will update with those results

Aha! @sgugger thank you! There were some hidden \x85 characters in the files, and those turned out to be the source of the breakage.
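
For anyone who hits the same thing, extending the earlier check to look for \x85 directly is one way to spot them (a sketch):

from fastcore.xtras import open_file

# Count the training files that contain the hidden NEL character (\x85)
nel_count = 0
for text in texts:
    if '\x85' in open_file(text).read(): nel_count += 1
print(nel_count)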

I can work with that now. Thank you! :smiley:
(If you have recommendations for fixes, I’m all ears; I was just going to account for it while mapping labels from folder names.)
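
In case it helps anyone else, this is roughly the direction I’m leaning in (a sketch, assuming the files are UTF-8 and the fastai IMDB layout where each file’s parent folder, e.g. pos or neg, is its label): read the files yourself, clean out the stray \x85, and build the Dataset directly so the row count and the labels always line up.

from pathlib import Path
from datasets import Dataset

records = {'text': [], 'label': []}
for fname in texts:
    # Replace the hidden NEL characters so nothing gets split later on
    t = Path(fname).read_text(encoding='utf-8').replace('\x85', ' ')
    records['text'].append(t)
    records['label'].append(Path(fname).parent.name)  # e.g. 'pos' / 'neg'

dset = Dataset.from_dict(records)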