Is there a built-in way of handling errors when streaming datasets? I’ve been trying to stream the RedPajama-1T dataset and have hit errors on most subsets on multiple occasions (see the screenshots for GitHub and C4 below):
If there isn’t a built-in way, that’s fine. I’ll look at writing a class that inherits from IterableDataset and handles the issue (unless there is a better approach, in which case I’m all ears).
RedPajama-1T uses a custom dataset loading script to download files hosted outside of HF, which can lead to unexpected failures. Maybe you can ask the authors why they’re not hosting the files on HF directly by opening a discussion: togethercomputer/RedPajama-Data-1T · Discussions
There are already retry mechanisms in datasets / huggingface_hub when streaming files from HF
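Those built-in retries apply to files served from the Hub, though; since the RedPajama script downloads from external hosts, wrapping the flaky call in a manual retry with backoff may still be needed. A minimal sketch (`with_retries` is my own naming, not a `datasets` or `huggingface_hub` helper):

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Hypothetical helper: sleeps base_delay, 2*base_delay, ... between
    attempts and re-raises the last exception when attempts run out.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

For example, `f = with_retries(lambda: open(path, "rb"))` retries a failing open; the same pattern can wrap a download call inside a loading script.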
For anyone looking for an immediate fix: I added error handling to their custom DatasetBuilder class and it looks like a viable workaround (still testing).
I used Git LFS to download their dataset/loader into the directory where my code was, then modified the RedPajama1T class, specifically the `_generate_examples` function. It should look like this:
```python
def _generate_examples(self, files):
    """This function returns the examples in the raw (text) form."""
    key = 0
    errors = []
    for subset in files:
        if subset == "common_crawl":
            import zstandard as zstd

            # try/except per file so one bad download doesn't abort
            # the rest of the subset
            for path in files[subset]:
                try:
                    # Common Crawl files are zstd-compressed JSONL
                    with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
                        for i, row in enumerate(f):
                            try:
                                data = json.loads(row)
                                text = data["text"]
                                del data["text"]
                                yield key, {
                                    "text": text,
                                    "meta": json.dumps(data),
                                    "red_pajama_subset": subset,
                                }
                                key += 1
                            except Exception as e:
                                # Malformed row: log it and keep streaming
                                print(f"Subset: {subset}")
                                print(f"Path: {path}")
                                print(f"Row: {row}")
                                print(e)
                except Exception as e:
                    # Failure opening/reading the file itself
                    errors.append(e)
        else:
            for path in files[subset]:
                try:
                    with open(path, encoding="utf-8") as f:
                        for i, row in enumerate(f):
                            try:
                                data = json.loads(row)
                                if "meta" not in data:
                                    text = data["text"]
                                    del data["text"]
                                    yield key, {
                                        "text": text,
                                        "meta": json.dumps(data),
                                        "red_pajama_subset": subset,
                                    }
                                else:
                                    yield key, {
                                        "text": data["text"],
                                        "meta": data["meta"],
                                        "red_pajama_subset": subset,
                                    }
                                key += 1
                            except Exception as e:
                                # Malformed row: log it and keep streaming
                                print(f"Subset: {subset}")
                                print(f"Path: {path}")
                                print(f"Row: {row}")
                                print(e)
                except Exception as e:
                    # Failure opening/reading the file itself
                    errors.append(e)
```
You can then use `load_dataset` to read the modified dataset. The dataset’s directory is named RedPajama-Data-1T and was in the same directory as my code (you’ll need to change the path passed to `load_dataset` otherwise).