Error Handling in IterableDataset?

Is there a built-in way of handling errors when streaming Datasets? I've been trying to stream the RedPajama-1T dataset and have hit errors on most subsets on multiple occasions (the GitHub and C4 subsets, for example).

I don't need all the data; getting most of it through is fine.

If there isn't a built-in way, that's all good. I'll look at writing a class that inherits from IterableDataset and handles the issue (unless there is a better way, in which case I'm all ears).
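For what it's worth, the kind of wrapper I had in mind looks roughly like this (a minimal, untested sketch; `skip_errors` is my own hypothetical helper, and depending on where the stream fails, the underlying iterator may not be resumable after an error):

    from datasets import IterableDataset

    def skip_errors(dataset):
        """Hypothetical wrapper: re-yield examples, logging and skipping any that raise.
        If the underlying stream itself dies, next() may keep failing or just stop."""
        def generator():
            it = iter(dataset)
            while True:
                try:
                    yield next(it)
                except StopIteration:
                    return
                except Exception as e:
                    print(f"Skipping bad example: {e}")
        return IterableDataset.from_generator(generator)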

RedPajama-1T uses a custom dataset loading script that downloads files hosted outside of HF, which can lead to unexpected failures. Maybe you can ask the authors why they aren't hosting the files on HF directly by opening a discussion: togethercomputer/RedPajama-Data-1T · Discussions

There are already retry mechanisms in `datasets` / `huggingface_hub` when streaming files hosted on HF.
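For example, you can bump how hard `datasets` retries when a streamed read drops. Something like this (the exact attribute names and defaults are version-dependent, so check `datasets/config.py` in your install):

    import datasets

    # Retry knobs for streaming reads; the names/values here are an assumption,
    # verify against datasets/config.py in your installed version.
    datasets.config.STREAMING_READ_MAX_RETRIES = 20    # attempts per read
    datasets.config.STREAMING_READ_RETRY_INTERVAL = 5  # seconds between attempts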


Thanks, Quentin, I'll do that.

For anyone looking for an immediate fix: I added error handling to their custom DatasetBuilder class, and it looks like a viable workaround (still testing).

I used Git LFS to download their dataset/loader into the directory containing my code, then modified the RedPajama1T builder class, specifically the `_generate_examples` function, so that it logs and skips bad rows and files instead of crashing. It should look like this:

    def _generate_examples(self, files):
        """This function returns the examples in the raw (text) form."""
        # json is imported at the top of the loader script
        key = 0
        errors = []  # file-level failures, collected so one bad file doesn't stop the stream
        for subset in files:
            if subset == "common_crawl":
                import zstandard as zstd

                for path in files[subset]:
                    try:
                        with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
                            for row in f:
                                try:
                                    data = json.loads(row)
                                    text = data["text"]
                                    del data["text"]
                                    yield key, {
                                        "text": text,
                                        "meta": json.dumps(data),
                                        "red_pajama_subset": subset,
                                    }
                                    key += 1
                                except Exception as e:
                                    # Row-level failure (bad JSON, missing key): log and skip
                                    print(f"Subset: {subset}")
                                    print(f"Path: {path}")
                                    print(f"Row: {row}")
                                    print(e)
                    except Exception as e:
                        # File-level failure (e.g. a truncated download): skip this file
                        errors.append(e)
            else:
                for path in files[subset]:
                    try:
                        with open(path, encoding="utf-8") as f:
                            for row in f:
                                try:
                                    data = json.loads(row)
                                    if "meta" not in data:
                                        text = data["text"]
                                        del data["text"]
                                        yield key, {
                                            "text": text,
                                            "meta": json.dumps(data),
                                            "red_pajama_subset": subset,
                                        }
                                    else:
                                        yield key, {
                                            "text": data["text"],
                                            "meta": data["meta"],
                                            "red_pajama_subset": subset,
                                        }
                                    key += 1
                                except Exception as e:
                                    # Row-level failure (bad JSON, missing key): log and skip
                                    print(f"Subset: {subset}")
                                    print(f"Path: {path}")
                                    print(f"Row: {row}")
                                    print(e)
                    except Exception as e:
                        # File-level failure (e.g. a truncated download): skip this file
                        errors.append(e)
        if errors:
            print(f"Skipped {len(errors)} unreadable file(s)")
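Note the two levels of handling: row-level failures (bad JSON, a missing "text" key) are printed and skipped, while file-level failures (e.g. a download that never completed) are collected in `errors`, so a single unreadable shard doesn't end the whole stream.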

You can then use `load_dataset` to read the modified dataset. The dataset's directory is named RedPajama-Data-1T and sits in the same directory as my code (you'll need to change the path passed to `load_dataset` otherwise):

    import datasets

    rpj_arxiv_dataset = datasets.load_dataset('./RedPajama-Data-1T', 'arxiv', streaming=True)
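To sanity-check that the patched loader streams past the bad rows, you can pull a few examples (assuming the default "train" split):

    # Peek at a few streamed examples from the patched loader
    for example in rpj_arxiv_dataset["train"].take(3):
        print(example["red_pajama_subset"], example["text"][:80])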
