Error Handling in IterableDataset?

Thanks Quentin, I’ll do that.

For anyone looking for an immediate fix, I added error handling to their custom DatasetBuilder class and it looks like a viable workaround (still testing).
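As far as I can tell, the handling has to live inside the generator itself: once a Python generator raises, it is finished and cannot be resumed, so catching the exception in the calling loop only lets you see the error, not continue past it. A minimal sketch of that behaviour (`flaky_rows` is a made-up stand-in, not part of the RedPajama loader):

```python
def flaky_rows():
    """Hypothetical generator that fails after its first item."""
    yield "row 0"
    raise ValueError("corrupt row")
    yield "row 2"  # never reached


gen = flaky_rows()
seen = []
while True:
    try:
        seen.append(next(gen))
    except StopIteration:
        # Raised on the call after the ValueError: the generator is already closed.
        break
    except ValueError as e:
        # We can observe the error here, but the remaining rows are lost.
        print(f"caught outside the generator: {e}")

print(seen)
```

Only the first row is collected, which is why I put the try/except around each row inside the loader rather than around the loop that consumes the dataset.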

I used Git LFS to download their dataset/loader into the directory where my code was, then modified the RedPajama1T class, specifically the "_generate_examples" function. It should look like this:

    def _generate_examples(self, files):                                                                                                                                                        
        """This function returns the examples in the raw (text) form."""                                                                                                                        
        key = 0                                                                                                                                                                                 
        errors = []  # file-level failures are collected here; print or log them if you want to see them
        for subset in files:                                                                                                                                                                    
            if subset == "common_crawl":                                                                                                                                                        
                import zstandard as zstd                                                                                                                                                        
                for path in files[subset]:
                    try:
                        with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
                            for i, row in enumerate(f):
                                try:
                                    data = json.loads(row)
                                    text = data["text"]
                                    del data["text"]
                                    yield key, {
                                        "text": text,
                                        "meta": json.dumps(data),
                                        "red_pajama_subset": subset,
                                    }
                                    key += 1
                                except Exception as e:
                                    print(f'Subset: {subset}')
                                    print(f'Path: {path}')
                                    print(f'Row: {row}')
                                    print(e)
                    except Exception as e:
                        # A file that cannot be opened or decompressed is skipped; the remaining files are still read.
                        errors.append(e)
            else:                               
                for path in files[subset]:                                                      
                    try:                        
                        with open(path, encoding="utf-8") as f:                                                                                                                                 
                            for i, row in enumerate(f):                                                                                                                                         
                                try:                                                            
                                    data = json.loads(row)                                                                                                                                      
                                    if "meta" not in data:                                                                                                                                      
                                        text = data["text"]                                                                                                                                     
                                        del data["text"]                                                                                                                                        
                                        yield key, {                                            
                                            "text": text,                                                                                                                                       
                                            "meta": json.dumps(data),                                                                                                                           
                                            "red_pajama_subset": subset,                                                                                                                        
                                        }                                                       
                                    else:                                                       
                                        yield key, {                                            
                                            "text": data["text"],                                                                                                                               
                                            "meta": data["meta"],                                                                                                                               
                                            "red_pajama_subset": subset,                                                                                                                        
                                        }                                                       
                                    key += 1                                                    
                                except Exception as e:                                          
                                    print(f'Subset: {subset}')                                                                                                                                  
                                    print(f'Path: {path}')                                                                                                                                      
                                    print(f'Row: {row}')                                                                                                                                        
                                    print(e)                                                    
                    except Exception as e:                                                      
                        errors.append(e) 
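The same skip-the-bad-row pattern in isolation, as a minimal sketch you can run without the dataset. `parse_rows` and the sample rows are made up for illustration. Unlike the loader above, this variant keeps the `yield` outside the `try` and catches only parsing-related exceptions, which is a design choice on my part rather than something the loader requires:

```python
import json


def parse_rows(rows, source="<in-memory>"):
    """Yield (key, example) pairs, skipping rows that cannot be parsed."""
    key = 0
    for line_no, row in enumerate(rows):
        try:
            data = json.loads(row)
            text = data.pop("text")  # removes "text" so the rest becomes "meta"
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
            # Report where the bad row was, then move on to the next one.
            print(f"Skipping {source} line {line_no}: {e!r}")
            continue
        yield key, {"text": text, "meta": json.dumps(data)}
        key += 1


rows = [
    '{"text": "a", "lang": "en"}',  # valid
    '{"text": "broken',             # truncated JSON
    '{"no_text": 1}',               # valid JSON, but no "text" field
    '{"text": "b"}',                # valid
]
examples = list(parse_rows(rows))
print(examples)
```

The two bad rows are reported and skipped, and the keys of the surviving examples stay consecutive because `key` is only incremented after a successful yield.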

You can then use “load_dataset” to read the modified dataset. The dataset’s directory is named RedPajama-Data-1T and was in the same directory as my code (otherwise you’ll need to change the path passed to load_dataset).

    import datasets

    rpj_arxiv_dataset = datasets.load_dataset('./RedPajama-Data-1T', 'arxiv', streaming=True)
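With streaming=True nothing is read until you iterate, so the prints from the modified loader only show up once you start pulling examples. A small sketch for pulling the first few examples as a sanity check. `peek` is a hypothetical helper, and the commented usage assumes the split is called "train", which I have not confirmed here:

```python
from itertools import islice


def peek(examples, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(examples, n))


# Assumed usage with the streaming dataset loaded above:
# for example in peek(rpj_arxiv_dataset["train"]):
#     print(example["red_pajama_subset"], example["text"][:80])

# Stand-in data so the helper can be tried without the dataset:
first = peek(iter(range(10)), 3)
print(first)
```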