TypeError: Couldn't cast array of type int64 to null

c-bone · February 4, 2025, 12:47pm

Hello everyone!

I am currently training a GPT-2 model from scratch using my own cif-tokenizer. The goal is to be able to generate crystallographic information files using an LLM. Since some of the CIFs have more tokens than context length, I am using strided tokenization with returned overflowing tokens and padded to max context length. This method has worked for all my datasets (different materials but same format) - however, using the MP-20 dataset I am getting a ‘TypeError: Couldn’t cast array of type int64 to null’ when performing the dataset.map( function.

Data type of first example processed in train set:

Split: train                                                                                                                         
Number of examples: 27136                                                                                                            
Column names: ['Database', 'Material ID', 'Reduced Formula', 'CIF']                                                                  
Features: {'Database': Value(dtype='string', id=None), 'Material ID': Value(dtype='string', id=None), 'Reduced Formula': Value(dtype=
'string', id=None), 'CIF': Value(dtype='string', id=None)}                                                                           
                                                                                                                                     
First example:                                                                                                                       
{'Database': 'MP-20', 'Material ID': 'mp-1221227', 'Reduced Formula': 'Na3MnCoNiO6', 'CIF': "data_Na6Mn2Co2Ni2O12\nloop_\n _atom_type
_symbol\n _atom_type_electronegativity\n _atom_type_radius\n _atom_type_ionic_radius\n  Na  0.9300  1.8000  1.1600\n [...] \n"}

Dataloading script:

def tokenize_function(examples, tokenizer, context_length, stride):
    '''
    Tokenize a dataset using a sliding window approach.
    Processes the "CIF" column in a batch of examples.
    '''
    # Add BOS and EOS tokens to each CIF sequence
    bos_token = tokenizer.bos_token
    eos_token = tokenizer.eos_token

    # Tokenize the CIF column with BOS and EOS tokens
    return tokenizer(
        [bos_token + example + eos_token for example in examples["CIF"]],
        truncation=True,
        max_length=context_length,
        padding="max_length",
        stride=stride,
        return_overflowing_tokens=True,
        return_special_tokens_mask=True,
        return_offsets_mapping=False
    )

def load_data(tokenizer, dataset, context_length, stride, dataset_streaming=False):
    

    # Wrap the tokenize function to pass the tokenizer
    def tokenize_wrapper(examples):
        return tokenize_function(examples, tokenizer, context_length, stride)
    
    # Apply tokenization to the dataset
    tokenized_dataset = dataset.map(
        tokenize_wrapper,         # Pass the wrapped tokenize function
        batched=True,                    # Apply to batches of examples rather than indiv
        remove_columns=["CIF"],          # Remove the original "CIF" column, bc model only needs tokenized data
        # Only use num_proc if not streaming 
        **({"num_proc": 8} if not dataset_streaming else {})
    )
    
    # Create a data collator for training
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False                        # No masked language modeling
    )

    return tokenized_dataset, data_collator

Full error:

Conditionality is not activated                                                                                                                     
Map (num_proc=8):  91%|██████████████████████████████████████████████████████████████████████████████████████▋        | 24744/27136 [00:04<00:00, 52
00.22 examples/s]                                                                                                                                   
multiprocess.pool.RemoteTraceback:                                                                                                                  
"""                                                                                                                                                 
Traceback (most recent call last):                                                                                                                  
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker                      
    result = (True, func(*args, **kwds))                                                                                                            
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_qu
eue                                                                                                                                                 
    for i, result in enumerate(func(**kwargs)):                                                                                                     
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3499, in _map_single           
    writer.write_batch(batch)                                                                                                                       
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 605, in write_batch             
    arrays.append(pa.array(typed_sequence))                                                                                                         
  File "pyarrow/array.pxi", line 250, in pyarrow.lib.array                                                                                          
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol                                                                   
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 243, in __arrow_array__         
    out = cast_array_to_feature(                                                                                                                    
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper                       
    return func(array, *args, **kwargs)                                                                                                             " 12:44 04-Feb-25
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/table.py", line 2065, in cast_array_to_feature
    casted_array_values = _c(array.values, feature.feature)                                                                                         
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)                                   
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/table.py", line 2102, in cast_array_to_feature
    return array_cast(                                                    
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)                                   
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/table.py", line 1948, in array_cast
    raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type int64 to null   

The above exception was the direct cause of the following exception:                                                                                

Traceback (most recent call last):                                        
  File "/home/cyprien/CrystaLLMv2_CG/_train.py", line 194, in <module>                                                                              
    main()                                                                
  File "/home/cyprien/CrystaLLMv2_CG/_train.py", line 96, in main                                                                                   
    tokenized_dataset, data_collator = load_data(                                                                                                   
  File "/home/cyprien/CrystaLLMv2_CG/_dataloader.py", line 106, in load_data                                                                        
    tokenized_dataset = dataset.map(                                      
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/dataset_dict.py", line 886, in map
    {                                                                     
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/dataset_dict.py", line 887, in <dictcomp>
    k: dataset.map(                                                       
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                              
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3165, in map
    for rank, done, content in iflatmap_unordered(                                                                                                  
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 718, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]                                                                              
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 718, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]                                                                              
  File "/home/cyprien/miniconda3/envs/crystallmv2_venv/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value                                                     
TypeError: Couldn't cast array of type int64 to null

If I adjust the context length to be >max_token_length, I do not get an error. I have also tried specifying explicitly the features as per some of the posts on this forum, but I am unsure if I am doing it correctly.
I was wondering if anyone knows how I could tackle this problem? I am still quite new to HuggingFace so struggling to grasp what has gone wrong for this dataset. Thank you in advance for the help.

John6666 · February 5, 2025, 11:53am

I found a similar case, but it’s different…

c-bone · February 6, 2025, 9:49am

Hi, thanks for the help. I did see these before posting, but in my case all the entries strings so I don’t have these different datatypes. the only places I would have different datatypes would be in the return overflowing token mappings which I removed, at least to my understanding. So I’m not sure how I could deal with this.

John6666 · February 6, 2025, 1:52pm

To begin with, Python’s language specification includes None but not null, so I think that null probably comes from pandas or pyarrow.
In pandas, for example, just having “” creates null. I think that such data is causing problems when implicit conversion is performed inside the datasets library. On the other hand, int64 is probably data that has been successfully tokenized.

github.com/huggingface/datasets

"Couldn't cast array of type" in complex datasets

opened 02:16PM - 19 Jun 23 UTC

closed 03:13PM - 26 Jul 23 UTC

piercefreeman

### Describe the bug When doing a map of a dataset with complex types, sometime…s `datasets` is unable to interpret the valid schema of a returned datasets.map() function. This often comes from conflicting types, like when both empty lists and filled lists are competing for the same field value. This is prone to happen in batch mapping, when the mapper returns a sequence of null/empty values and other batches are non-null. A workaround is to manually cast the new batch to a pyarrow table (like implemented in this [workaround](https://github.com/piercefreeman/lassen/pull/3)) but it feels like this ideally should be solved at the core library level. Note that the reproduction case only throws this error if the first datapoint has the empty list. If it is processed later, datasets already detects its representation as list-type and therefore allows the empty list to be provided. ### Steps to reproduce the bug A trivial reproduction case: ```python from typing import Iterator, Any import pandas as pd from datasets import Dataset def batch_to_examples(batch: dict[str, list[Any]]) -> Iterator[dict[str, Any]]: for i in range(next(iter(lengths))): yield {feature: values[i] for feature, values in batch.items()} def examples_to_batch(examples) -> dict[str, list[Any]]: batch = {} for example in examples: for feature, value in example.items(): if feature not in batch: batch[feature] = [] batch[feature].append(value) return batch def batch_process(examples, explicit_schema: bool): new_examples = [] for example in batch_to_examples(examples): new_examples.append(dict(texts=example["raw_text"].split())) return examples_to_batch(new_examples) df = pd.DataFrame( [ {"raw_text": ""}, {"raw_text": "This is a test"}, {"raw_text": "This is another test"}, ] ) dataset = Dataset.from_pandas(df) # datasets won't be able to typehint a dataset that starts with an empty example. with pytest.raises(TypeError, match="Couldn't cast array of type"): dataset = dataset.map( batch_process, batched=True, batch_size=1, num_proc=1, remove_columns=dataset.column_names, ) ``` This results in crashes like: ```bash File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 1819, in wrapper return func(array, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 2109, in cast_array_to_feature return array_cast(array, feature(), allow_number_to_str=allow_number_to_str) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 1819, in wrapper return func(array, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 1998, in array_cast raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}") TypeError: Couldn't cast array of type string to null ``` ### Expected behavior The code should successfully map and create a new dataset without error. ### Environment info Mac OSX, Linux

Topic		Replies	Views
TypeError: Couldn't cast array of type int64 while mapping the dataset 🤗Datasets	6	5692	March 22, 2023
Strange Error While Attempting to Load DataSet 🤗Datasets	7	3588	March 28, 2025
Dataset map() raises value error when mapping list to dict-like class 🤗Datasets	6	106	August 15, 2024
Dataset.map returns error: pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values 🤗Datasets	1	1535	January 17, 2025
multiprocess.pool.RemoteTraceback and TypeError: Couldn't cast array of type string to null when loading Hugging Face dataset Beginners	1	132	September 24, 2024

TypeError: Couldn't cast array of type int64 to null

Related topics