Dealing with NA values in Int64 column with load_dataset

19kmunz · October 26, 2023, 3:20pm

ValueError: Integer column has NA values in column 10

I am working with an incomplete CSV dataset in which some of my integer columns have NA values. What would be the best way to deal with that? Is there some way to tell load_dataset about it? I was thinking of making those values -1, or null.

This is my code so far:

from datasets import Features, Value, load_dataset

features = Features({ #'': Value(dtype='int32'),
 'ts': Value(dtype='double'), #time
 'uid': Value(dtype='string'), #string
 'id.orig_h': Value(dtype='string'), #addr
 'id.orig_p': Value(dtype='int64'), #port
 'id.resp_h': Value(dtype='string'), #addr
 'id.resp_p': Value(dtype='int64'), #port
 'proto': Value(dtype='string'),#ClassLabel(names=['unknown_transport', 'tcp', 'udp', 'icmp']), #enum
 'service': Value(dtype='string'), #string
 'duration': Value(dtype='double'), #interval TODO: time interval
 'orig_bytes': Value(dtype='int64'), #count
 'resp_bytes': Value(dtype='int64'), #count
 'conn_state': Value(dtype='string'), #string
 'local_orig': Value(dtype='string'), #bool
 'local_resp': Value(dtype='string')}) #bool
# Some columns removed from this example for "simplicity"

exclude_column_names = ['ts','uid','id.orig_h', 'id.resp_h', 'local_orig', 'local_resp']

reduced_iot_path = "MYPATH/iot23_combined_new.csv"
reduced_iot_dataset = load_dataset("csv", data_files=reduced_iot_path,
                                   na_values=['-'], features=features).remove_columns(exclude_column_names)
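The -1 idea mentioned above could be sketched as a preprocessing pass over the CSV before it reaches load_dataset. This is only a sketch under assumptions: the helper name fill_missing_ints, the sample rows, and the choice of "-1" as sentinel are all hypothetical, and it only makes sense if -1 can never be a real value (byte counts should not be negative).

```python
import csv
import io


def fill_missing_ints(rows, int_columns, na_marker="-", sentinel="-1"):
    """Yield copies of the rows with missing integer fields replaced by a sentinel.

    Hypothetical helper: only fields in `int_columns` that are empty or equal
    to `na_marker` are touched; every other field is passed through unchanged.
    """
    for row in rows:
        yield {
            key: (sentinel if key in int_columns and value in ("", na_marker) else value)
            for key, value in row.items()
        }


# Made-up sample standing in for the real file; '-' marks a missing value.
raw = "uid,orig_bytes,resp_bytes\nC1,0,10\nC2,-,5\nC3,42,-\n"

reader = csv.DictReader(io.StringIO(raw))
cleaned = list(fill_missing_ints(reader, {"orig_bytes", "resp_bytes"}))

# Write the cleaned rows back out as CSV text (a real script would write a file).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(cleaned)
print(out.getvalue())
```

With the sentinel in place, a genuine 0 and a missing value stay distinguishable, at the cost of every downstream consumer having to know that -1 means "not recorded".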

I am working with network traffic data. orig_bytes can be null sometimes, because some network traffic does not track it. If I make orig_bytes a double, the dataset loads, but the null values all become 0, even though 0 and null mean different things. Is there any way I could tell load_dataset to honor this difference?
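One possible direction, offered as a sketch rather than a confirmed answer: the error message appears to come from pandas' CSV parser, which cannot store NA in a plain int64 column. pandas also has a nullable "Int64" extension dtype (capital I) that keeps missing values as NA instead of coercing them. The sample data and the helper name load_with_nullable_ints below are made up for illustration.

```python
import io

import pandas as pd


def load_with_nullable_ints(source, int_columns, na_marker="-"):
    """Read a CSV, parsing the given columns as pandas' nullable Int64 dtype.

    Hypothetical helper: `na_marker` is the string that marks a missing value,
    matching the na_values=['-'] setting used in the question.
    """
    return pd.read_csv(
        source,
        na_values=[na_marker],
        dtype={column: "Int64" for column in int_columns},
    )


# Made-up sample standing in for the real file.
csv_text = "uid,orig_bytes,resp_bytes\nC1,0,10\nC2,-,5\nC3,42,-\n"

df = load_with_nullable_ints(io.StringIO(csv_text), ["orig_bytes", "resp_bytes"])

print(df.dtypes)
print(df["orig_bytes"].isna().tolist())  # which rows are missing, as opposed to 0
```

Such a DataFrame could then be handed to Dataset.from_pandas. Whether the missing entries come through as nulls in the resulting dataset is worth verifying on the installed versions of pandas and datasets.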

