ValueError: Integer column has NA values in column 10
I am working with an incomplete CSV dataset in which some of my integer columns have NA values. What would be the best way to deal with that? Is there some way to tell load_dataset about it? I was thinking of replacing those values with -1, or with null.
This is my code so far:
from datasets import Features, Value, load_dataset

features = Features({
    # '': Value(dtype='int32'),
    'ts': Value(dtype='double'),  # time
    'uid': Value(dtype='string'),  # string
    'id.orig_h': Value(dtype='string'),  # addr
    'id.orig_p': Value(dtype='int64'),  # port
    'id.resp_h': Value(dtype='string'),  # addr
    'id.resp_p': Value(dtype='int64'),  # port
    'proto': Value(dtype='string'),  # ClassLabel(names=['unknown_transport', 'tcp', 'udp', 'icmp']), enum
    'service': Value(dtype='string'),  # string
    'duration': Value(dtype='double'),  # interval TODO: time interval
    'orig_bytes': Value(dtype='int64'),  # count
    'resp_bytes': Value(dtype='int64'),  # count
    'conn_state': Value(dtype='string'),  # string
    'local_orig': Value(dtype='string'),  # bool
    'local_resp': Value(dtype='string'),  # bool
})
# Some columns removed from this example for "simplicity"
exclude_column_names = ['ts', 'uid', 'id.orig_h', 'id.resp_h', 'local_orig', 'local_resp']
reduced_iot_path = "MYPATH/iot23_combined_new.csv"
# '-' marks a missing value in the CSV
reduced_iot_dataset = load_dataset(
    "csv", data_files=reduced_iot_path, na_values=['-'], features=features
).remove_columns(exclude_column_names)
I am working with network traffic data. orig_bytes can sometimes be null, since some network traffic won't track it. If I make orig_bytes a double, the dataset loads, but the null values all become 0, even though 0 and null mean different things. Is there any way I could tell load_dataset to honor this difference?
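One workaround I am considering is to parse the CSV with pandas first, using its nullable Int64 dtype (capital I) so that a missing value stays <NA> instead of becoming 0. This is only a sketch under assumptions, not something tested against the full dataset. The three-row sample and its values are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV; '-' marks a missing value.
sample_csv = io.StringIO(
    "id.orig_p,orig_bytes,resp_bytes\n"
    "51524,0,0\n"
    "56305,-,-\n"
    "41101,75,243\n"
)

# "Int64" is pandas' nullable integer dtype; plain "int64" cannot hold NA.
df = pd.read_csv(
    sample_csv,
    na_values=["-"],
    dtype={"orig_bytes": "Int64", "resp_bytes": "Int64"},
)

# A real 0 and a missing value remain distinguishable.
print(df["orig_bytes"].tolist())
print(df["orig_bytes"].isna().tolist())
```

If this holds up, the resulting DataFrame could presumably be handed to Dataset.from_pandas, on the assumption that the conversion to Arrow keeps the missing entries as nulls. I have not verified that step here.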