Hi,
I am working with DNA data where I have a string of letters which I would like to one-hot encode into a vector.
I have a function like this:
import numpy as np

bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
comp_bases = {'A': 3, 'C': 2, 'G': 1, 'T': 0}

def one_hot(string):
    # rows are the bases (A, C, G, T), columns are positions in the sequence
    res = np.zeros((4, len(string)), dtype=np.float32)
    for j in range(len(string)):
        # a base can be 'N', signifying missing: this corresponds to all 0 in the encoding
        if string[j] in bases:
            res[bases[string[j]], j] = 1.0
    return res
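To make the intended output concrete, here is a small self-contained check of the encoding on its own, outside of the dataset. The sequence "ACGTN" is just a made-up example, and the function is repeated (iterating with enumerate instead of an index) so the snippet runs by itself.

```python
import numpy as np

bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(string):
    # rows are the bases (A, C, G, T), columns are positions in the sequence
    res = np.zeros((4, len(string)), dtype=np.float32)
    for j, base in enumerate(string):
        # anything not in `bases` (e.g. 'N') leaves its column all zeros
        if base in bases:
            res[bases[base], j] = 1.0
    return res

# hypothetical example sequence, not taken from my real data
vec = one_hot("ACGTN")
print(vec.shape, vec.dtype)
print(vec)
```

On its own the function gives me a float32 array of shape (4, length of sequence), with an all-zero column for 'N', which is the layout I would like to keep.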
I have then created a simple map function for it:
def vectorize_test(example):
    example['vec'] = one_hot(example['seq_0'])
    return example
I apply the vectorization function via map:
updated_dataset = dataset.map(vectorize_test)
but the resulting column has a nested Sequence datatype (with float64 values rather than float32), which I can’t convert into a torch tensor later on:
'vec': Sequence(feature=Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), length=-1, id=None)
Does anyone have any suggestions for how to write my function so that its output can be easily converted into a torch tensor? I’d like my tensor to be of shape (4, length of sequence), but this datatype doesn’t allow for that.
My goal is to process this data so that I can use it in a dataloader within a PyTorch Lightning module.