Vectorization of a DNA sequence

Hi,

I am working with DNA data where I have a string of letters which I would like to one-hot encode into a vector.

I have a function like this

import numpy as np

bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
comp_bases = {'A': 3, 'C': 2, 'G': 1, 'T': 0}

def one_hot(string):
    # Rows index the base (A, C, G, T); columns index the position in the sequence.
    res = np.zeros((4, len(string)), dtype=np.float32)
    for j, base in enumerate(string):
        # The sequence may contain 'N' (missing base); its column stays all zeros.
        if base in bases:
            res[bases[base], j] = 1.0
    return res

I have then created a simple map function for it

def vectorize_test(example):
    example['vec'] = one_hot(example['seq_0'])
    return example

I apply the vectorization function via map

updated_dataset = dataset.map(vectorize_test)

but the resulting column has a nested Sequence type, with float64 values instead of float32, which I can’t convert into a torch tensor later on:

'vec': Sequence(feature=Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), length=-1, id=None)

Does anyone have any suggestions for how to write my function so that the result can be easily converted into a torch tensor? I’d like my tensor to be of shape (4, length of sequence), but this datatype doesn’t allow for that.

My goal is to process this data so that I can feed it through a dataloader within a PyTorch Lightning module.

Hi! You can use the Array2D feature type to represent such data. Note that one limitation of this type is that only the first dimension can be dynamic. To work around this, either transpose the arrays so that the sequence length comes first, i.e. shape (length of sequence, 4), or pad the sequence-length dimension to a fixed size (we plan to remove this limitation very soon).