Vectorization of a DNA sequence

clakhani · July 8, 2022, 9:41pm

Hi,

I am working with DNA data where I have a string of letters which I would like to one-hot encode into a vector.

I have a function like this

import numpy as np

bases = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
comp_bases = {'A': 3, 'C': 2, 'G': 1, 'T': 0}  # not used in one_hot below

def one_hot(string):
    # One row per base (A, C, G, T), one column per sequence position.
    res = np.zeros((4, len(string)), dtype=np.float32)
    for j, base in enumerate(string):
        # The sequence may contain 'N' (missing); such positions stay all zeros.
        if base in bases:
            res[bases[base], j] = 1.0
    return res

I have then created a simple map function for it

def vectorize_test(example):
    example['vec'] = one_hot(example['seq_0'])
    return example

I apply the vectorization function via map

updated_dataset = dataset.map(vectorize_test)

but I seem to get a weird datatype which I can’t convert into a torch tensor later on

'vec': Sequence(feature=Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), length=-1, id=None)

Does anyone have any suggestions for how to write my function so that its output can be easily converted into a torch tensor? I’d like my tensor to be of shape (4, length of sequence), but this datatype doesn’t allow for that.

My goal is to process this data so that I can use it as a dataloader within a PyTorch lightning module.

mariosasko · July 13, 2022, 2:32pm

Hi! You can use the Array2D feature type to represent such data. Note that one limitation of this type is that only the first dimension can be dynamic, so to circumvent this, you should either transpose the arrays or add padding to the length of sequence dimension (we plan to remove this limitation very soon).
