Dataset labels with more than one identifier

I am creating a dataset that uses genes as labels. I would like to store two sets of label identifiers / names with the dataset. Is there a way to do this?

This would be multiple identifiers for the same labels rather than two separate sets of labels attached to the dataset. For each gene, I have NCBI’s gene IDs and gene symbols. Gene IDs are useful for linking to other datasets but are less human friendly (e.g. I have no idea what gene 294928 is). Gene symbols are more meaningful but less formalized: the same gene ID may be associated with multiple symbols and, annoyingly, the same symbol can be used for multiple IDs.

I am using a Sequence of ClassLabels to store the labels. I think deduplicated symbols make more sense as the label names, since they are naturally strings, whereas gene IDs are integers. I could use the gene IDs as the label values, but these are non-sequential, so I think it’s easier to create an index from the gene IDs. That avoids large runs of always-empty columns when working with the labels as a binary array.
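The indexing idea described above can be sketched roughly as follows. This is only an illustration in plain Python, not anything from the datasets library: the example gene IDs, the id_to_index mapping, and the to_binary helper are all made-up names.

```python
# Hypothetical example data: NCBI gene IDs are non-sequential integers.
gene_ids = [7157, 672, 294928]

# Sort the unique IDs for a stable order, then map each gene ID to a
# contiguous label index (0, 1, 2, ...).
sorted_ids = sorted(set(gene_ids))
id_to_index = {gid: i for i, gid in enumerate(sorted_ids)}


def to_binary(sample_gene_ids, id_to_index):
    """Encode the genes of one sample as a dense 0/1 vector."""
    vec = [0] * len(id_to_index)
    for gid in sample_gene_ids:
        vec[id_to_index[gid]] = 1
    return vec


# A sample labelled with two of the three genes.
vector = to_binary([672, 294928], id_to_index)

# The gene ID of label index i is recovered as sorted_ids[i].
recovered = [sorted_ids[i] for i, flag in enumerate(vector) if flag]
```

The binary vector then has one column per gene actually present in the dataset, rather than one column per possible gene ID value.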

Ideally, I would like to add the gene IDs as an array that is stored in the dataset as metadata, so the gene ID of label index i would be obtained as gene_ids[i]. Any thoughts on the best way to store this extra information? I see I can set the names value of ClassLabel to arbitrary data types (potentially zip(gene_ids, symbols)), but that seems non-intuitive and is likely not supposed to be allowed.

I can attach arbitrary attributes to the ClassLabel feature. Would that be a reasonable place to store these data? Something like dataset.features["labels"].feature.gene_ids = gene_ids, where the "labels" feature is a Sequence of ClassLabels?

Would this be lost when uploading to the dataset hub?

Basically, I don’t think anything you upload will be lost, but it’s better to keep it in a proper, easy-to-use dataset format.
One option is to dump a Python dict to JSON and upload it as a separate file, but that’s not very elegant.
Is there a better way? @lhoestq

I ended up saving the symbols list as JSON for now. It works fine for my use case, since I’m already converting the dataset for my data loader and can read in the symbols inside the conversion function. But anyone else using the dataset would have to either find their own symbols or read the JSON file in manually.
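For anyone wanting to do the same, here is a minimal sketch of that JSON sidecar approach using only the standard library. The file name label_metadata.json, the two helper functions, and the placeholder symbols are all hypothetical and just for illustration. Both lists are assumed to be ordered by label index.

```python
import json
import tempfile
from pathlib import Path


def save_label_metadata(path, gene_ids, symbols):
    """Write gene IDs and symbols, both ordered by label index, to JSON."""
    if len(gene_ids) != len(symbols):
        raise ValueError("gene_ids and symbols must have the same length")
    payload = {"gene_ids": gene_ids, "symbols": symbols}
    Path(path).write_text(json.dumps(payload, indent=2))


def load_label_metadata(path):
    """Read the two lists back; position i corresponds to label index i."""
    data = json.loads(Path(path).read_text())
    return data["gene_ids"], data["symbols"]


# Hypothetical example data; the symbols are placeholders, not real ones.
gene_ids = [672, 7157, 294928]
symbols = ["SYMBOL_A", "SYMBOL_B", "SYMBOL_C"]

with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "label_metadata.json"
    save_label_metadata(meta_path, gene_ids, symbols)
    loaded_ids, loaded_symbols = load_label_metadata(meta_path)

# Look up both identifiers for a given label index.
label_index = 2
lookup = (loaded_ids[label_index], loaded_symbols[label_index])
```

The JSON file would sit next to the dataset files, and the loading step would go wherever the dataset is converted for the data loader.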

Also, attaching additional attributes to the ClassLabel feature didn’t work, since the save_to_disk method doesn’t know to look for them.
