High-Dimensional Cancer Data Visualization

Biomedical data are increasingly high-dimensional. It is now possible to measure the expression of tens of thousands of genes across tens of thousands of tissue samples. Such datasets have been compiled for several purposes, among them the study of the genetic basis of various cancers.

We applied a deep autoencoder (a special type of deep neural network) to the largest publicly available dataset on gene expression of various cancers (Torrente et al., 2016. Identification of cancer related genes using a comprehensive map of human gene expression. PLoS ONE, 11:e0157484). The dataset comprises tens of thousands of tissue samples from various cancers. To test whether the deep neural network is able to preserve the intrinsic structure of the data, we gave the network no information about which cancer each tissue sample came from. Even so, samples of the same cancer clearly grouped together in the resulting two-dimensional representation. This indicates that the structure of the high-dimensional data was preserved despite the simplification performed by the deep autoencoder.
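
The general idea can be illustrated with a minimal sketch. It is not the network used in the study: the layer sizes, the tanh activations, the plain gradient-descent training, and the synthetic three-group data are all assumptions made for illustration, and the names init_layers, forward, train_autoencoder and encode are hypothetical. The sketch trains a small autoencoder with a two-unit bottleneck on unlabeled data and then reads the two-dimensional coordinates off the bottleneck. The group labels are used only to generate the toy data and are never shown to the network.

```python
import numpy as np

LINEAR_LAYERS = (1, 3)  # assumed design: bottleneck and output layers are linear


def init_layers(sizes, rng):
    """Create one (weights, bias) pair per pair of consecutive layer sizes."""
    return [
        (rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(layers, x):
    """Return the activations of every layer, with the input as the first entry."""
    activations = [x]
    for i, (w, b) in enumerate(layers):
        z = activations[-1] @ w + b
        activations.append(z if i in LINEAR_LAYERS else np.tanh(z))
    return activations


def train_autoencoder(x, hidden=16, code=2, epochs=500, lr=0.1, seed=0):
    """Fit input -> hidden -> code -> hidden -> input by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[1]
    layers = init_layers([n_features, hidden, code, hidden, n_features], rng)
    losses = []
    for _ in range(epochs):
        acts = forward(layers, x)
        err = acts[-1] - x
        losses.append(float(np.mean(err ** 2)))  # mean squared reconstruction error
        grad = 2.0 * err / err.size  # gradient of the loss w.r.t. the output
        for i in reversed(range(len(layers))):
            w, b = layers[i]
            if i not in LINEAR_LAYERS:
                grad = grad * (1.0 - acts[i + 1] ** 2)  # derivative of tanh
            grad_w = acts[i].T @ grad
            grad_b = grad.sum(axis=0)
            grad = grad @ w.T  # propagate to the previous layer using the old weights
            layers[i] = (w - lr * grad_w, b - lr * grad_b)
    return layers, losses


def encode(layers, x):
    """Run only the encoder half and return the two-dimensional code."""
    return forward(layers[:2], x)[-1]


# Synthetic stand-in for expression data: 3 groups, 50 samples each, 20 features.
rng = np.random.default_rng(42)
centers = rng.normal(0.0, 3.0, size=(3, 20))
labels = np.repeat(np.arange(3), 50)  # used to generate the data only
x = centers[labels] + rng.normal(0.0, 1.0, size=(150, 20))
x = (x - x.mean(axis=0)) / x.std(axis=0)  # standardize each feature

layers, losses = train_autoencoder(x)  # note: labels are not passed in
embedding = encode(layers, x)  # one 2-D point per sample
```

Plotting the two columns of embedding against each other and coloring the points by labels afterwards would show whether samples from the same group end up near one another, which is the kind of check described above. For real expression data with tens of thousands of genes, a deep-learning framework with mini-batch training would be a more practical choice than this hand-written loop.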