Research Post

Gene expression based survival prediction for cancer patients—A topic modeling approach

Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high-dimensionality of such gene expression data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model—e.g., to accommodate the real-valued expression values—leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can then express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm—here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using the r = 49,576 gene expression values, from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values—here using the mRNAseq modality—and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrates the generality, and the effectiveness, of this approach. The dLDA+MTLR source code is available at


We greatly acknowledge Compute Canada for providing us with computing resources. We also thank Dream Challenges and Broad Institute for making the METABRIC and KIPAN cancer data sets publicly available.

Latest Research Papers

Connect with the community

Get involved in Alberta's growing AI ecosystem! Speaker, sponsorship, and letter of support requests welcome.

Explore training and advanced education

Curious about study options under one of our researchers? Want more information on training opportunities?

Harness the potential of artificial intelligence

Let us know about your goals and challenges for AI adoption in your business. Our Investments & Partnerships team will be in touch shortly!