Abstract
Understanding the role of differential gene expression in cancer etiology and cellular processes is a complex problem that continues to pose a challenge due to the sheer number of genes and inter-related biological processes involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to mitigate overfitting of high-dimensional gene expression data and to facilitate understanding of the associated pathways. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. Here, we propose to use LDA for clustering as well as for classification of cancer and healthy tissues, using lung cancer and breast cancer messenger RNA (mRNA) sequencing data. We describe our study in three phases: clustering, classification, and gene interpretation. First, LDA is used as a clustering algorithm to group the data in an unsupervised manner. Next, we develop a novel LDA-based classification approach to classify unknown samples based on similarity of co-expression patterns. An evaluation of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Lastly, we present a functional analysis of the genes identified using a novel topic profile matrix formulation. This analysis identified several genes and pathways that could potentially be involved in differentiating tumor samples from normal samples. Overall, our results position LDA as a promising approach for classification of tissue types based on gene expression data in cancer studies.
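The clustering and classification phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn's `LatentDirichletAllocation`, treats each sample as a "document" of gene counts, and uses a simple nearest-centroid rule in topic space as a stand-in for the paper's similarity-based classifier. The synthetic data, the function names `make_synthetic_counts` and `classify_by_topic_similarity`, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def make_synthetic_counts(n_per_class=30, n_genes=60, depth=2000, seed=0):
    """Hypothetical count data: two classes with different expression programs."""
    rng = np.random.default_rng(seed)
    normal_profile = np.ones(n_genes)
    tumor_profile = np.ones(n_genes)
    normal_profile[: n_genes // 2] = 10.0   # first half of genes up in "normal"
    tumor_profile[n_genes // 2:] = 10.0     # second half of genes up in "tumor"
    normal_profile /= normal_profile.sum()
    tumor_profile /= tumor_profile.sum()
    normal = rng.multinomial(depth, normal_profile, size=n_per_class)
    tumor = rng.multinomial(depth, tumor_profile, size=n_per_class)
    X = np.vstack([normal, tumor])                    # samples x genes
    y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = normal, 1 = tumor
    return X, y


def classify_by_topic_similarity(X_train, y_train, X_test, n_topics=2, seed=0):
    """Fit LDA on training counts, then assign each test sample to the class
    whose mean topic distribution is closest (an assumed similarity rule)."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta_train = lda.fit_transform(X_train)   # per-sample topic proportions
    theta_test = lda.transform(X_test)
    classes = np.unique(y_train)
    centroids = np.vstack(
        [theta_train[y_train == c].mean(axis=0) for c in classes]
    )
    # Euclidean distance from every test sample to every class centroid.
    dist = np.linalg.norm(
        theta_test[:, None, :] - centroids[None, :, :], axis=2
    )
    return classes[dist.argmin(axis=1)], lda


if __name__ == "__main__":
    X, y = make_synthetic_counts()
    # Even rows for training, odd rows held out as "unknown" samples.
    pred, _ = classify_by_topic_similarity(X[::2], y[::2], X[1::2])
    print("held-out accuracy:", (pred == y[1::2]).mean())
```

For unsupervised clustering alone, each sample could instead be assigned to its dominant topic, for example `lda.transform(X).argmax(axis=1)`. With real mRNA sequencing data, the number of topics and any preprocessing of counts would need to be chosen for the dataset at hand.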
Original language | English |
---|---|
Title of host publication | ACM-BCB 2017 - Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics |
Publisher | Association for Computing Machinery, Inc |
Pages | 388-393 |
Number of pages | 6 |
ISBN (Electronic) | 9781450347228 |
State | Published - Aug 20 2017 |
Event | 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - Boston, United States; Duration: Aug 20 2017 → Aug 23 2017; Conference number: 8 |
Conference
Conference | 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics |
---|---|
Abbreviated title | ACM-BCB 2017 |
Country/Territory | United States |
City | Boston |
Period | 8/20/17 → 8/23/17 |
ASJC Scopus Subject Areas
- Software
- Biomedical Engineering
- Health Informatics
- Computer Science Applications
Keywords
- Cancer
- Classification
- Clustering
- Gene expression
- Latent Dirichlet Allocation
- Machine learning
- Topic modeling
Disciplines
- Bioinformatics
- Communication Technology and New Media
- Databases and Information Systems
- OS and Networks
- Science and Technology Studies