A Novel Approach for Classifying Gene Expression Data using Topic Modeling

Soon Jye Kho, Michael L. Raymer, Hima Bindu Yalamanchili, Amit P. Sheth

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Understanding the role of differential gene expression in cancer etiology and cellular processes is a complex problem that continues to pose a challenge due to the sheer number of genes and inter-related biological processes involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to mitigate overfitting of high-dimensionality gene expression data and to facilitate understanding of the associated pathways. LDA has recently been applied for clustering and exploring genomic data, but not for classification and prediction. Here, we propose to use LDA in clustering as well as in classification of cancer and healthy tissues using lung cancer and breast cancer messenger RNA (mRNA) sequencing data. We describe our study in three phases: clustering, classification, and gene interpretation. First, LDA is used as a clustering algorithm to group the data in an unsupervised manner. Next, we develop a novel LDA-based classification approach to classify unknown samples based on similarity of co-expression patterns. Our evaluation of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Lastly, we present a functional analysis of the genes identified using a novel topic profile matrix formulation. This analysis identified several genes and pathways that could potentially be involved in differentiating tumor samples from normal samples. Overall, our results suggest that LDA is a promising approach for classification of tissue types based on gene expression data in cancer studies.
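The classification phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes that LDA has already been fitted and has assigned each sample a vector of topic proportions, and it then uses a simple nearest-profile rule with cosine similarity as a stand-in for "similarity of co-expression patterns". The function names, the three-topic vectors, and the labels are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_profiles(topic_mix, labels):
    """Average the topic proportions of the training samples in each class."""
    sums, counts = {}, {}
    for mix, label in zip(topic_mix, labels):
        acc = sums.setdefault(label, [0.0] * len(mix))
        for k, p in enumerate(mix):
            acc[k] += p
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(sample_mix, profiles):
    """Return the class whose mean topic profile is most similar to the sample."""
    return max(profiles, key=lambda label: cosine(sample_mix, profiles[label]))


# Hypothetical per-sample topic proportions (3 topics), assumed to come
# from an LDA model fitted on gene expression counts. Illustrative only.
train_mix = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.25, 0.65],
    [0.05, 0.15, 0.80],
]
train_labels = ["tumor", "tumor", "normal", "normal"]

profiles = class_profiles(train_mix, train_labels)
prediction = classify([0.75, 0.15, 0.10], profiles)
print(prediction)
```

In this toy setting each class is summarized by one averaged topic vector, loosely analogous to a row of a topic profile matrix; the paper's actual formulation may differ.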

Original language: English
Title of host publication: ACM-BCB 2017 - Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher: Association for Computing Machinery, Inc
Pages: 388-393
Number of pages: 6
ISBN (Electronic): 9781450347228
State: Published - Aug 20 2017
Event: 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - Boston, United States
Duration: Aug 20 2017 to Aug 23 2017
Conference number: 8

Conference

Conference: 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Abbreviated title: ACM-BCB 2017
Country/Territory: United States
City: Boston
Period: 8/20/17 to 8/23/17

ASJC Scopus Subject Areas

  • Software
  • Biomedical Engineering
  • Health Informatics
  • Computer Science Applications

Keywords

  • Cancer
  • Classification
  • Clustering
  • Gene expression
  • Latent Dirichlet Allocation
  • Machine learning
  • Topic modeling

Disciplines

  • Bioinformatics
  • Communication Technology and New Media
  • Databases and Information Systems
  • OS and Networks
  • Science and Technology Studies
