Steve Horvath

Stefan Horvath

Professor, Human Genetics, University of California Los Angeles

Professor, Psychiatry and Biobehavioral Sciences, University of California Los Angeles

(310) 825-9299

Laboratory Address:
Gonda 5545

Mailing Address:
Department of Human Genetics
Gonda Research Center
David Geffen School of Medicine
695 Charles E. Young Drive South
Box 708822
Los Angeles, CA 90095

Work Address:
Gonda 4357A
Campus - 708822
Los Angeles, CA 90095

Research Interests

I head the Array Data Analysis Group (ADAG) at UCLA, which specializes in the analysis of DNA and tissue microarray data and comprises faculty and students in the departments of Human Genetics and Biostatistics and in the Bioinformatics Program. ADAG has three missions: education, data analysis, and research. Below I highlight some research efforts of my group.

Family-based Allelic Association Tests for Finding Complex Disease Genes

Family-based allelic association tests (FBAT) are used to determine whether genetic markers are associated with disease occurrence. Family-based tests are attractive because they are robust to population admixture effects. Many complex genetic diseases, e.g., Alzheimer’s disease, have a late age of onset, so it can be difficult to obtain the genetic information of the patient’s parents. We developed the sibship disequilibrium test, which uses discordant sibships, and we collaborated with Profs. Laird and Xu to develop and implement the family-based allelic association test (FBAT) method and software. The FBAT method provides haplotype tests for family-based studies that are efficient and robust to population admixture, phenotype distribution specification, and ascertainment based on phenotypes. It can handle missing parental genotypes and/or missing phase in both offspring and parents. It yields either haplotype-specific (univariate) tests or multi-haplotype (global) tests.

Tissue Microarray Data: Random Forest Clustering

We have been excited about the potential of tissue microarray data for cancer genetics. Tissue microarrays are a new high-throughput tool for the study of protein expression patterns in tissues and are increasingly used to evaluate the diagnostic and prognostic importance of tumor biomarkers. The lack of appropriate statistical methodology has inspired us to develop and apply suitable data analysis methods.
Since it is standard practice in the tumor marker community to use cut-off values for tumor marker expression values, we realized the value of using tree- and forest-based prediction methods for these data. In particular, we have focused on the use of random forest dissimilarities for tumor class discovery and have studied the theoretical properties of a random forest dissimilarity. The random forest dissimilarity weighs the contribution of each covariate in a natural way: the more related the covariate is to other covariates, the more it will affect the definition of the dissimilarity. Dependent markers may correspond to disease pathways, which drive the clinical outcomes of interest.

Systems Biology: Weighted Gene Co-expression Networks

High-throughput approaches for analyzing the expressed genome provide an unprecedented opportunity to enhance our understanding of human disease. Identifying disease-associated genes that predict patient survival or that may be therapeutically targeted remains a significant challenge. A relatively new approach to analyzing complex microarray data involves the application of graphical network models to identify topological relationships between genes. By elucidating the higher-level organizational pattern of gene co-expression networks that regulate cellular phenotype, this approach has the potential to identify key disease genes. We have worked on gene co-expression network methods that can be used to explore the system-level functionality of genes. The gene network construction is conceptually straightforward: nodes represent genes, and two nodes are connected if the corresponding genes are significantly co-expressed across appropriately chosen tissue samples. In practice, it is tricky to define the connections between the nodes in such networks. An important question is whether it is biologically meaningful to encode gene co-expression using binary information (connected=1, unconnected=0).
We have introduced a general framework for ‘soft’ thresholding that assigns a connection weight to each gene pair. A technical report and an R tutorial are available.
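
To illustrate the contrast between binary (hard) and soft thresholding described above, here is a minimal Python sketch. It assumes that co-expression is measured by the absolute Pearson correlation and that the soft connection weight is obtained by raising that correlation to a power beta, which is one common choice of soft-thresholding function and not necessarily the only one covered by the framework. The function names, the toy expression values, and the settings tau = 0.85 and beta = 6 are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hard_adjacency(expr, tau):
    """Binary network: a gene pair is connected (1) if |cor| >= tau, else 0."""
    return {
        (g1, g2): 1 if abs(pearson(expr[g1], expr[g2])) >= tau else 0
        for g1, g2 in combinations(sorted(expr), 2)
    }


def soft_adjacency(expr, beta):
    """Weighted network: each gene pair gets the weight |cor| ** beta."""
    return {
        (g1, g2): abs(pearson(expr[g1], expr[g2])) ** beta
        for g1, g2 in combinations(sorted(expr), 2)
    }


# Hypothetical expression values for three genes across four samples.
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, 3.0, 2.0, 4.0],
}

hard = hard_adjacency(expression, tau=0.85)
soft = soft_adjacency(expression, beta=6)

for pair in sorted(soft):
    print(pair, hard[pair], round(soft[pair], 4))
```

In this toy example the pair (A, C) has correlation 0.8, so the hard threshold discards it entirely, while the soft threshold keeps it with a reduced weight. The soft version therefore preserves the continuous nature of the co-expression information instead of forcing a connected/unconnected decision.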


Dr. Horvath is an aging researcher and bioinformatician whose research lies at the intersection of epidemiology, chronic diseases, epigenetics, genetics, and systems biology. He developed systems biology approaches such as weighted gene co-expression network analysis. He works on all aspects of biomarker development, with a particular focus on genomic biomarkers of aging. He developed a multi-tissue biomarker of aging known as the epigenetic clock, whose salient features include its high accuracy and its applicability to a broad spectrum of tissues and cell types. He develops and applies methods for analyzing and integrating gene expression, DNA methylation, microRNA, genetic marker, and complex phenotype data. His lab members develop and apply data mining methods to study aging and a broad spectrum of diseases, e.g., cancer, cardiovascular disease, HIV, Huntington’s disease, and other neurodegenerative diseases.