Yuan Huang, PhD

In cancer studies with high-throughput “-omics” measurements, the analysis of a single dataset often suffers from a lack of power and poor reproducibility. Integrative analysis of multiple independent datasets provides an effective way of pooling information and outperforms single-dataset analysis and several alternative multi-dataset methods. In this study, we consider penalized variable selection and estimation in integrative analysis. Advancing beyond existing studies, we introduce a novel penalty that explicitly encourages similarity of the sparsity structures across datasets. This study is motivated by the practical consideration that, in many scenarios, multiple datasets are expected to share common important covariates. Theoretically, we establish selection and estimation consistency of the proposed method under high-dimensional settings. Numerically, the proposed method has identification and estimation performance better than or comparable to that of the alternatives under a wide spectrum of simulation scenarios. In the analysis of three lung cancer datasets with gene expression measurements, the proposed method identifies genes with sound biological implications and has satisfactory prediction performance.
