The Diabetes Location, Environmental Attributes, and Disparities (LEAD) Network is a CDC-funded collaboration across several sites that aims to understand how community-level factors contribute to geographic and demographic differences in diabetes incidence and prevalence across the US. Of primary interest to the Network is the association of neighborhood socioeconomic environment with diabetes onset, and the impact of a variety of mediators on this relationship. Over the last year, the Data Coordinating Center at Drexel has led the development of a number of measures to be used in this analysis. In this talk, I will describe the challenges of developing national exposure measures, some of the statistical issues that arise when doing so, and some of the statistical questions that arise in the analysis of these data.
“The Applicability of Rank Normalization to Microbiome Data” (Advisor: Kai Wang)
“Genome-Wide Association Study of Sex Differences in Brain MRI Scans and Its Genetic Correlation to Psychiatric Disorders” (Advisor: Patrick Breheny)
“Machine Learning Approaches to Classify Post-Transplant Disease Status in Hodgkin’s Lymphoma Patients” (Advisor: Brian J. Smith)
“Assessing Performance on a Time-Gated Word Recognition Task from a Longitudinal Study on Children who are Hard of Hearing” (Advisor: Jacob Oleson)
“Modified Rule-Based Phase 1 Clinical Trials” (Advisor: Xian Jin Xie)
“Quantifying Drug Interactions” (Advisor: Xian Jin Xie)
“Effectiveness of Community Guide Preventive Approaches to Promote Physical Activity in a Midwestern Micropolitan City” (Advisor: Daniel Sewell)
We live in an era where data are ubiquitous and there is ever increasing interest in making predictions about future events based on past data. Such is the focus of predictive modeling – a subject that overlaps with the fields of statistics, machine learning, artificial intelligence, data science, and others. The practice of predictive modeling is reliant on software for the fitting, evaluation, and application of models. A large number of predictive modeling techniques are available as R software packages. However, interface and feature differences across packages can make their collective use challenging. In this talk, I will discuss the new R MachineShop package for statistical and predictive modeling. MachineShop aims to unify techniques from different packages by providing a common interface for model fitting, prediction, performance assessment, and presentation of results. Support is currently provided for 51 models from 26 R packages, including traditional regression, regularization methods, tree-based methods, support vector machines, neural networks, and ensembles, as well as for data preprocessing, filtering, and model tuning and selection. Model predictive performance can be quantified with a range of performance metrics and estimated with independent test sets, split sampling, cross-validation, or bootstrap resampling. Analysis results can be summarized with descriptive statistics; calibration curves; variable importance; partial dependence plots; confusion matrices; and ROC, lift, and other performance curves. This talk will provide an accessible introduction to the package, followed by a demonstration of its easy-to-use, yet powerful paradigm for model tuning and selection.
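MachineShop itself is an R package, so its interface is best seen in R. As a language-neutral sketch of the resampled performance estimation described above, here is a minimal k-fold cross-validation loop in Python; all names here are illustrative and are not part of the MachineShop API:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_mse(xs, ys, fit, predict, k=5):
    """Estimate out-of-sample mean squared error via k-fold cross-validation."""
    folds = k_fold_indices(len(xs), k)
    errors = []
    for fold in folds:
        hold = set(fold)
        # Fit on everything outside the held-out fold.
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in hold]
        model = fit(train)
        # Score on the held-out fold only.
        for i in fold:
            errors.append((ys[i] - predict(model, xs[i])) ** 2)
    return sum(errors) / len(errors)

# Toy model: simple least-squares line y = a + b*x, fit in closed form.
def fit_line(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    return my - b * mx, b

def predict_line(model, x):
    a, b = model
    return a + b * x

xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x for x in xs]  # noiseless line, so CV error should be ~0
print(cross_validated_mse(xs, ys, fit_line, predict_line))
```

The same resampling skeleton extends to split sampling and the bootstrap by changing only how the held-out indices are drawn, which is essentially the unification MachineShop provides across its supported models.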
For analysis of spatial data, geographers and statisticians have introduced various approaches to removing spatial autocorrelation in regression residuals by augmenting the design matrix with vectors that represent spatial patterns. We propose a fully Bayesian method that balances model fit and reduction of residual correlation. It is computationally fast and performs competitively with established methods. We illustrate with data on the 2018 Iowa gubernatorial election. This is largely joint work with my former PhD student Juan Cervantes.
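One widely used version of this design-matrix augmentation is Moran eigenvector spatial filtering; the abstract does not specify the exact construction, so the following is an illustrative sketch. Writing $C$ for the spatial connectivity matrix of the areal units, the candidate augmenting vectors are the eigenvectors of the doubly centered matrix

\[
\Omega = \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right) C \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right),
\]

and the augmented regression becomes

\[
y = X\beta + E_{k}\gamma + \varepsilon,
\]

where $E_{k}$ collects the selected eigenvectors. A fully Bayesian treatment such as the one described can place priors on which eigenvectors enter the model, trading off model fit against residual spatial autocorrelation.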
It is well-established that contact network structure strongly influences infectious disease dynamics. However, less well-studied is the impact of network structure on the effectiveness and efficiency of disease control strategies. In this talk, I will present an evaluation of partner management strategies to address a hypothetical bacterial sexually transmitted infection (STI). I will compare the costs, disease outcomes, and cost-effectiveness of three partner management interventions (partner notification, expedited partner therapy, and contact tracing) in populations with the same average behavior, but configured according to different network structures. This case study is one demonstration of how network structure can influence both the effectiveness and efficiency of infectious disease interventions, as well as the interplay between intervention capacity constraints, disease dynamics, and network connectivity patterns.
In radiologic diagnostic imaging studies, the goal is typically to compare the performance of readers (usually radiologists) across two or more tests or modalities (e.g., digital versus film mammograms, or CT versus MRI) to determine which performs better. The most common design for such studies is a paired design in which each reader assigns confidence-of-disease ratings to the same images under each test, with reader-performance outcomes estimated by functions of the estimated receiver-operating-characteristic (ROC) curve. Examples of such reader-performance measures are the sensitivity achieved for a given specificity and the area under the ROC curve (AUC), which estimates the probability of correctly discriminating between a randomly chosen pair of normal and abnormal images.
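The AUC description above lends itself to direct computation: the empirical AUC is the fraction of (normal, abnormal) image pairs in which the abnormal image receives the higher rating, with ties counted as one half (the Mann-Whitney form). A minimal sketch with hypothetical rating data:

```python
def empirical_auc(normal_scores, abnormal_scores):
    """Empirical AUC: fraction of (normal, abnormal) pairs ranked
    correctly, counting tied ratings as half a correct ordering."""
    wins = 0.0
    for a in abnormal_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(abnormal_scores))

# Hypothetical confidence-of-disease ratings on a 1-5 scale.
normal = [1, 2, 2, 3, 1]
abnormal = [3, 4, 5, 4, 2]
print(empirical_auc(normal, abnormal))  # → 0.9
```

An AUC of 1 means the reader separates normal and abnormal images perfectly, while 0.5 is no better than chance; in multireader studies, one such estimate is computed per reader-test combination.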
Typically, the researcher wants to account for two sources of variation in these studies: variation across patients and variation across readers. Although there are standard statistical methods for accounting for multiple sources of variation, these imaging studies present a unique challenge for the statistician because the outcome of interest, the reader-performance measure, is not indexed by case. Thus, for example, a conventional linear or generalized linear mixed model with reader and patient treated as random effects cannot be used. Presently, the standard analysis approach is the Obuchowski-Rockette method. I will present an introduction to the Obuchowski-Rockette method, describe its present level of development, and outline future areas of research.
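For orientation, a sketch of the Obuchowski-Rockette model in its commonly presented form (notation here is illustrative and may differ from the talk): the test-by-reader performance estimates are modeled as

\[
\hat{\theta}_{ij} = \mu + \tau_i + R_j + (\tau R)_{ij} + \varepsilon_{ij},
\]

where $\tau_i$ is the fixed effect of test (modality) $i$, $R_j$ is the random effect of reader $j$, and $(\tau R)_{ij}$ is their interaction. Rather than being independent, the errors $\varepsilon_{ij}$ are allowed to be correlated across cells, with one covariance for the same reader under different tests, one for different readers under the same test, and one for different readers under different tests, typically estimated by jackknifing or bootstrapping over cases. Inference about the $\tau_i$ then uses an $F$ statistic with degrees of freedom adjusted for these covariances, which is how the method accommodates an outcome that is not indexed by case.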
Data are quite important. And with big data, there are ever more data elements to contend with. The three V's of big data (velocity, volume, and variety) attest to this. But are all data created equal? NO. So the statistician has an ongoing and increasingly important role in ensuring that relevant, representative data are being analyzed. This talk will discuss where data analytics meets statistics and some of the great potential and, yes, the pitfalls, of deriving useful information from all those data. It also includes examples from the author's real-life experience, including court cases and presentations for the President.
Key words: Big data, analytics, statistician’s role, pitfalls