Statistical approaches to data integration

Summary
Task 47: Leveraging high-dimensional data for improved predictions and participant-level inference in the context of a restricted N

The retrieval and linkage of multiple EC and HDL from infectious disease cohorts enables researchers to obtain detailed information on the environment and lifestyle of thousands of individuals. The resulting high-dimensional data sets may then comprise different types of molecular measurements, such as DNA, RNA, and protein. In this task, we will explore how HDL data can be meaningfully combined across cohort studies and used for risk prediction.

A key concern is the danger of overfitting, which occurs when statistical models overemphasize the associations present in the available cohort data and fail to provide accurate predictions for new individuals. It is, for instance, often possible to develop a model that fits the data perfectly even in the absence of any true association. Rigorous penalization and validation are therefore crucial to ensure that developed prediction models are externally valid.

Additional concerns arise when pooling OMICS data from multiple cohorts. First, because the statistical power to detect predictive associations tends to increase, risk prediction models may become too complex and require a huge number of measurements to provide a personalized risk prediction. This may lead to unnecessary costs. Second, the presence of between-study heterogeneity across cohort studies may substantially affect the external validity of model predictions. For instance, recent studies have demonstrated that the performance of risk prediction models may vary according to the characteristics of gene expression data sets, and that these characteristics can vary within specific diseased populations. Heterogeneity in risk predictions may also appear when important causal pathways are ignored or when measurements are of different quality across data sets. If heterogeneity in risk predictions is ignored, prediction models may have limited applicability and require substantial revision before they can be used in clinical practice.

In this task, we will evaluate and extend statistical methods for dimension reduction and penalization in a meta-analysis context, in order to enable the development of generalizable risk prediction models from sparse and heterogeneous samples. To this end, we will build upon group LASSO and group-regularized ridge regression, and implement new penalties to reduce between-study variation in prediction error.
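
The overfitting concern described above can be illustrated with a small simulation. The sketch below is not the method proposed in this task: it uses plain ridge regression on simulated pure-noise data (no group LASSO or group-regularized ridge), and every name in it (`ridge_fit`, `mse`, the sample sizes, and the penalty value) is a hypothetical choice made for illustration. It shows that an unpenalized model with more features than participants can reproduce the training outcomes almost exactly even though no true association exists, while its error on new individuals stays large.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 40, 200, 200  # assumed sizes: far more features than participants

# Pure-noise data: the outcome is simulated independently of every feature.
X_train = rng.standard_normal((n_train, p))
y_train = rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, p))
y_test = rng.standard_normal(n_test)


def ridge_fit(X, y, lam):
    """Ridge coefficients via the dual form X^T (X X^T + lam I)^{-1} y.

    With lam = 0 and more features than rows, this is the minimum-norm
    least-squares solution, which interpolates the training outcomes.
    """
    n = X.shape[0]
    gram = X @ X.T
    alpha = np.linalg.solve(gram + lam * np.eye(n), y)
    return X.T @ alpha


def mse(X, y, beta):
    """Mean squared prediction error of the linear predictor X @ beta."""
    return float(np.mean((y - X @ beta) ** 2))


# Unpenalized fit: near-perfect on the training data, poor on new individuals.
beta_unpen = ridge_fit(X_train, y_train, lam=0.0)
train_mse_unpen = mse(X_train, y_train, beta_unpen)
test_mse_unpen = mse(X_test, y_test, beta_unpen)

# Penalized fit: the penalty value is an arbitrary illustrative choice.
beta_ridge = ridge_fit(X_train, y_train, lam=1000.0)
train_mse_ridge = mse(X_train, y_train, beta_ridge)
test_mse_ridge = mse(X_test, y_test, beta_ridge)

print(f"unpenalized: train MSE = {train_mse_unpen:.2e}, test MSE = {test_mse_unpen:.2f}")
print(f"ridge:       train MSE = {train_mse_ridge:.2f}, test MSE = {test_mse_ridge:.2f}")
```

In practice the penalty would be tuned by validation rather than fixed, and in a multi-cohort setting the held-out data would ideally come from a cohort not used for fitting, so that between-study heterogeneity is reflected in the estimated prediction error.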