Building Genetic Scores to Predict Risk of Complex Diseases in Humans: Is It Possible?

  1. Yiqing Song5
  1. Program on Genomics and Nutrition, Department of Epidemiology, University of California, Los Angeles, Los Angeles, California;
  2. Center for Metabolic Disease Prevention, University of California, Los Angeles, Los Angeles, California;
  3. Center for Human Nutrition, University of California, Los Angeles, Los Angeles, California;
  4. Department of Medicine, University of California, Los Angeles, Los Angeles, California;
  5. Division of Preventive Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts.
  1. Corresponding author: Simin Liu, siminliu{at}

Decades of research have identified numerous biomarkers for cardiovascular diseases (CVDs) and type 2 diabetes, providing molecular insights for improved treatment and prevention of these diseases (1–3). Of the biomarkers that can be objectively and systematically measured, genetic variants such as single nucleotide polymorphisms (SNPs) are unique in that they do not change over time, so the temporal sequence from genotype to phenotype can be clearly established for outcome prediction.

Using high-density fixed SNP arrays, recent genome-wide association studies (GWAS) have successfully identified multiple risk alleles related to CVD and type 2 diabetes. These advances in genomics present many exciting opportunities in three scientific domains: 1) integrating novel genetic variants into risk prediction models of complex diseases in humans, 2) characterizing new biological pathways involved in pathogenesis and thereby improving strategies for treatment and management, and 3) strengthening the inferences of traditional epidemiological work of public health importance. To capitalize on these opportunities, several groups have attempted to develop genetic risk scores for disease prediction by summing the number of risk alleles. However, almost all of these studies have concluded that current genetic information contributes little to distinguishing who will or will not develop CVD or type 2 diabetes among apparently healthy adults (4–6).

Given that most common risk variants identified so far confer relatively modest risk for these complex diseases (e.g., all risk alleles for type 2 diabetes identified by GWAS have very small relative risks [<1.50]) (7,8), the “common diseases-common variants” model has been formally challenged (9,10). In the field of complex disease genetics, it is now widely anticipated that ongoing next-generation sequencing work covering the whole genome in diverse populations will identify rare variants with large effect sizes in the coming years (8). Yet many questions must still be answered before genetic information can be appropriately incorporated into risk prediction models for complex diseases (Fig. 1).

FIG. 1. Assessing and integrating reliable genomic information in the development of clinical risk prediction models. CNV, copy number variant.

In this issue of Diabetes, Palmer et al. (11) report findings from using yet another gene-based score to predict stroke risk in a cohort of 2,182 patients followed for ∼6 years. The authors selected from prior work a set of five variants involved in inflammation and developed a score by summing the “at-risk” genotypes for those variants. By assigning a score of 1 for having at least one risk allele and 0 for noncarriers, Palmer et al. implicitly assumed that these five loci follow either dominant or recessive genetic patterns. Previously, Morrison et al. (12) advocated an additive model with weights of −1, 0, and 1, as did others (4–6).
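The difference between these scoring schemes can be made concrete with a minimal sketch. The locus names below are hypothetical placeholders (the five variants are not named in this commentary), and the mapping of 0, 1, and 2 risk alleles to weights of −1, 0, and 1 is an assumption about how the additive coding is applied, not a description of any published implementation.

```python
from typing import Dict

# Hypothetical locus names used only for illustration.
LOCI = ["locus1", "locus2", "locus3", "locus4", "locus5"]


def carrier_score(risk_allele_counts: Dict[str, int]) -> int:
    """Count loci with at least one risk allele (1 per carrier locus, 0 otherwise)."""
    return sum(1 for locus in LOCI if risk_allele_counts[locus] >= 1)


def additive_score(risk_allele_counts: Dict[str, int]) -> int:
    """Sum per-locus weights of -1, 0, 1 for 0, 1, 2 risk alleles (assumed mapping)."""
    return sum(risk_allele_counts[locus] - 1 for locus in LOCI)


def allele_count_score(risk_allele_counts: Dict[str, int]) -> int:
    """Sum the raw number of risk alleles across loci (range 0 to 10 for five loci)."""
    return sum(risk_allele_counts[locus] for locus in LOCI)


# One illustrative individual: number of risk alleles carried at each locus.
person = {"locus1": 2, "locus2": 1, "locus3": 0, "locus4": 0, "locus5": 1}

print("carrier score:", carrier_score(person))
print("additive score:", additive_score(person))
print("allele count score:", allele_count_score(person))
```

Note that the carrier score treats heterozygotes and risk-allele homozygotes identically, which is the information loss implied by the dominant coding discussed above.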

None of these studies, however, have attempted to weight the loci using regression coefficients from the specific proportional hazards function. Put simply in regression terms, Palmer et al. in effect converted a set of five dichotomous variables into a single ordinal variable in relating genetic variants to risk of stroke in their model. Whether this is reasonable depends on the nature of the genotype-disease relationship, which is inherently defined by the specific model form. With the use of a Cox proportional hazards model, an ordinal “at-risk genotype” score implies an exponential relationship in that each added “at-risk genotype” multiplies the baseline risk by a constant value corresponding to the antilogarithm of the regression coefficient (following the survival function $Y_i = 1 - \{S(t)\}^{\exp(A + B X_i)}$, where $Y_i$ is the predicted probability of developing stroke over time $t$, $t$ is the event-free follow-up time for individual $i$, and $X_i$ is the genotype score [0, 1, 2, 3, 4, 5]). Given that during a mean follow-up of ∼6 years none of these five variants was independently associated with stroke risk, the evidence in support of an exponential relationship between these genetic variants and disease risk appears weak. Only when the variants were converted into an ordinal variable did the association become statistically significant, with a hazard ratio of 1.34 for each “at-risk genotype.” This apparent gain in statistical efficiency can only be achieved with significant constraints that are model-dependent and thus has very limited implications for inference beyond the samples investigated by Palmer et al. (11).
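As a rough illustration of what this multiplicative assumption implies, the sketch below evaluates the survival-function form quoted above for scores 0 through 5, using the reported hazard ratio of 1.34 per “at-risk genotype.” The baseline survival of 0.95 is a purely hypothetical value chosen for illustration, and the intercept A is assumed to be absorbed into that baseline.

```python
import math

HR_PER_GENOTYPE = 1.34    # hazard ratio per "at-risk genotype" reported in the text
BASELINE_SURVIVAL = 0.95  # hypothetical S(t) for a score of 0; not from the study


def predicted_risk(baseline_survival: float, hazard_ratio: float, score: int) -> float:
    """Return 1 - S(t) ** exp(B * score), where exp(B) is the hazard ratio."""
    b = math.log(hazard_ratio)
    return 1.0 - baseline_survival ** math.exp(b * score)


for score in range(6):
    multiplier = HR_PER_GENOTYPE ** score  # hazard relative to a score of 0
    risk = predicted_risk(BASELINE_SURVIVAL, HR_PER_GENOTYPE, score)
    print(f"score={score}  hazard x{multiplier:.2f}  predicted risk={risk:.4f}")
```

Under this model form, a score of 5 multiplies the baseline hazard by 1.34 raised to the fifth power, roughly 4.32, regardless of which loci contribute; that is the constraint the ordinal coding imposes.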

It would be helpful to examine the distributions of traditional risk factors for specific types of stroke (e.g., family history, diet, physical activity, diabetes duration, and levels of glycemic control) by this genetic score. With an increment of only ∼1% in the area under the receiver operating characteristic curve, this ordinal genetic score (even with a strong linearity assumption on a multiplicative scale) apparently did not contribute to discrimination. Formal evaluation of prediction should also assess the improvement of fit from including each locus genotype separately, and the fit of the entire model, by computing likelihood ratio χ² statistics and Bayesian information criteria (which assess the fit of the entire model while taking into account the number of parameters).
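The quantities named above can be sketched in a few lines. This is a simplified illustration, not the analysis of Palmer et al.: the area under the curve here ignores censoring (a time-to-event analysis would typically use a concordance index instead), the P value formula applies only to a single added parameter, the log-likelihood values are hypothetical, and using the cohort size as the sample size in the Bayesian information criterion is an assumption (some analysts use the number of events for Cox models).

```python
import math
from typing import Sequence


def auc(scores: Sequence[float], events: Sequence[int]) -> float:
    """Probability that a random case scores higher than a random non-case (ties count 1/2)."""
    cases = [s for s, e in zip(scores, events) if e == 1]
    controls = [s for s, e in zip(scores, events) if e == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def likelihood_ratio_chi2(loglik_reduced: float, loglik_full: float) -> float:
    """Likelihood ratio statistic comparing nested models."""
    return 2.0 * (loglik_full - loglik_reduced)


def p_value_df1(chi2: float) -> float:
    """Upper-tail P value for a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; lower values indicate better penalized fit."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Hypothetical log-likelihoods for a clinical model with and without the genetic score.
loglik_clinical, loglik_with_score = -1200.0, -1195.0
n_obs = 2182  # cohort size from the text; the parameter counts below are made up

lr = likelihood_ratio_chi2(loglik_clinical, loglik_with_score)
print("LR chi-square:", lr, "P (1 df):", p_value_df1(lr))
print("BIC without score:", bic(loglik_clinical, 8, n_obs))
print("BIC with score:", bic(loglik_with_score, 9, n_obs))
```

Comparing the two criteria side by side shows why a statistically significant likelihood ratio test does not by itself imply better discrimination: the tests measure fit, whereas the area under the curve measures how well the score ranks cases above non-cases.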

Aside from using genetic variants for risk prediction, recent GWAS have also started to uncover potentially new biological targets for complex diseases. Since the first GWAS for type 2 diabetes was published in 2007 (13), subsequent efforts have confirmed at least 20 robust and well-replicated genetic loci associated with the disease (7). Interestingly, some of the identified regions had never been suspected to be involved in the pathophysiology of type 2 diabetes, including a common variant in the FTO gene (rs9939609) (14). Several studies have now confirmed the association between FTO variants and higher BMI and obesity in both children and adults (15,16). It was thus surprising that in building their risk score, Palmer et al. (11) chose to ignore recent GWAS findings for stroke (17) as well as many important candidate genes in the pathways of inflammation and endothelial dysfunction (18). It remains possible that the addition of a much larger number of common or rare risk alleles, based on a better understanding of the inflammatory mechanisms underlying CVD, could improve risk prediction.

Meanwhile, emerging evidence indicates sex differences in genetic susceptibility to CVD among diabetic patients (19). In the U.S., CVD mortality has declined substantially in recent decades among nondiabetic individuals; among diabetic patients, however, it has declined only in men and has increased significantly in women (20). The reason for the accelerated atherothrombotic events in diabetic women remains poorly understood. Traditional CVD risk factors such as hypertension and dyslipidemia cannot completely account for the apparent sex differences in the excess CVD risk associated with diabetes (19). Because inflammation and endothelial function are more seriously affected by diabetes in women than in men, and because diabetes may cause a greater shift toward an “android” pattern of obesity in women than in men (21), recent work has also intensified the search for sex-specific associations between variants of these genes and CVD risk and has developed sex-specific risk prediction models (19,22,23).

More importantly, future risk assessment for complex diseases should give much more careful consideration to gene-gene and/or gene-environment interactions. Complex diseases such as CVD and type 2 diabetes are influenced by both genetic and environmental factors. For example, most GWAS to date have been conducted in middle-aged and older adults, so the cumulative effects of multiple environmental exposures, or of other gene-gene or gene-environment interactions, in older age may have diluted a modest but real genetic effect that would be more apparent earlier in life. Such incomplete understanding of genetic and environmental causes and their interactions appears to have confounded those who attempted to identify a set of SNPs that could adequately explain or predict even a small fraction of complex diseases (24,25). As the field of genomics progresses, it is imperative to confirm and better characterize genetic variation (i.e., better resolution of our genomes) via fine-mapping, functional testing, integrated mechanistic analysis of intermediary phenotypes, and assessment of gene-environment interactions in multiple racial and ethnic groups. Multiethnic replications are useful in uncovering true susceptibility genes by identifying multiple significant hits within a specific region, which is particularly valuable given the allelic heterogeneity of genetic effects (different alleles may cause the disease in different populations) (26). Yet, even with this anticipated progress in the genomic sciences, the preventive utility of using a genetic score alone for common diseases in adults will likely be very limited, especially considering the myriad environmental factors that also influence the development of complex diseases.
With a better understanding of pathogenesis, however, integrating genetic variants with their biochemical phenotypes, as recently demonstrated in a study of sex-hormone–binding globulin and type 2 diabetes risk, should be a viable strategy to provide molecular insights and improve disease prediction (22,27). Ultimately, further efforts will be required to put valuable genetic information in the appropriate biological and clinical context (including cost-benefit evaluation following the principles of screening) to optimize risk assessment for prevention.


No potential conflicts of interest relevant to this article were reported.


  • See accompanying brief report, p. 2945.

Readers may use this article as long as the work is properly cited, the use is educational and not for profit, and the work is not altered. See for details.

