Genetic association studies: not as simple as you might think

Now that you are somewhat familiar with some of the main concepts in human genetics (thanks to the crash course in the previous post), we are ready to start talking about the analytical challenges of statistical genetics, and finally start delving into the open problems listed in the first post.

At this point, you should know what a variant is, what’s the difference between Mendelian and complex traits, and what GWAS tries to do.

So if you remember, running GWAS is in essence straightforward: we examine each of the genotyped variants, calculate a p-value for its association with the phenotype, and report all the variants with a p-value below a certain threshold (5e-08) as significantly associated with the trait. If we were just interested in statistical correlations, that would be pretty much it. But what we really care about, most of the time, is causal connections.

Just knowing that a variant is correlated with a trait is not very useful. If we want to really understand the effect, we first need to make sure that we have found a true causal association, and to pinpoint the exact causal variant(s). There are two main obstacles to doing that: population structure and linkage disequilibrium (LD).

Let’s begin with the first.

Population structure (or, if only people mated at random)

As any statistician will always be happy to tell you, whether or not you asked for their opinion: correlation does not necessarily imply causation. If x is associated with y, it’s generally a mistake to conclude, without further evidence, that x affects y. The most common reason for non-causal associations (known as spurious correlations) is confounding, which happens when a third variable z affects both x and y. The confounding variable z is called a confounder. There can be multiple confounders.

For example, if you compare children across different age groups, you will find that children with larger shoe sizes tend to also be much better at reading and writing. But it’s not because big shoes help children obtain a good grasp of grammar (nor will reading Dr. Seuss make your child’s feet grow any faster). It’s just because both feet and reading skills tend to expand as children grow older. In other words, age is a confounder.

In the case of genetics, if we see a variant correlated with a trait (say height), we need to ask ourselves: what other variables could possibly affect people’s height AND their genetics? The most likely explanation is that height (like pretty much any meaningful human trait) varies quite substantially across geographic regions. For example, the average height of males in the Netherlands is ~1.82 meters, compared to ~1.64 in Vietnam (or 6 feet vs. 5’5” if you are a savage who uses Imperial units). Genetics also varies a lot between regions. If we compared Dutch and Vietnamese genomes, we would find tons of variants with different allele frequencies between these two populations, and most of them would have nothing to do with height (or any other trait for that matter). Most likely, they would just reflect random genetic drift – the fact that the Dutch and Vietnamese people had different ancestors who happened to carry different alleles, and over the centuries, simply because Dutch ancestors tended to mate with other Dutch ancestors (while the Vietnamese ancestors did much the same), these ancestral alleles have come to dominate these separated human populations.

Geneticists use various terms to describe the fact that people come from different families, and therefore have different genetic and environmental backgrounds. You can call it population or ancestry; it doesn’t really matter. What matters is that the family you are born into affects both your genetics and your environment, and both affect your traits, meaning that it confounds the associations between genetics and traits.

Accounting for population structure

The same statistician who has just given you their unsolicited opinion about correlation and causation will also add that, if you encounter a confounding problem, the standard way to deal with it is through regression analysis. Basically, you take the variable z that you suspect of being a confounder, and add it (or some proxy of it) to your analysis as a covariate. That is, instead of asking whether x correlates with y, ask whether x correlates with y after adjusting for (or correcting for) z. For example, you’d test whether reading skill correlates with shoe size when considering children of the same age. I’m not aware of any study that has examined this pressing question, but I’d guess that the answer is no: once you account for age, the statistical correlation between reading skill and shoe size is totally gone, which would be a strong indication that they don’t really affect each other.
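To make the idea concrete, here is a minimal sketch (in Python, with simulated data and made-up variable names) of how adding a suspected confounder as a covariate can make a spurious correlation disappear:

```python
# A toy illustration (simulated data) of confounding and covariate adjustment.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(5, 12, n)                       # the confounder
shoe_size = 20 + 1.5 * age + rng.normal(0, 1, n)  # driven by age
reading = 10 + 5.0 * age + rng.normal(0, 5, n)    # also driven by age, not by shoes

# Naive model: reading ~ shoe_size (shoe size looks highly "significant").
naive = sm.OLS(reading, sm.add_constant(shoe_size)).fit()

# Adjusted model: reading ~ shoe_size + age (the shoe-size effect vanishes).
adjusted = sm.OLS(reading, sm.add_constant(np.column_stack([shoe_size, age]))).fit()

print("naive p-value for shoe size:   ", naive.pvalues[1])
print("adjusted p-value for shoe size:", adjusted.pvalues[1])
```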

So how do we apply this methodology in the context of human genetics? For a start, we could make sure that we run association studies on individuals belonging to the same high-level ancestry groups, for example by running the association tests on people of European and African ancestries separately1. That would help tremendously, and geneticists indeed do that, but that would only provide a partial solution to the problem. Even if we constrained ourselves to a cohort of a very distinct population (say white British people), some variants would still show distinct geographic patterns within that cohort, being more common for example in the south of the UK. Most traits would also show differences at this smaller resolution. For example, maybe the rates of asthma around London are different from those around Canterbury. So we are pretty much back at square one, having made only limited progress eliminating population effects.

A (made-up) variant with different prevalence across the UK.

If we are haunted by geographic patterns, why not account for geography directly? We could just include the home address of each individual (say their zip code) as part of the covariates we account for and be done with it. The problem is that what’s haunting us is not geographic patterns per se, but rather ancestry (which only correlates with geography). Your place of residence (or birth) doesn’t really affect your genetics; it’s your ancestry that does. The family you are born into affects many aspects of your life and determines your genetics, regardless of where you live. Overlaying these patterns on a map only makes it easier to see the problem.

What we really need is some way to measure people’s ancestries. It would be really helpful if we could assign each individual a vector of numbers (say 0.1, -2.1, 3.6, 4, 0.7) capturing their ancestry, and then include them as covariates and account for ancestry once and for all. But how do we project a concept as complex as ancestry (which, one could argue, is not even well defined) into a series of numbers?

Thankfully, it turns out there are some pretty effective ways to do that. By far the most popular method is principal component analysis (PCA). At a high level, PCA is a dimensionality reduction method. It allows you to take high-dimensional data (many variables) and embed it into a low-dimensional representation space (few variables) in a way that preserves as much information as possible. “Information preservation” means that you could recover the high-dimensional data from the low-dimensional representation with as much accuracy as possible (some of the information will be lost, of course). If you want the mathematical details of PCA or other dimensionality reduction methods, there is no shortage of tutorials and other content on the subject throughout the internet.
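If you want to see the mechanics without the math, here is a minimal sketch of computing genetic principal components with scikit-learn. It assumes a toy genotype matrix of 0/1/2 allele counts; real pipelines use dedicated tools (e.g. PLINK) and carefully selected, LD-pruned variants.

```python
# A minimal sketch of computing genetic principal components from a genotype
# matrix (individuals x variants, coded 0/1/2). Toy data for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(500, 1000)).astype(float)  # toy genotypes

# Standardize each variant before PCA (guarding against zero-variance variants).
std = genotypes.std(axis=0)
standardized = (genotypes - genotypes.mean(axis=0)) / np.where(std > 0, std, 1.0)

pca = PCA(n_components=10)
pcs = pca.fit_transform(standardized)    # shape: (500 individuals, 10 PCs)

print(pcs[:3])                           # each row: one individual's PC coordinates
print(pca.explained_variance_ratio_)     # variance captured by each component
```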

Back to genetics. What would happen if we ran PCA on variants? If you remember, that’s exactly what we saw in the short demo in the previous post. I ran PCA on 1,000 variants chosen randomly from chromosome 17, and this was the result (recall that colors correspond to the five main ancestries in the 1000 Genomes Project):

The PCA algorithm creates two new variables, PC1 and PC2, which are linear combinations of the 1,000 original variants, constructed to capture as much of the genetic variance across these 1,000 variants as possible. It turns out that the variables capturing as much genetic variance as possible are strongly associated with ancestry. It actually makes a lot of sense if you think about it. If you had to give me just one piece of information about someone, hoping that I could then guess their genetics across as many locations in the genome as possible, just telling me their ancestry would actually give away most of the information.

In this example we ran PCA on individuals of very diverse ancestries (e.g. European, African, Asian), so we got principal components that mostly capture these high-level distinctions. A better idea would be to first split the individuals based on high-level categories (such as self-identified ethnicity), and only then run PCA and GWAS on each group separately, using the principal components as covariates. That would allow the PCA to focus on sub-populations within each ancestry. In modern genetic studies, it is common to include somewhere between 5 and 40 principal components as covariates. Of course, unlike this toy example, PCA should be performed over all the genotyped variants in the genome (not just a subset of 1,000 random variants on chromosome 17).

Covariates are usually included through regression analysis, typically some form of linear or logistic regression (depending on whether the phenotype is continuous or binary). In linear regression, each variant is tested with the model y = β0 + β1 · c1 + … + βk · ck + βk+1 · g, where:

  • y is the phenotype value (e.g. height).
  • c1,…,ck are the covariates included in the model (principal components, age, sex, etc.).
  • g is the variant’s genotype value (typically 0, 1 or 2, depending on the number of alternative alleles an individual carries, but it can also be encoded in other ways, as we’ll see).
  • β0,…,βk+1 are the model’s coefficient parameters (that some optimization algorithm will fit to the data).

Next, a p-value is calculated for the null hypothesis H0: βk+1 = 0. In other words, the full model y = β0 + β1 · c1 + …+ βk · ck + βk+1 · g is compared to the null-hypothesis model y = β0 + β1 · c1 + …+ βk · ck (without the last term representing the tested variant). If including the variant significantly improves the model’s prediction of y (by some statistical measure that I won’t get into here), then the variant is considered statistically significant, after accounting for the covariates.
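Here is a minimal sketch of that comparison for a single variant, using simulated data and a likelihood-ratio test. Real GWAS software (e.g. PLINK or regenie) runs an equivalent test for millions of variants far more efficiently; this just spells out the logic for one variant.

```python
# Test one variant while adjusting for covariates, by comparing the full model
# (covariates + genotype) to the null model (covariates only).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 2000
covariates = np.column_stack([
    rng.normal(size=(n, 10)),        # e.g. 10 genetic principal components
    rng.integers(0, 2, n),           # e.g. sex
    rng.uniform(40, 70, n),          # e.g. age
])
genotype = rng.binomial(2, 0.3, n)   # 0/1/2 allele counts for the tested variant
phenotype = covariates @ rng.normal(size=covariates.shape[1]) \
            + 0.2 * genotype + rng.normal(size=n)

null_fit = sm.OLS(phenotype, sm.add_constant(covariates)).fit()
full_fit = sm.OLS(phenotype,
                  sm.add_constant(np.column_stack([covariates, genotype]))).fit()

# Likelihood-ratio test for H0: the genotype coefficient is zero.
lr_stat = 2 * (full_fit.llf - null_fit.llf)
p_value = stats.chi2.sf(lr_stat, df=1)
print(p_value)
```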

Following this protocol, it turns out, eliminates most of the population structure in genetic studies, and makes them pretty robust. But it still doesn’t solve the problem completely.

Residual population structure

This protocol – splitting the cohort into high-level ancestry groups and accounting for the principal components of genetic variation within each group through regression analysis – has been the standard way of running association studies over the last decade or so. And it’s worked pretty well overall. 

But increasingly, there are concerns that this is not enough. True, following this protocol eliminates most of the population structure, but some residual population structure remains unaccounted for. This residual population structure is not a big issue when we study cohorts of thousands or even tens of thousands of individuals, but nowadays, with improved genotyping technology (and a lot of money), we have crossed into the hundreds-of-thousands range and are heading towards cohorts of millions of individuals (and beyond). At these huge scales, even small biases can make a statistical test come out positive when it shouldn’t.

As we’ll soon see, there are quite promising strategies to deal with the problem of residual population structure. But it’s important to stress that it is still very much an open problem that the field is grappling with. Methodological papers are being published on a regular basis, and there is definitely room for more innovation.

So that was the introduction of the first open problem in this series (open problem #1: population structure), out of the 16 open problems I presented at the beginning. More open problems will follow soon (note that, generally, they will not be presented in the same order they are listed in the table).

Before we introduce more open problems, let’s review some promising strategies to deal with residual population structure.

Better correction methods

We saw that PCA is quite effective at accounting for population structure, but PCA is a rather simple method, and by no means the best we can do. A more modern approach is called linear mixed models (LMM). Again, I will spare you the mathematical details, but the general idea is to account for more nuanced population structure in addition to the main axes of variation captured by PCA. There are many LMM algorithms and implementations that have been experimented with, and frankly I don’t know the details of most of them.
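To give a flavor of what this looks like (this is a generic sketch of a common formulation, not the exact model of any specific tool), the linear model from above is extended with a random genetic effect per individual: y = β0 + β1 · c1 + … + βk · ck + βk+1 · g + u, where u is a random effect whose covariance between individuals is proportional to a genetic relatedness (kinship) matrix estimated from all genotyped variants. Intuitively, instead of summarizing ancestry with a handful of principal components, the relatedness matrix lets the model absorb fine-grained genetic similarity between individuals, including cryptic relatedness within the cohort.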

Trying to correct and account for more and more of the population structure can be helpful, but it’s also a dangerous game to play. If you correct too much, you may end up accounting for variables that are fundamentally tied to the variant you are testing, which will eliminate some of the actual causal signal. It’s difficult to say exactly where we should stop – where the genetic variation that is due to population structure ends and the genetic variation that is due to individual differences begins (recall that the whole concept of “population” or “ancestry” is a bit vague to begin with).

Also, when we study human groups that are too homogeneous, we are at risk of underplaying the potential genetic basis of phenotypic differences between groups. Let’s say that we want to understand the genetic basis of skin pigmentation. If we study people of European and African ancestry separately (as one usually does), we may capture the genetic factors that make some Europeans or Africans slightly darker skinned than others within the same ancestry group, but we might miss the (in this case, much more substantial) genetic effects that give individuals of African descent much more skin pigmentation than individuals of European ancestry.

Skin pigmentation is just an easy-to-comprehend example of human group differences determined by genetics. More importantly, many human diseases show dramatically different prevalence across ancestry groups, and it could be very helpful to know to what extent it is because of genetic or environmental differences (and what the exact genetic and environmental factors are).

Cross-ancestry studies

As we’ve seen, ancestry diversity presents many methodological pitfalls for genetic studies, and it needs to be accounted for. But it’s also a blessing, and can provide another strategy to deal with population structure.

To see why, let’s imagine that some genomic region came out GWAS-significant in some human ancestry, say East Asian. By now we are well aware that this doesn’t necessarily mean that the genomic region is actually causal, because the association could be driven by subpopulations within the East Asian ancestry group. For example, maybe Chinese individuals carry the tested variant more often and are more susceptible to the disease we are testing than Japanese individuals. Such a pattern could be perfectly sufficient for a spurious association to pop up. But now let’s imagine that the same genomic region is also GWAS-significant in a whole different human ancestry, say African. Of course, it could be that just by coincidence we observe the exact same pattern in sub-populations of this totally different ancestry, but it starts sounding much less plausible (and all the more so the more fundamentally different the ancestries we test). In other words, if the same genomic region is replicated across multiple independent ancestry groups, we get much stronger overall evidence for the causality of the association.

Sadly, individuals of European descent are massively overrepresented in present-day genetic cohorts, which is another open problem on our list (open problem #7: ancestry diversity). This representation bias really sucks for underrepresented groups (pretty much everyone who is not of European descent), because so many of the findings that get published in genetic studies apply only to individuals of European ancestry.

This bias has very real implications when people go to the clinic to get genetics-based advice. If you are non-white, there is a much higher chance you will hear your doctor say “ummm, I really don’t know whether this applies to you, there is just not enough data on your group right now”.

Less trivially, this situation also sucks for whites (though to a lesser extent). Because we don’t run enough cross-ancestry studies, we are giving up on a very powerful tool for studying human genetics, and we are left with a lot of uncertainty about the causality of genetic associations, which slows down the pace of biomedical research for everyone (including whites).

GWAS is overrepresented by people of European ancestry, and things are not getting much better. Source: Martin et al., Nature Genetics (2019).

Family-based studies

Before we move on to the next topic, let’s talk about a third promising strategy to deal with population structure. This one is really my favorite. The idea here is to embrace ancestry patterns to the fullest. Instead of aiming for a cohort of “unrelated” individuals and hoping we could make the kinship ties go away, we will instead recruit families where we know exactly how those ties play out, and fully account for them.

The simplest case would be trio studies, where each family is just a triplet of a mother, a father and a child. Imagine that we want to test a variant with two alleles, A and B. To make things simple, let’s focus on families where one parent is AA (i.e. carrying two copies of the A allele) and the other is AB (i.e. carrying one copy of each allele). It doesn’t matter which is the mother and which is the father. According to the laws of inheritance, we know there’s exactly a 50% chance that the child will be AA and exactly a 50% chance they will be AB. Now, we can test how this perfectly-understood background probability changes in the presence of the phenotype. If among trios where the child has a disease we see a significant deviation from the expected 50-50 frequency, that would indicate that the tested genomic region affects the risk of getting the disease.
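As a minimal sketch (with made-up counts), the test for such trios boils down to a simple binomial test of whether affected children deviate from the expected 50-50 split:

```python
# Among families where one parent is AA and the other is AB, an affected child
# is expected to be AB 50% of the time; we test for a deviation from that.
# The counts below are made up for illustration.
from scipy.stats import binomtest

n_children_AB = 620   # affected children who inherited the B allele
n_children_AA = 500   # affected children who did not
n_total = n_children_AB + n_children_AA

result = binomtest(n_children_AB, n_total, p=0.5)
print(result.pvalue)  # a small p-value means transmission deviates from 50/50
```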

Because the background probability is so simple and well-understood, we don’t even need controls (children without the disease) in this type of study, even though in practice it might be a good idea to test them as well (to make sure we have not messed things up). In principle, we could conduct a trio study using only cases.

Family-based studies are the only type of observational genetic studies that eliminate the problem of population structure altogether. Because we fully account for the genetics of each individual’s parents, there is no more room for ancestry to sneak in and confound our studies. Unfortunately, family-based studies are quite rare nowadays, and most genetic resources are still based on the “unrelated”-cohort study design (open problem #6: family-based vs. population-based cohorts).

There are several reasons why family-based studies are not very popular, even though they solve many methodological problems (and I really think they should be more popular). The main reason, I think, is that it’s much more difficult to recruit families. For a start, you need to get everyone’s consent and participation. If the father won’t cooperate, the mother and child are gone too. Another problem is that you need much more data. In the trio case, every three individuals end up providing you with only one data point (the child). Studying bigger families would be more efficient in that way (but more difficult to recruit). For example, a family of five (two parents and three children) can give you three effective samples, which is a much better ratio (⅗ instead of ⅓).

Personally, I believe that family studies will inevitably become more popular, because it really appears to be the only way to conclusively answer some questions in human genetics.  

The convoluted puzzle of LD

At the beginning of this post I mentioned two main obstacles to concluding that a significant genetic association is causal. We’ve spent some time discussing the first (population structure), so let’s now turn to the second: linkage disequilibrium (LD). If you remember from the previous post, LD is when different variants in the genome are statistically correlated.

The reason for LD is a biological phenomenon called homologous recombination. Remember that each cell in our body contains two copies of each autosomal (non-sex) chromosome: one from our mother and the other from our father. When cells divide (as happens all over our bodies all the time), the genome generally goes unchanged (aside from occasional DNA replication errors leading to mutations). But what about gamete cells (sperm or egg cells), which are passed on to the next generation? If they also went unchanged, that would be a pretty lame way for our genetic material to move along generations. This is somewhat of a digression, but let’s spend a moment to see why.

If parental chromosomes never mixed, we would inherit a chromosome from our mother that she got from one of her parents, that they inherited from one of their parents, and so on. In essence, if that were how inheritance worked (it isn’t), we would end up with the same repertoire of chromosomes passing between generations for eternity (aside from natural selection of entire chromosomes and occasional mutations). Such a limited genetic repertoire would make sexual reproduction sort of pointless and wouldn’t allow natural selection to do much with it. If, for example, several variants with some selective advantage appeared (across different individuals), evolution would have no way of consolidating those variants into the same chromosome, meaning that each individual could carry only one or two of the useful variants, but not all of them. We might as well just reproduce asexually.

But, thankfully, that’s not how inheritance works. Homologous recombination allows each pair of homologous chromosomes to shuffle together when gamete cells are formed. This means that the copy of chromosome 17 that you got from your mother isn’t an exact copy of that chromosome from either your grandmother or grandfather on your mother’s side. Rather, it’s some mixture of the two. So you have a copy of chromosome 17 that comes entirely from your mother, but it doesn’t come entirely from either of her parents.

I will not get into the biochemical details of how homologous recombination works. At a high level, stretches of DNA that are sufficiently similar to one another get swapped (specifically, corresponding genomic regions on the two copies of each chromosome). When all is said and done, you get two chromosomes that are somewhat shuffled, but the process is really far from a perfect shuffle. In fact, there are only a few crossover events per chromosome per generation. So variants on the same chromosome do get a chance to shuffle over a sufficient number of generations, but two variants that are close to each other are very likely to end up being inherited together.

At the population level, the net result of all of that is that variants on the same chromosome are often correlated (in the sense that people with a specific allele of one variant will tend to have a specific allele of another variant), and the closer variants are on the chromosome, the more likely they are to be strongly correlated.
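To make this concrete, here is a minimal sketch (with simulated genotypes) of measuring LD between variants as the squared correlation (r²) of their genotype counts. Real LD estimates are often computed from phased haplotypes, but the genotype-based correlation shown here is a common approximation.

```python
# Measure LD between variants as the squared correlation (r^2) of genotype counts.
import numpy as np

rng = np.random.default_rng(3)
n_individuals, n_variants = 1000, 6

# Simulate correlated genotypes: variant j is a progressively noisier copy of variant 0,
# mimicking the decay of LD with distance along the chromosome.
base = rng.binomial(2, 0.4, n_individuals)
genotypes = np.column_stack([
    np.clip(base + rng.binomial(1, 0.1 * j, n_individuals)
            - rng.binomial(1, 0.1 * j, n_individuals), 0, 2)
    for j in range(n_variants)
])

r = np.corrcoef(genotypes, rowvar=False)  # pairwise correlation between variants
r_squared = r ** 2
print(np.round(r_squared, 2))             # "nearby" variants show high r^2
```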

For example, the LD pattern in a small DNA region containing just 6 variants can look something like this:

In a somewhat larger region it will look something like this:

What does it mean for GWAS?

Imagine that some causal variant affects a trait, but this variant is in LD with some other variants (which is a fancy way of saying that they are correlated). In that case, not only would the causal variant be GWAS-significant for the trait, but so would many other variants (those in LD with the causal variant). If we zoomed into that region of the Manhattan plot, we would see something like this:

Very often, the most significant variant in a region (with the most extreme p-value) is not a causal variant. Rather, it obtains its strong statistical association only because it’s in LD with a causal variant (plus some random luck). Add to that the fact that there could be multiple causal variants in a given region (for example, different variants affecting the amino-acid sequence of the same protein-coding gene) and what you end up getting is a pretty messed up genetic sudoku.

Given this LD puzzle, even when population structure is properly accounted for and geneticists declare that they have found a causal genetic association, they will usually report an entire genomic region rather than individual variants, because in most cases no one really knows which of the variants is causal (open problem #12: from association to causality). Each peak on the genome-wide Manhattan plot, with all the variants it covers, is considered a distinct genetic association.

I’d say that the problem of LD and causality is more conceptually challenging than population structure, because if two variants are in close to perfect LD (always appearing together), there is just no purely statistical way to distinguish between them (without considering other pieces of evidence).

Why do we really care which of the variants are causal?

Let’s pause for a minute and consider why this is even a problem. For some purposes, we don’t really need to know which of the variants is causal, and it could be sufficient just to know that there’s a statistical association, or that the region as a whole is causal. For example, if we want to assess someone’s risk for heart disease, it could be enough to know that they have a specific variant strongly associated with heart disease, whether or not it’s an actual causal variant.

But for many goals, we first need to know which of the variants is causal. Such goals include:

  • In-depth study of the causal variants (often through cellular or animal experiments), to better understand the disease and its genetic basis (for example, if there are causal variants in a gene involved in some metabolic process, it’s a strong indication that this metabolic process might have something to do with the disease).
  • Guiding the development of new drugs and treatments (for example by targeting the hypothetical metabolic process, or even the genes or variants themselves). 

Several years from now, genetic editing technology may become mature enough to allow the editing of embryonic genomes, for example to minimize or eliminate the risk that newborns will develop certain diseases during their lifetimes. The idea of genetically modifying babies obviously raises many ethical and medical concerns, which I will not get into here. Whether or not you think that would be a good idea society-wise, doing so would first require us to identify the causal variants.

Fine-mapping strategies

So we have implicated a genomic region as causal for a trait, and we want to know which of the variants in this region are behind the genetic effect. A good starting point would be to try to narrow down the list of variants and identify a subset that contains the causal genetic factor(s) with high probability, a process called fine-mapping of GWAS results.

There is a huge literature on fine-mapping techniques. These methods often involve some sort of Bayesian reasoning – integrating multiple sources of evidence to assess the overall probability that a variant is causal, or that a set of variants contains the causal effect(s).
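As one illustrative example (a sketch of the general idea, not of any specific published tool), here is how Wakefield-style approximate Bayes factors can be computed from a region’s GWAS summary statistics and turned into posterior inclusion probabilities, under the simplifying assumption of a single causal variant per region. The effect sizes below are made up for illustration.

```python
# Wakefield-style approximate Bayes factors (ABFs) from GWAS effect estimates
# and standard errors, normalized into posterior inclusion probabilities (PIPs)
# under a single-causal-variant assumption with equal priors.
import numpy as np

betas = np.array([0.12, 0.10, 0.03, 0.11, 0.01])  # GWAS effect estimates (made up)
ses = np.array([0.02, 0.02, 0.02, 0.03, 0.02])    # their standard errors (made up)
W = 0.04                                          # prior variance of true effects (0.2^2)

z = betas / ses
V = ses ** 2
r = W / (W + V)
log_abf = 0.5 * np.log(1 - r) + 0.5 * r * z ** 2  # log approximate Bayes factor

pip = np.exp(log_abf - log_abf.max())             # normalize in log space for stability
pip /= pip.sum()
print(np.round(pip, 3))                           # posterior inclusion probabilities
```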

What sources of evidence can we rely on other than the statistical genetic studies used to implicate the genomic region in the first place? Most of the time it will be some genomic annotations. For example, knowing that variant A changes the amino-acid sequence of a coding gene, whereas variant B is pretty far from any known functional element in the genome, might provide additional evidence in favor of variant A over B.

Fine-mapping methods can be useful for narrowing down the list of suspected variants, but in the end they have limited power to fully solve the problem. Even if a variant seems less plausible because it doesn’t appear to affect anything substantial in the genome, we need to keep in mind that we are still far from knowing everything there is to know about the human genome. Perhaps the variant has an as-yet-undiscovered functional effect. At best, fine-mapping will give us a sufficiently short and plausible list that we can start testing experimentally.

I will not get into how the effects of variants can be tested experimentally. Just as an example, in some cases it’s possible to run animal experiments (usually on unlucky mice). There are many assumptions going into these sorts of experiments, for example that the disease and gene that we study have good analogues in the animal. But if we manage to induce a similar genetic change in the animal and get a phenotypic effect that looks similar (to what we statistically observe in humans), that would be strong evidence that we have detected a true causal effect.

Fine-mapping through cross-ancestry studies

Remember that we spoke about how non-white individuals are underrepresented in GWAS, and how powerful cross-ancestry studies could be for overcoming the problem of population structure (open problem #7: ancestry diversity)? It turns out that cross-ancestry studies are also super useful for resolving some of the LD puzzle and fine-mapping GWAS results, for much the same reason that they are useful for dealing with population structure.

Just like different ancestry groups show distinct (sub)population structures, they also have very different LD patterns. If the exact same variant is significant across multiple ancestry groups, each having a totally different LD pattern (in particular, a different set of other variants correlated with that variant), then we end up with much stronger overall evidence for this variant being causal.

But even cross-ancestry studies are not a magical panacea. When we compare GWAS results across different groups, we assume that the causal effects are the same in these groups, which is not always the case. A variant can be causal in the context of one group but not another (or have different effect sizes), due to interactions with other genetic or environmental factors (I’ll talk about how exactly this plays out in a future post). Another common problem is that a variant may be common enough in only one ancestry group, so we can’t really study it in other ancestries.

Burden tests and gene-based methods

I hope you are convinced by now that establishing the causality of specific variants is really hard. But just saying that we found a significant genomic region and calling it a day is usually not good enough either. Are we totally screwed?

In many cases there is a viable middle ground. Saying with certainty exactly which variant is causal might be too difficult, but maybe we could make causal claims at a somewhat lower resolution. For example, maybe we could say that an entire gene is causal, even though we don’t know exactly which variants modify its function and lead to the phenotypic effect.

Making causal claims about whole genes can often be sufficient to meet our goals. We will still have a good lead for starting to interrogate the mechanism of the genetic effect (for example, trying to knock out the gene or modify it in different ways and see what happens), and we will still have a clear target we could go after pharmaceutically.

Because significant genomic regions often cover multiple genes, just picking the gene closest to the most significant variant would in most cases be too naive. But from a statistical perspective, working at the resolution of genes puts us at a much better position, because now we have a lot more data (multiple variants) that could point us in the right direction. 

A simple but powerful approach is to conduct burden tests. The idea, as the name suggests, is to look for genes that appear to be overwhelmingly burdened by a large number of significant variants (preferably LD-independent ones). If we see multiple variants in the same gene that are not substantially correlated with one another, all associated with the trait, that would be strong evidence that the gene is causal. 
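Here is a minimal sketch of the simplest version of this idea (simulated data): collapse a gene’s variants into one per-individual count of alternative alleles, and test that count against the phenotype, with covariates, just like a single variant.

```python
# A simple burden test: sum a gene's rare-variant allele counts per individual
# and regress the phenotype on that burden score (with covariates).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_individuals, n_rare_variants = 5000, 30

# Toy genotypes for the rare variants in one gene (mostly zeros).
gene_genotypes = rng.binomial(2, 0.005, size=(n_individuals, n_rare_variants))
burden = gene_genotypes.sum(axis=1)                 # per-individual burden score

covariates = rng.normal(size=(n_individuals, 10))   # e.g. principal components
phenotype = covariates @ rng.normal(size=10) + 0.3 * burden \
            + rng.normal(size=n_individuals)

X = sm.add_constant(np.column_stack([covariates, burden]))
fit = sm.OLS(phenotype, X).fit()
print(fit.pvalues[-1])   # p-value for the burden term
```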

Here too we can incorporate prior knowledge about the genome to get more statistical power. For example, if we see that variants that appear to damage the protein product of the gene tend to be significant (as well as variants in LD with those variants), while other variants tend not to be significant, that’s even more evidence to judge the gene guilty.

Summary

Finding causal genetic associations based on GWAS is not as simple as it may appear at first. The main two hurdles are population structure and linkage disequilibrium. Luckily, there are many promising strategies to deal with these problems. This is a very active topic of methodological research. 

While discussing these challenges, we’ve covered four open problems in contemporary statistical genetics: open problem #1: population structure, open problem #6: family-based vs. population-based cohorts, open problem #7: ancestry diversity and open problem #12: from association to causality.

Our tour of open problems in statistical genetics will continue in an upcoming post.

Footnotes

  1. While it’s helpful for avoiding confounding, separating association tests by high-level ancestry groups has the major drawback of excluding people of mixed ancestries.
