Understanding the genetic basis of the human condition – 16 analytical challenges

A data scientist’s guide to statistical genetics

Our genetics shapes almost every part of ourselves we care about: how we look and behave, what we are good or bad at, and which medical conditions we are most likely to encounter in life. Virtually all top causes of disease burden in developed countries, whether it’s cancer, heart disease or addiction, have genetics as a primary factor. But we don’t yet understand most of these genetic effects.

In this series of posts, I’ll try to explain why we don’t understand the genetics of most human traits very well, and what concrete problems need to be solved to change that. But first, I want to acknowledge that we do already know many valuable things about human genetics. Nowadays, geneticists routinely counsel parents who carry dangerous mutations and worry about passing them on to their children, and they diagnose children with severe developmental disorders. We can also identify some individuals at very high risk of developing cancer (like Angelina Jolie), and in some cases, we can choose personalized treatments that suit the genetic profiles of patients.

But in most other settings, we can’t (yet) make much better decisions with genetic tests than we would make without them, even though in principle we should be able to. For example, we don’t know how to accurately estimate someone’s predisposition to schizophrenia based on their genetic profile.

So how do we get a better understanding of how our genetics affects us? To a large extent, this is an analytical challenge of collecting and making sense of genetic and medical data. I recently published a review paper listing 16 open problems in statistical genetics that we considered to be the main bottlenecks for closing this knowledge gap. Here’s the list:

Assuming you don’t have a background in statistical genetics, you will probably not understand much of this list at this point. Throughout this series of posts, I’ll go through it and provide the background needed to understand these open problems. Along the way, we’ll get up to speed with some of the state of the art in genetics.

One of my premises here is that there are interesting and important analytical problems in this area, and that people with strong analytical skills could make meaningful contributions. Take, for example, open problem #14 (genotype-to-phenotype prediction performance). This is essentially a machine-learning problem: given someone’s genetics (x), make a good prediction of some human trait (y).
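To make the x → y framing concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the genotype encoding (each marker stored as 0, 1 or 2), the number of markers, the purely additive trait, and the deliberately naive predictor that fits one marker at a time and sums the results. It is a toy baseline, not how state-of-the-art predictors work.

```python
import random

def simulate(n_people, effects, noise_sd, rng):
    """Toy data: genotypes as 0/1/2 values and a trait that is additive plus noise."""
    X, y = [], []
    for _ in range(n_people):
        row = [rng.choice((0, 1, 2)) for _ in effects]
        trait = sum(b * g for b, g in zip(effects, row)) + rng.gauss(0, noise_sd)
        X.append(row)
        y.append(trait)
    return X, y

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Least-squares slope of y regressed on a single variable x."""
    mx, my = mean(x), mean(y)
    var = sum((a - mx) ** 2 for a in x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / var if var else 0.0

def correlation(a, b):
    """Pearson correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
n_markers = 20  # hypothetical; real datasets have vastly more markers
true_effects = [rng.gauss(0, 1) for _ in range(n_markers)]
X_train, y_train = simulate(2000, true_effects, 2.0, rng)
X_test, y_test = simulate(500, true_effects, 2.0, rng)

# "Training": estimate one weight per marker from the training set only.
weights = [slope([row[j] for row in X_train], y_train) for j in range(n_markers)]

# "Prediction": a weighted sum of each test individual's genotypes.
predictions = [sum(w * g for w, g in zip(weights, row)) for row in X_test]
r = correlation(predictions, y_test)
print(f"correlation between predicted and observed trait: {r:.2f}")
```

The simulated trait is easy to predict because it was generated by the same kind of additive model the predictor assumes. Real traits offer no such guarantee, which is part of what makes the problem open.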

Problem #1 (population structure) is a classic case of statistical confounding. We find statistical associations between genetic factors (x) and a trait (y) but, rather than reflecting a causal effect x → y, these are often just the result of a confounding third variable (ancestry in this case) affecting both x and y. For example, the trait of having red hair is much more common in Scotland than in Romania. This means that if we look for red-hair genetic associations across Europeans, we are likely to find pretty much any genetic marker that is more common in the Scottish subpopulation, even though the vast majority of these markers have nothing to do with hair color. There are ways to partially address this, but no perfect solutions.
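The confounding described above is easy to reproduce in a small simulation. In this sketch, the two subpopulations, the marker frequencies and the trait frequencies are all made-up numbers chosen for illustration. The marker has no effect on the trait by construction, yet it looks associated with the trait when the subpopulations are pooled.

```python
import random

def correlation(a, b):
    """Pearson correlation between two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(1)
people = []
for _ in range(20000):
    pop = rng.choice(("A", "B"))  # two hypothetical subpopulations
    # Ancestry shifts both the marker frequency and the trait frequency...
    marker_freq = 0.8 if pop == "A" else 0.2
    trait_freq = 0.30 if pop == "A" else 0.02
    # ...but marker and trait are drawn independently: no causal link.
    marker = 1 if rng.random() < marker_freq else 0
    trait = 1 if rng.random() < trait_freq else 0
    people.append((pop, marker, trait))

def association(rows):
    """Marker-trait correlation within the given set of individuals."""
    return correlation([m for _, m, _ in rows], [t for _, _, t in rows])

pooled = association(people)
within = {p: association([row for row in people if row[0] == p]) for p in ("A", "B")}

print(f"pooled association:   {pooled:.3f}")
for p, value in within.items():
    print(f"within population {p}: {value:.3f}")
```

Analyzing each subpopulation separately makes the spurious association vanish here, but only because ancestry in this toy example is a clean, known, two-valued label. Real ancestry is neither.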

Open problem #2 (non-additive and epistatic genetic effects) asks to what extent linear models are good approximations of genetic effects, and whether interesting nonlinearities can be found. For example, how common is it for a given trait (say alcoholism) to have two genetic factors A and B, such that having both will influence the trait, but neither A nor B is sufficient to have an effect on its own?
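Here is a minimal sketch of what such an A-and-B effect looks like in data, under invented assumptions: two binary factors, each present in half of the individuals, and a trait that shifts by one unit only when both are present. A purely additive model predicts the both-present group from the other three group means, so its error on that group measures the interaction.

```python
import random

rng = random.Random(2)

# Collect simulated trait values for each combination of the two factors.
groups = {(a, b): [] for a in (0, 1) for b in (0, 1)}
for _ in range(8000):
    a = rng.randint(0, 1)
    b = rng.randint(0, 1)
    # Hypothetical epistatic effect: the trait shifts only if A and B co-occur.
    effect = 1.0 if (a == 1 and b == 1) else 0.0
    groups[(a, b)].append(effect + rng.gauss(0, 0.5))

means = {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Under a purely additive model, the effects of A and B simply add up:
# mean(A and B) = mean(A only) + mean(B only) - mean(neither).
additive_prediction = means[(1, 0)] + means[(0, 1)] - means[(0, 0)]
interaction = means[(1, 1)] - additive_prediction

for key in sorted(means):
    print(f"A={key[0]} B={key[1]}: mean trait {means[key]:.2f}")
print(f"additive prediction for A=1 B=1: {additive_prediction:.2f}")
print(f"deviation from additivity:       {interaction:.2f}")
```

With only two factors, the interaction is obvious from a table of four means. The difficulty in practice is that the number of candidate pairs grows quadratically with the number of genetic factors, so there are far more combinations to search and far fewer individuals per combination.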

I will stop here to not spoil the rest of the series (and also because more background is needed to really talk about these topics). The next post in the series will be a brief crash course on human genetics, where I’ll provide the necessary terminology and background. The post afterwards (the third in the series) will open our tour of statistical genetics and open problems.

