Human genetics 101

The purpose of this post is to lay out some basic concepts and background in human genetics, which we’ll need to discuss the challenges in the field. If you are already familiar with molecular biology and want to jump straight ahead to the challenges, feel free to skip or gloss over this post.

As we’ll see, it’s almost as if our genomes were designed to be analyzed with data-science packages. Unlike most biological systems, our DNA is, literally, a digital code.

DNA

Deoxyribonucleic acid, commonly known as DNA, is the molecule carrying our genetic information across generations. It’s famous for its double helix shape.

DNA is made of building blocks called nucleotides (DNA itself is a nucleic acid, that is, a chain of nucleotides). There are 4 different types of nucleotides (represented by distinct letters): adenine (A), cytosine (C), guanine (G) and thymine (T). DNA sequences can be represented as strings of A, C, G and T letters, for example ATGTTTGTTTTTC…

The DNA molecule is normally composed of two strands (hence the double helix shape), where nucleotides on one strand chemically bind to complementary nucleotides on the other strand. Specifically, due to the chemical properties of the four nucleotides, A always binds to T, and C always binds to G. In other words, A & T and C & G are complementary nucleotide pairs.

Being composed of two complementary strands, DNA molecules are somewhat redundant: they contain exactly the same information twice. For example, seeing A and G on one strand, you can tell that the complementary strand should have T and C at these locations.
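
To make the redundancy concrete, here is a minimal Python sketch that derives one strand from the other. The function names are my own, made up for illustration. One fact worth adding: the two strands run in opposite directions, so the other strand is conventionally written as the reverse complement.

```python
# Map each nucleotide to its complementary partner (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Position-by-position complement of a DNA string."""
    return strand.upper().translate(COMPLEMENT)

def reverse_complement(strand):
    """The complementary strand, written in its own reading direction."""
    return complement(strand)[::-1]

print(complement("AG"))            # TC
print(reverse_complement("ATGC"))  # GCAT
```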

Because DNA is typically double-stranded, DNA nucleotides are often referred to as base pairs. Both “nucleotides” (abbreviated “nt”) and “base-pairs” (abbreviated “bp”) are commonly used as a measure for the length of DNA sequences. For example, one may refer to a “750 bp genomic region” (or, interchangeably, 750 nt), meaning a stretch of DNA made of 750 (pairs of) letters.

Chromosomes

Our genetic material is not made of a single, long DNA molecule. Rather, it’s made of a few dozen long DNA molecules called chromosomes. Each human chromosome is a continuous, linear piece of DNA (of up to ~250 million base-pairs in length).

Humans are a diploid organism, meaning that we normally have two different copies of each chromosome (not to be confused with the two strands of DNA that each chromosome copy is made of). We inherit one set of chromosomes from our mother, and the other set from our father. While the two strands of complementary DNA contain exactly the same information, the two copies of each chromosome are only almost identical – they contain important differences.

The sex chromosomes, named X and Y, are special types of chromosomes that determine the sex of an individual. Females have two copies of the X chromosome, while males have one X and one Y chromosome (this is the only normal case of single-copy chromosomes). In addition to the sex chromosomes, we have 22 pairs of non-sex chromosomes, called autosomal chromosomes. The names of these chromosomes are even simpler: they are numbered 1 through 22.

Altogether, the human genome is made of 23 pairs of linear chromosomes. The total length of the human genome (not including copies) is ~3 billion letters. Taking into account the two copies of each chromosome, it becomes ~6 billion base-pairs. If we also consider the two strands of each DNA molecule (just for educational purposes – no geneticist would bother keeping track of that), we are talking about ~12 billion nucleotides.

Genes, RNA & proteins

So how does our genome determine who we are? How do ~3 billion letters of DNA make meaningful things happen in our bodies? A key part of the answer is outlined in a beautiful but simple theory called the central dogma of molecular biology, which describes the flow of biological information inside each of our cells. First, specific stretches of DNA (called genes) are transcribed into a similar type of molecule called RNA. Then, some of these RNA molecules are translated into proteins (a different type of molecule).

So what is RNA exactly? RNA molecules are very similar to DNA, and consist of the same 4 nucleotides (with the unimportant difference of a U letter instead of T). The most important type of RNA molecule is called mRNA, standing for messenger RNA. An mRNA molecule (transcribed from the matching piece of DNA) is translated into a protein.

Proteins are the building blocks of life. We are made of structural proteins such as collagen, the “glue” that connects and holds the cells in our body together, or keratin, which is a fiber making up tissues such as hair and fingernails. Other proteins can be thought of as tiny machines, which carry out most of the functions in our body. An enzyme is a special type of protein that catalyzes biochemical reactions, such as turning the sugar we eat into ATP (a form of energy available to our cells), or synthesizing other important molecules (most of what makes up our cells really). There are some 20,000 different types of proteins in the human body. This repertoire of little machines in our cells is the very thing that creates and maintains the miracle of life.

By analogy with DNA and RNA molecules, which consist of 4 types of nucleotides, proteins are made of 20 types of amino acids (neglecting a few other, very rare nucleotides and amino acids). While DNA is transcribed into RNA through a one-to-one mapping of the 4 nucleotides, an mRNA is translated into a protein sequence by a slightly more complicated three-to-one mapping, where each triplet of RNA nucleotides, known as a codon, is translated into a single amino acid. There are 64 (4 to the power of 3) unique triplets of RNA letters, but only 20 standard amino acids, meaning that the codon table, which determines this three-to-one mapping, is somewhat redundant: different codons can code for the same amino acid. For example, the UCA codon (made of a U RNA nucleotide followed by C followed by A) codes for the serine amino acid, abbreviated as either Ser (in three-letter code) or S (one-letter code). Five other codons, UCC, UCG, UCU, AGC and AGU, also code for serine.

The standard codon table (source: Wikipedia)
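
For readers who prefer code to tables, here is a minimal sketch of transcription and translation in Python. It follows the simplified picture above: transcription is modeled as just swapping T for U, and translation reads complete codons until a stop codon (marked `*`). The function names and the compact way of building the table are my own choices for illustration.

```python
from itertools import product

# Standard codon table, listed in the usual U, C, A, G order
# ('*' marks a stop codon, which ends translation).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Simplified transcription: same letters, with U instead of T."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate complete codons into one-letter amino acids until a stop codon."""
    protein = []
    for start in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# The six codons that code for serine (S).
serine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "S")
print(serine_codons)  # ['AGC', 'AGU', 'UCA', 'UCC', 'UCG', 'UCU']

# The example DNA string from earlier in the post (the trailing C is an incomplete codon).
print(translate(transcribe("ATGTTTGTTTTTC")))  # MFVF
```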

Like DNA and RNA, proteins can be thought of as 1-dimensional strings (of a 20-letter rather than 4-letter alphabet). But after they are synthesized, proteins tend to fold and form unique 3D structures. These 3D structures are determined by the 1D sequence of amino acids and unique chemical properties of each amino acid. The 3D structure of a protein then determines its function: how it binds other molecules (including other proteins), forms complexes, or carries out biochemical reactions in our body.

A cartoon of the 3D structure of a folded protein.

Each of the tens of trillions of cells in the human body contains an exact copy of the entire DNA¹, with exactly the same set of genes. So how come neurons and skin cells are so different? A rough answer is that, while all cells in the body have the same repertoire of genes, each cell expresses a unique subset of these genes. For example, only beta cells in the pancreas will produce insulin (unless one is diabetic). In other words, the relative amounts of mRNA molecules in a cell are very cell-type specific. How cells determine which genes they want to express is actually an extremely complex business, which we only partly understand. It involves regulatory sequences in the DNA, which give instructions about which genes should be expressed under which conditions. Some genes can activate or suppress the expression of other genes, forming complex gene regulatory networks.

Genomics & genetics

The picture I have just described, of how our genes encode the information of who we are, is more-or-less accurate, but quite simplistic, and ignores pretty much all the details. Truly understanding the inner workings of our genome, with all the genes and functional elements in it, is an amazingly broad domain of research called genomics. It’s a relatively new field that kicked off only a couple of decades ago, thanks to recent technological progress in molecular biology and nanotechnology (mainly the sequencing revolution, which has allowed reading DNA and RNA sequences at really impressive scales and analyzing those sequences with computers).

In this series, I’ll focus on a slightly different field, which overlaps a lot with genomics but is still quite distinct, called genetics. Unlike genomics, genetics has a very long history. The observation that people and other living things inherit and pass traits across generations has been around since ancient times. Gregor Mendel, through his famous series of pea plant experiments, formulated some of the basic laws of heredity, long before anyone knew about DNA. Needless to say, the genomic era has totally revolutionized our understanding of genetics and fundamentally changed the way we do genetic research. So the distinction between genomics and genetics is far from being clear-cut.

Genetic variants

Although your DNA is about 99% identical to that of the person next to you, the two of you are probably quite different in many ways. Many of these differences could be attributed to the ~1% of your genomes that are not perfectly identical².

A site on the genome that is different between individuals is called a variant. For example, the variant chr13:32337326A>G indicates an A nucleotide substituted into a G at position 32,337,326 on chromosome 13. In other words, some individuals will have an A letter written at this position of their genome, whereas others will have a G letter.
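
Since this notation is just a structured string, it is easy to take apart in code. Below is a small sketch of a parser for the chr:position, reference>alternative notation used in this post. The `Variant` class and `parse_variant` function are hypothetical names I made up for this example; real tools and file formats each have their own conventions.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    chrom: str  # chromosome name without the "chr" prefix, e.g. "13" or "X"
    pos: int    # 1-based position on the chromosome
    ref: str    # allele written before the ">"
    alt: str    # allele written after the ">"

# Pattern for strings such as "chr13:32337326A>G".
_VARIANT_PATTERN = re.compile(r"^chr([0-9]+|X|Y):([0-9]+)([ACGT]+)>([ACGT]+)$")

def parse_variant(text):
    """Parse the post's variant notation into a Variant record."""
    match = _VARIANT_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized variant notation: {text!r}")
    chrom, pos, ref, alt = match.groups()
    return Variant(chrom, int(pos), ref, alt)

print(parse_variant("chr13:32337326A>G"))
```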

A careful reader may wonder what position 32,337,326 on chromosome 13 means exactly, given that each of us has a somewhat different version of that chromosome (with a somewhat different DNA sequence and, probably, somewhat different length). As we all have the same chromosomes in our body, we can assign them consistent names, so referring to “chromosome 13” is well defined. But because the exact DNA sequence is slightly different between individuals (and even between different cells within the same individual), counting 32,337,326 nucleotides from the beginning of chromosome 13 will not bring us to any consistent location in the genome. To circumvent the problem, geneticists use a global, consistent reference DNA sequence called the human reference genome. The reference genome describes, in some loose sense, what a normal genome looks like, and all genetic variants are described with respect to that reference. An important thing to understand about the reference genome is that it’s an artificial construct, and in most cases doesn’t fully reflect the DNA sequence of any real person. But it’s close enough to the genomes of real people to be useful and help us create a shared language for describing genetic variation. There are in fact multiple versions of the human reference genome, and in order for a variant like chr13:32337326A>G to be really well-defined, you’d have to also specify what version of the reference genome it refers to. Nowadays, the most commonly used versions of the human reference genome are the older GRCh37 (also called hg19) from 2009 and the newer GRCh38 (also called hg38) from 2013. Improvements to the reference genome are made all the time, and sooner or later it’s very likely that a newer version will become more popular.

Each of a variant’s alternatives is called an allele. Geneticists will often refer to groups of individuals as, in the above example, carriers of the A or G allele. The vast majority of variants are biallelic, which means they have exactly two alleles: the reference allele (present in the reference genome) and the alternative allele³.

The word mutation is often used by geneticists quite interchangeably with “variant”⁴. The term “variant” is more likely to be used when referring to variation already present in the population, whereas “mutation” sometimes alludes to a relatively recent event that has just taken place (for example, a DNA replication error leading to some genetic variation in a child that doesn’t exist in their parents).

Variant types (and their consequences)

What types of variants are there? By far the most common is a substitution of a single nucleotide with another (like the example above of A being substituted by G). This type of variant is called a single-nucleotide variant (or SNV for short). When an SNV occurs in the coding region of a gene (in other words, when it substitutes a nucleotide within a codon – one of the nucleotide triplets coding for an amino acid we mentioned earlier), the most likely outcome is either a missense or synonymous variant. Missense variants lead to the substitution of one amino acid by another. For example, if the UCA codon changes into GCA, then instead of a serine (S) amino acid we’ll get alanine (A) (you may look at the codon table above and verify it). On the other hand, a synonymous mutation has no effect on the amino-acid sequence of a protein. For example, if the UCA codon was replaced by UCG then, since both codons code for serine, nothing would really happen to the gene at the protein level⁵.
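
The missense/synonymous distinction boils down to a lookup in the codon table, which the following sketch illustrates. The function name is hypothetical, and the third category (changes involving a stop codon) is something the paragraph above does not go into; I include it only so the function does not mislabel such cases.

```python
from itertools import product

# Standard codon table in U, C, A, G order ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if "*" in (ref_aa, alt_aa):
        return "stop codon gained or lost"
    return "missense"

print(classify_codon_change("UCA", "GCA"))  # missense (serine -> alanine)
print(classify_codon_change("UCA", "UCG"))  # synonymous (serine -> serine)
```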

SNVs are the most common type of genetic variation, but there are other common, more complex differences, which are quite important. Short insertions and deletions of DNA base pairs could have pretty dramatic consequences. Any short, non-SNV change of nucleotides can be thought of as some combination of insertions and deletions, and is often referred to as an indel (insertion/deletion). For example, chr17:41276074AT>CCC would be a replacement of AT by CCC at position 41276074 on chromosome 17. In a coding region, when the difference between the number of inserted and deleted nucleotides is a multiple of 3, the protein-level outcome is usually a replacement (including insertions and deletions) of a few amino acids. Such a variant would be called an inframe variant. Otherwise, the variant creates a shift in the way the gene is partitioned into codon triplets, which completely changes all following codons and destroys the protein product altogether. This is called a frameshift variant.
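
The inframe/frameshift rule is simple enough to write down as code. This is a minimal sketch with a made-up function name; it assumes the variant falls entirely within a coding region and looks only at allele lengths, ignoring everything else a real annotation tool would check.

```python
def coding_length_effect(ref, alt):
    """Classify a coding variant by how it changes the sequence length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    # Net number of nucleotides gained or lost by the variant.
    length_change = abs(len(alt) - len(ref))
    if length_change % 3 == 0:
        return "inframe"
    return "frameshift"

print(coding_length_effect("A", "G"))     # SNV
print(coding_length_effect("AT", "CCC"))  # frameshift (net change of 1 nucleotide)
print(coding_length_effect("A", "ACCC"))  # inframe (net change of 3 nucleotides)
```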

Indels tend to have more dramatic effects than SNVs. In the above example, the AT>CCC replacement leads to a damaging frameshift in a gene called BRCA1. And still, missense variants (which only change a single amino acid) are often perfectly sufficient for destroying the function of a protein (or sufficiently changing it) and having a serious effect on individuals. For example, missense mutations in the SOD1 gene (such as G37R, an alteration of the 37th amino acid in the protein product of the gene, which is normally a glycine (G), into an arginine (R)) could lead to the development of ALS (amyotrophic lateral sclerosis) – an incurable neurodegenerative disease leading to paralysis and, eventually, death. That’s the disease that afflicted and eventually killed the famous cosmologist Stephen Hawking. Such a devastating outcome all because of a single DNA letter swapped at the wrong place.

I have mostly focused on the effects of variants in coding regions (that make proteins), which account for only about 1% of our genome. Most genetic variation actually occurs within the non-coding part of the genome, and could also lead to important effects, mostly by altering the regulation of genes (which is a far more complicated topic I won’t get into right now). If we specifically focused on variants that noticeably affect human traits (most variants don’t), we would find out that the relative fraction of variants in coding regions is significantly more than 1% (in other words, genetic effects are overrepresented in the coding genome), but not anywhere near 50%. That means that most genetic effects come from the non-coding part of the genome.

Allele frequency

An important property of variants is allele frequency (AF) – the percentage of each allele in the population. Because most variants are biallelic, you can simply refer to the AF of the variant, which is just the AF of the alternative allele. Usually, the reference allele is the one most common in the population (with AF > 50%), but sometimes the alternative allele is actually more common. The most common allele is called the major allele, and the other alleles are called minor alleles. In the biallelic case, the frequency of the minor allele is called the minor allele frequency (MAF). The MAF of a biallelic variant is easily calculated from its AF by: MAF = min{AF, 1 – AF} ≤ 0.5.
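
The MAF formula for a biallelic variant translates directly into code. A minimal sketch, with a function name of my own choosing and AF given as a fraction between 0 and 1:

```python
def minor_allele_frequency(af):
    """MAF of a biallelic variant, given the alternative allele frequency (0 to 1)."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("Allele frequency must be between 0 and 1.")
    return min(af, 1.0 - af)

print(minor_allele_frequency(0.2))   # 0.2 (the alternative allele is the minor one)
print(minor_allele_frequency(0.75))  # 0.25 (the reference allele is the minor one)
```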

It is important to understand that AF and MAF are not fundamental properties of the variants; they necessarily refer to a specific population. Most variants are more frequent in specific ancestries, such as Europeans, Asians or Africans (each can be subdivided into subpopulations with their own allele frequencies).

MAF can be quite informative, for example for knowing whether there’s a real chance that a variant is pathogenic (leading to a disease). It’s extremely unlikely that a variant present in a significant portion of the population would be responsible for a severe disease in young people; evolution would quickly wipe it out. Geneticists often use the gnomAD database to get statistics on the frequencies of variants across different populations (for example suspicious variants found in a patient with some genetic disorder).

Allele frequencies of the variant chr2:25384113T>C (GRCh37) across the main populations represented in the gnomAD database.
https://gnomad.broadinstitute.org/variant/2-25384113-T-C?dataset=gnomad_r2_1

Genotype & phenotype

Genotype and phenotype are important concepts in genetics. Genotype refers to the genetics of an individual: what their sequences of DNA look like or, equivalently, what variants/alleles are present in their genome. Phenotype refers to some trait of an individual, such as eye color, height or Parkinson’s disease. Much of the focus of genetics (particularly statistical genetics) is to study how genotype affects phenotype.

For geneticists, “phenotype” and “trait” are pretty much interchangeable (I’ll use both). Some phenotypes, such as height, can be measured accurately and objectively. Others, such as agreeableness, much less so. Phenotypes can be measured on a continuous scale (e.g. grip strength) or binary (e.g. wearing glasses) or categorical (e.g. blood type). Often, a phenotype can be defined and measured in many reasonable ways. For example, smoking could legitimately be defined as either a binary trait (“do you smoke?”), continuous (“how many cigarettes do you smoke a day?”) or categorical (“Are you a heavy/light/non smoker?”).

When we refer to someone’s genotype, we could refer to either a specific variant (what alleles they have), a set of variants, or the whole genome – depending on the context. Because we have two copies of each chromosome (except for sex chromosomes in males, or rare genetic conditions), in order to fully specify the genotype of a variant we would need to keep track of both alleles. When the two alleles are identical, the genotype (at that variant) is homozygous, and when they are not it is heterozygous. A homozygous genotype can be further subdivided into homozygous reference when the present allele is the reference allele, or homozygous alternative if it’s an alternative allele. Because a homozygous reference indicates that an individual doesn’t have the variant (their genotype is perfectly identical to the reference genome), when geneticists say that an individual is homozygous for a variant, they probably mean homozygous alternative.
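
These definitions can be summarized in a few lines of code. The sketch below uses a hypothetical function name and assumes a site where the individual has two alleles (so it does not cover the X and Y chromosomes in males).

```python
def describe_genotype(allele_1, allele_2, ref_allele):
    """Name the genotype formed by two alleles at a single variant."""
    if allele_1 != allele_2:
        return "heterozygous"
    if allele_1 == ref_allele:
        return "homozygous reference"
    return "homozygous alternative"

print(describe_genotype("A", "A", ref_allele="A"))  # homozygous reference
print(describe_genotype("A", "G", ref_allele="A"))  # heterozygous
print(describe_genotype("G", "G", ref_allele="A"))  # homozygous alternative
```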

The word “genotype” can also be used as a verb, to refer to the act of figuring out someone’s genotype. I will not elaborate on the process of genotyping here, only say that nowadays we have pretty solid and affordable (but still imperfect) technologies and computational tools that allow us to determine what variants are present in someone’s genome.

Analyzing genetic data – a quick demonstration

So we’ve focused a lot on the theory of human genetics and introduced a great deal of terminology. You may find all of this somewhat abstract and difficult to follow (thank you for reading so far into the post!). If you are like me, you may wish for something more down to earth.

Let’s take a pause from introducing concepts and look at some actual data. I will demonstrate how genetic data can be read and analyzed with Python code.

After reading an early version of this chapter, a friend reminded me that some people don’t know or don’t like programming, and advised me to make it more friendly to non-programmers. Personally, I find it easier to imagine people who don’t like puppies, but I do see my friend’s point and acknowledge that not using Python is a legitimate life choice and that such people also matter and they have a right to read and enjoy blog posts about human genetics just like everyone else. So to make this more inclusive and enjoyable for everyone, I will include the code snippets for all of the analysis I’m going to show, but I won’t explain the code. Instead I will focus my explanations on the outputs. Readers interested in the full technical details may find them in this notebook.

Ok, ready? Let’s go.

For this demonstration I will use genetic data from The 1000 Genomes Project (now also called The International Genome Sample Resource (IGSR), maybe because the project exceeded 1,000 genomes quite a while ago and settled on 2,504 genomes, and the name The Two Thousand Five Hundred and Four Genome Project wasn’t as catchy). This public dataset contains the full genomes (i.e. all genotyped variants) of 2,504 individuals who donated their genetic data. While the donors are anonymous, some basic demographic details are provided, such as the sex and ethnicity/ancestry of each individual.

We start by downloading three files: chr17.bed, chr17.fam and chr17.bim. These files contain all the genotyped variants on chromosome 17 across the 2,504 IGSR samples (in the context of genetic research, a sample typically means a study participant).

mkdir -p ~/IGSR/plink_format/
wget ftp://ftp.cs.huji.ac.il/users/nadavb/demo/IGSR/plink_format/chr17.bed -P ~/IGSR/plink_format/
wget ftp://ftp.cs.huji.ac.il/users/nadavb/demo/IGSR/plink_format/chr17.fam -P ~/IGSR/plink_format/
wget ftp://ftp.cs.huji.ac.il/users/nadavb/demo/IGSR/plink_format/chr17.bim -P ~/IGSR/plink_format/

Let’s start by looking at what’s inside these files. The following code shows the contents of the chr17.bim file (the first table), the content of chr17.fam (the second table), and meta information about the content of chr17.bed (the third table).

python -m pip install pandas-plink

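import os
from IPython.display import display
from pandas_plink import read_plink

# read_plink takes the shared file prefix and reads chr17.bim, chr17.fam and chr17.bed
variants, samples, genotypes = read_plink(os.path.expanduser('~/IGSR/plink_format/chr17'))
display(variants)   # one row per variant (from chr17.bim)
display(samples)    # one row per sample (from chr17.fam)
display(genotypes)  # lazily loaded genotype matrix (from chr17.bed)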

Let’s go over these three outputs one by one.

The chr17.bim file contains information about the variants. There are a little over 2.3 million genotyped variants on chromosome 17. The first one is an SNV substituting C into A at position 52 of the chromosome; the last one is an insertion substituting A into AG at position 81,194,983 (with respect to the GRCh37 reference genome). In this particular file, the a1 column is actually the reference sequence, but this is often inconsistent in the BIM format (for some mysterious reason I don’t really understand).

As for the 2,504 samples (listed in the chr17.fam file), there isn’t anything very interesting going on there, so I will not dwell on it. The only field we are going to care about later on is the iid field, which provides a unique identifier for each sample. Notice that most of the columns are empty for this specific file (you shouldn’t conclude that all the samples are of the same gender).

As for the genotypes themselves (the content of the chr17.bed file), they are provided as essentially a giant matrix of 2,317,399 by 2,504 entries, where each entry is an integer value of either 0, 1 or 2. Each entry marks the number of alternative alleles of a given sample with respect to a given variant. Since loading 5,802,767,096 integers is somewhat excessive (and beyond the capacity of many computers), this huge matrix is not loaded into memory. Rather, specific rows and columns can be retrieved on demand. Let’s see some examples.

We’ll start by querying an arbitrary variant, say variant number 426,579.

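# Meta information of the variant at row 426,579 of chr17.bim
display(variants.iloc[426579])

# Load the genotypes of this one variant across all samples into memory
variant_genotypes = genotypes[426579, :].compute()
print(type(variant_genotypes), len(variant_genotypes), variant_genotypes)

# Count how many samples have each genotype value
import pandas as pd
display(pd.Series(variant_genotypes).value_counts().sort_index())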

The first output shows the meta information for this variant (a row in the chr17.bim file). We can see that it’s a substitution of G to A at position 13,159,006 on chromosome 17. The second output shows 6 of the 2,504 genotypes for this variant. We can see the genotypes of the first 3 (0, 1 and 0) and last 3 samples (1, 1 and 0). The third output shows the overall number of individuals (among the 2,504 samples) with each of the three possible genotypes. We can see that genotypes at this variant are pretty evenly spread: there are 755 samples who are homozygous for the reference allele (GG), 679 who are homozygous for the alternative allele (AA), and 1,070 samples who are heterozygous (AG).
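
These three counts are all we need to compute the allele frequency of this variant in the dataset. Here is a small sketch (the function name is my own) that counts alternative alleles: each heterozygous sample contributes one, each homozygous alternative sample contributes two, and every sample carries two alleles in total since chromosome 17 is an autosome.

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Alternative allele frequency from genotype counts at an autosomal variant."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    n_alt_alleles = n_het + 2 * n_hom_alt
    return n_alt_alleles / n_alleles

# Genotype counts reported above: 755 GG, 1,070 AG and 679 AA.
af = alt_allele_frequency(n_hom_ref=755, n_het=1070, n_hom_alt=679)
maf = min(af, 1 - af)
print(round(af, 3), round(maf, 3))  # 0.485 0.485
```

In other words, the A allele makes up roughly 48% of the 5,008 chromosome copies in this dataset, and since that is below 50% it is also the minor allele.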

Ok, that’s very nice. What else can we do with this data?

Let’s try to visualize the 2,504 individuals by their overall genetic similarity. To do that, we’ll choose 1,000 random variants and use PCA to plot the genotypes of the individuals across these 1,000 variants in 2 dimensions. If you are not familiar with PCA, it will be better explained in the next post. What you need to know for now is that the algorithm represents each individual as a point in 2D space such that points close to each other represent more similar individuals (with respect to the 1,000 random variants we have chosen).

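import numpy as np
np.random.seed(0)

# Pick 1,000 distinct variants at random (replace = False avoids drawing the same variant twice)
chosen_variant_indices = np.random.choice(np.arange(len(variants)), 1000, replace = False)
chosen_variant_indices.sort()

# Load only the chosen rows, then transpose so that rows are individuals and columns are variants
chosen_genotypes = genotypes[chosen_variant_indices, :].compute().transpose()
print('Extracted the genotypes of all %d individuals across %d variants.' % chosen_genotypes.shape)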

from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
pcs = PCA(n_components = 2).fit_transform(chosen_genotypes)
fig, ax = plt.subplots(figsize = (6, 6))
ax.scatter(pcs[:, 0], pcs[:, 1], s = 7)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')

If the 1,000 variants we have chosen are representative, then the distance between each two points should be indicative of the overall genetic similarity of the individuals represented by these points.

It would be interesting to color each individual by their ethnic identity. To do that, we’ll download another file that provides some demographic details (gender and population) over the IGSR individuals.

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

samples_meta = pd.read_csv('integrated_call_samples_v3.20130502.ALL.panel', sep = '\t')
display(samples_meta)

You can see that this file contains 2,504 rows, one for each sample, and specifies the “super population” (super_pop) of each sample (identified by the same sample code used in the iid field we saw earlier).

We’ll color the samples by one of the five super populations reported by IGSR: African (AFR), Mixed American (AMR), East Asian (EAS), European (EUR) and South Asian (SAS).

from matplotlib.patches import Patch

# Assign a distinct color to each super population
unique_pops = np.unique(samples_meta['super_pop'])
pop_to_color = {pop: plt.cm.Set1(i) for i, pop in enumerate(unique_pops)}

# Look up the super population of each sample by its ID, keeping the order of the genotype data
sample_pops = samples['iid'].map(samples_meta.set_index('sample')['super_pop'])
sample_pop_colors = sample_pops.map(pop_to_color)

# Same scatter plot as before, now colored by super population
fig, ax = plt.subplots(figsize = (6, 6))
ax.scatter(pcs[:, 0], pcs[:, 1], c = sample_pop_colors, s = 7)
ax.legend(handles = [Patch(color = pop_to_color[pop], label = pop) for pop in unique_pops])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')

Apart from the Mixed American (AMR) population, which is, well, mixed, we can see that individuals cluster quite nicely according to their ancestry backgrounds. Unsurprisingly, individuals who share more recent ancestry are more genetically similar.

Linking genotype and phenotype

As we’ve seen, genetics allows us to glimpse into our distant pasts and observe the heavy footprints of human ancestry and evolution. But if we hope to learn how our genetics really affects us, like which genes determine our eye color, it is not sufficient to gaze at a set of variants. To study genetic effects, the genetic data would have to be integrated with some phenotypic data and we would have to study the statistical connections between the two. We would have to recruit cohorts of individuals and study their genotypes and phenotypes, which is the main focus of statistical genetics.

Mendelian vs. complex traits

Before I briefly present the main methods used in statistical genetics, let’s first talk about the main modes of inheritance, that is, how genetics affects traits. Traits affected by genetics (which is pretty much any meaningful trait you can think of) can be divided into Mendelian and complex traits. Mendelian traits are fully determined by genetics according to “Mendel’s laws of inheritance”, which is a fancy way of saying that they are determined by specific alleles through dominant or recessive inheritance.

Chances are you’ve heard of “dominant” and “recessive”, but in case you haven’t, here goes. In dominant (Mendelian) inheritance, it is sufficient to have one effect-causing allele for the genetic effect to take place, whereas in recessive (Mendelian) inheritance two such alleles are needed (one on each copy of the relevant chromosome). For example, Huntington’s disease is dominant with respect to specific mutations in the HTT gene (on chromosome 4). A person with any disease-causing variant will almost inevitably get the disease, and will pass it to their children with 50% chance. Cystic fibrosis, on the other hand, is recessive. It is caused by mutations in the CFTR gene (on chromosome 7). One gets the disease only if both their parents carry disease-causing variants and pass them on.
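
The 50% figure falls out of simply enumerating which allele each parent passes on. The sketch below does that for a single autosomal variant. It is an idealized illustration: the function names are made up, "D" and "n" are my own labels for the disease-causing and normal alleles, and it assumes each parent passes either of their two alleles with equal probability and that the effect always shows when the required alleles are present.

```python
from fractions import Fraction
from itertools import product

def offspring_genotype_probs(parent_1, parent_2):
    """Probability of each child genotype, given two parental genotypes like 'Dn'."""
    probs = {}
    # Each of the 4 combinations (one allele from each parent) is equally likely.
    for allele_1, allele_2 in product(parent_1, parent_2):
        genotype = "".join(sorted(allele_1 + allele_2))
        probs[genotype] = probs.get(genotype, Fraction(0)) + Fraction(1, 4)
    return probs

def prob_affected(parent_1, parent_2, mode):
    """Probability that a child is affected under dominant or recessive inheritance."""
    required_copies = {"dominant": 1, "recessive": 2}[mode]
    probs = offspring_genotype_probs(parent_1, parent_2)
    return sum(
        (p for genotype, p in probs.items() if genotype.count("D") >= required_copies),
        Fraction(0),
    )

# Dominant: one parent carries a single disease-causing allele.
print(prob_affected("Dn", "nn", "dominant"))   # 1/2
# Recessive: both parents are carriers of a single disease-causing allele.
print(prob_affected("Dn", "Dn", "recessive"))  # 1/4
```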

If all traits were Mendelian, the lives of geneticists would be much easier, and I wouldn’t have any interesting analytical challenges to blog about. But as it turns out, most traits are non-Mendelian, also known as complex traits. In particular, most traits are very non-deterministic (from the perspective of genetics; if you are interested in philosophical discussions about determinism you will have to find them somewhere else). Not only are complex traits affected by a combination of numerous different variants all over our genome (most with very weak effects), they are also affected by the environment (including complex interactions between genetic and environmental effects).

The classic example of a complex trait is height. A person’s height is determined by thousands of genetic variants (the average contribution of most of which is measured in millimeters) as well as many factors from our early development (such as nutrition). Many of our physical characteristics (weight, build, strength) are complex traits. So are most psychological traits. Many diseases are complex too, including most of the chronic conditions prevalent in the industrialized world (like type 2 diabetes, Alzheimer’s, cardiovascular diseases and cancer).

The “non-deterministic” aspect of complex traits is most noticeable in binary complex traits, which is the straightforward way of thinking about most diseases (either you have it or you don’t). Let’s take cancer for example. Some people will have all the wrong variants and risk factors and still not get cancer, while others will get it for no apparent reason other than bad luck. Every time a cell divides in our body, there is a small chance for a tumor to start forming. If you have the wrong genetics, this probability can become much higher, but some margin will almost always remain.

Genetic studies

Having looked at how inheritance works, let’s go back to statistical genetics. What are some of the ways to study the genetics of human traits?

One approach is called genetic association studies. The basic idea is really straightforward:

1. Recruit a cohort of individuals with a phenotype that interests you.
2. Genotype these individuals.
3. Go over all the genotyped variants:
   For each variant, see if it’s correlated with the phenotype.
4. Report the significant variants you have found and go home happy.

Let’s say that we are interested in heart disease. We recruit 2,000 volunteers, 1,000 with the disease (called cases) and 1,000 without it (called controls). We then look at one variant with a C or a T allele, and observe that the C allele is more prevalent in cases than in controls (it’s present in 62% of the cases, compared to 49% of the controls).

To make sure it’s not just random imbalance, we can use standard statistical methods to test for statistical significance of the variant and get a p-value. If the p-value is below a certain threshold, we’ll declare that we have found a genetic variant associated with heart disease and publish a nice paper (ok, maybe it was sufficient for publishing a paper a decade ago; standards have since improved a little and nowadays most journals will probably ask for more evidence, as we’ll see later).
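To make the heart disease example concrete, here is a rough sketch of one such standard method: a Pearson chi-square test on a 2×2 table. This is just one of several reasonable tests, not necessarily the one a real study would use. I’m reading “present in 62% of the cases” as 620 of the 1,000 cases carrying the C allele (and 490 of the 1,000 controls); the function name is made up.

```python
import math


def carrier_association_test(case_carriers, n_cases, control_carriers, n_controls):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table.

    Rows are cases/controls, columns are carriers/non-carriers of the
    allele. Returns the chi-square statistic and its p-value.
    """
    a = case_carriers
    b = n_cases - case_carriers
    c = control_carriers
    d = n_controls - control_carriers
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square tail probability can be
    # written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = carrier_association_test(620, 1000, 490, 1000)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.1e}")
```

With these made-up numbers the p-value comes out far below 0.05, so a difference of 62% vs. 49% in a sample of this size is very unlikely to be a random imbalance.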

As we’ll talk about in future posts, genetic association studies come in all kinds of flavors. You can run them at a per-gene or per-variant level; you can decide in advance to focus on specific parts of the genome (not that popular anymore), or you can run them genome-wide. Most commonly, genetic studies go after the entire genome and test all the variants that could be genotyped, an approach known as a genome-wide association study, or GWAS for short (pronounced gee-was, with two syllables).

The results of GWAS are often displayed in Manhattan plots (which are named that way because they kind of look like Manhattan’s skyline):

What you see is essentially a scatter plot, with each dot representing a variant. The x axis marks the genomic location of the variants, by concatenating all 22 autosomal chromosomes one after the other (the sex chromosomes were not included in this particular study). The y axis shows the p-value that each variant obtained for its association with the phenotype, on a negative log scale (that is, -log10 of the p-value, so smaller p-values appear higher up). A common significance threshold used in GWAS is 5e-08 (i.e. 5×10^-8), called the genome-wide significance threshold. Every variant with a p-value lower than this threshold or, equivalently, above the upper dashed line drawn in the Manhattan plot (remember that the scale is negative log), is considered genome-wide significant. I will explain in a future post how people have come up with this specific number (for the more statistically inclined audience, think about it as sort of a Bonferroni correction).
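For a feel of the numbers, here is a small sketch of the Bonferroni-style reading of the threshold and of the negative log scale. The figure of one million independent tests is an assumption of mine for illustration (it is a commonly cited ballpark, not something established in this post), and the function names are made up.

```python
import math

ALPHA = 0.05                           # conventional single-test significance level
ASSUMED_INDEPENDENT_TESTS = 1_000_000  # assumption: ballpark number of independent tests

# Bonferroni-style correction: divide alpha by the number of tests.
GENOME_WIDE_THRESHOLD = ALPHA / ASSUMED_INDEPENDENT_TESTS


def manhattan_height(p_value):
    """Height of a variant on the y axis of a Manhattan plot."""
    return -math.log10(p_value)


def is_genome_wide_significant(p_value, threshold=GENOME_WIDE_THRESHOLD):
    """True if the p-value is below the genome-wide significance threshold."""
    return p_value < threshold


print(f"threshold = {GENOME_WIDE_THRESHOLD:.0e}")
print(f"dashed line at y = {manhattan_height(GENOME_WIDE_THRESHOLD):.2f}")
print(is_genome_wide_significant(1e-9), is_genome_wide_significant(1e-5))
```

Under that assumption, the threshold comes out at 5e-08, which sits at roughly 7.3 on the y axis.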

You may notice that peaks on the Manhattan plot tend to come in clusters. In other words, if a variant has a very low p-value, chances are that many other variants around it also have very low p-values. This is because of a phenomenon known as linkage disequilibrium (or LD). I will cover this topic too in more detail in the future, but in short, LD is the phenomenon that variants that are in proximity to one another in the genome (say a few base pairs away on the same chromosome) tend to be correlated, because they are often inherited together from the same parent. So for example, if we have one variant with A/C alleles, and another variant, say 100 base pairs away, with G/T alleles, then a person who has the A allele in the first variant will tend to have the G allele in the second variant, and a person with a C will tend to have a T (or vice versa). This explains why genetic associations tend to come in clusters. If a variant is correlated with a phenotype (because it affects it), then all the other variants that are correlated with that variant will also show correlation with the phenotype, even if they don’t really affect it (generally, if x is correlated with y and y is correlated with z, then x will tend to be correlated with z). There is a lot more to say about genetic associations and LD, but for now let’s leave it at that.
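One common way to put a number on LD between two variants is the squared correlation (usually written r²) between their alleles across chromosome copies. Below is a minimal sketch using the A/C and G/T example from above. The haplotype counts are invented, and the function name is hypothetical.

```python
def ld_r_squared(haplotypes, allele1, allele2):
    """Squared correlation (r^2) between two biallelic variants.

    `haplotypes` is a list of (first_variant_allele, second_variant_allele)
    pairs, one pair per chromosome copy. `allele1` and `allele2` pick the
    allele to track at each variant. Both variants must actually vary in
    the sample, otherwise the denominator is zero.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n
    p2 = sum(h[1] == allele2 for h in haplotypes) / n
    p12 = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n
    d = p12 - p1 * p2  # how much more often the alleles co-occur than by chance
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))


# Made-up sample of 100 chromosome copies: A mostly travels with G, C with T.
haplotypes = (
    [("A", "G")] * 45 + [("C", "T")] * 45 + [("A", "T")] * 5 + [("C", "G")] * 5
)
print(f"{ld_r_squared(haplotypes, 'A', 'G'):.2f}")  # 0.64
```

An r² of 0 would mean that knowing the allele at one variant tells you nothing about the other, while an r² of 1 would mean the two variants carry exactly the same information.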

GWAS and other association studies are probably the most common type of genetic study, but there are other ways to analyze the statistical connections between genetic variation and human traits. It’s recently become quite popular to train polygenic risk scores (PRS for short), which are algorithms for predicting a given trait from the genetics of an individual; for example, assessing people’s risk for developing colorectal cancer, or predicting their expected height. Training a PRS is a pretty standard machine-learning problem, often done with pretty standard algorithms (but there are some complications, as is always the case with genetics).

In GWAS we are interested in mapping out the variants that affect a trait individually (by calculating the statistical significance and effect size of each of the variants), whereas in PRS we are interested in the overall combined effect of all the variants in the genome. Another important distinction is that in GWAS we are usually interested in finding causal variants that actually affect the trait (which is more than just being correlated with it), while in PRS we usually just care about predictive power. Knowing someone’s probability of getting cancer can be very useful, whether or not the variants we use for assessing this probability are actually causal.
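To give a sense of what “combined effect” means in practice, here is a sketch of the simplest and very common form a PRS can take: a weighted sum of allele counts. This is only one possible form, not a description of any specific PRS method, and the variant names and weights below are invented.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts.

    `allele_counts` maps a variant ID to how many copies of the scored
    allele a person carries (0, 1 or 2). `weights` maps a variant ID to
    its per-allele weight. Variants missing from `allele_counts` are
    treated as 0 copies.
    """
    return sum(weight * allele_counts.get(variant, 0)
               for variant, weight in weights.items())


# Hypothetical weights; in practice they would be learned from data.
weights = {"variant_1": 0.30, "variant_2": -0.10, "variant_3": 0.05}
person = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

print(f"{polygenic_score(person, weights):.2f}")  # 0.50
```

Notice that nothing here requires the weighted variants to be causal; a variant that is merely correlated with a causal one (through LD) can contribute to the score just as well.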

Another type of genetic study that I will get to later in this series is heritability estimation. The goal in these studies is even more modest: we just want to know how heritable a given trait is. That is, to what extent it is the result of genetics (as opposed to other factors, including environmental effects).
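Heritability is commonly defined as the fraction of a trait’s variance across people that is attributable to genetic variance. The toy simulation below illustrates that definition. It assumes a purely additive trait (genetic value plus independent environmental noise) with invented effect sizes and allele frequencies. In a simulation we get to see each person’s genetic value directly; real studies cannot, which is exactly what makes heritability estimation a challenge.

```python
import random
import statistics


def allele_count(rng, freq=0.5):
    """Number of copies (0, 1 or 2) of an allele with the given frequency."""
    return (rng.random() < freq) + (rng.random() < freq)


def simulate_heritability(n_people=2000, n_variants=100, env_sd=5.0, seed=0):
    """Simulate an additive trait and return Var(genetic) / Var(phenotype)."""
    rng = random.Random(seed)
    # Hypothetical per-allele effect sizes, one per variant.
    effects = [rng.gauss(0, 1) for _ in range(n_variants)]
    genetic_values, phenotypes = [], []
    for _ in range(n_people):
        genetic = sum(effect * allele_count(rng) for effect in effects)
        environment = rng.gauss(0, env_sd)
        genetic_values.append(genetic)
        phenotypes.append(genetic + environment)
    return statistics.pvariance(genetic_values) / statistics.pvariance(phenotypes)


print(f"simulated heritability: {simulate_heritability():.2f}")
```

Increasing `env_sd` makes the environment account for more of the variance and pushes the ratio toward 0; decreasing it pushes the ratio toward 1.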

I think this will be a good place to stop. We have definitely covered a lot of ground (human genetics is a big topic). In the next post I will start delving into the more analytical aspects of human genetics, and start presenting some of the challenges in the field. We’ll take a look at the current state of the art in statistical genetics and how we might be able to push it a bit further. It will be interesting.


Footnotes

There are special types of cells, most notably red blood cells, which don’t contain a copy of the DNA. Also, due to occasional mutations, two cells in the body can end up with DNA which is not 100% identical (strictly speaking).
Many give 0.1% as the average genetic difference between individuals. Exact quantification is actually pretty complicated, and depends on how you define “difference”. See for example this discussion.
If one wants to, variants with more than two alleles can also be interpreted as a set of independent biallelic variants (each considering only one of the alternative alleles).
Another term commonly used interchangeably with “variant” is single-nucleotide polymorphism (SNP). Originally, as the name suggests, it meant pretty much the same thing as single-nucleotide variant (substitution of a single nucleotide by another), with the difference that many geneticists argue that for a variant to be counted as a SNP it has to be sufficiently common (1% or more). Informally, many scientists speak about SNPs (pronounced with a single syllable, snips) to refer to pretty much any variant. Personally, I find that using the word SNP tends to only create confusion, and I usually avoid it in writing. In particular, I find the introduction of a hard (1% or whatever) threshold in the very definition of the term to be awkward and inconvenient, especially given that allele frequency is population-specific (and sample-specific), so we end up with a really ill-defined concept.
It is in fact an interesting property of the codon table that most SNVs at the third nucleotide of a codon don’t change the amino-acid product.
