Thursday, August 28, 2014

Genomic cold fusion? Part I. Rational and irrational aspects of mapping

I’m sitting here on a smooth, quiet train from Zurich to Innsbruck, a few days after the mini-course that we taught in Helsinki. In this post I want to make a few reflections on things said by people reacting to Facebook or Twitter messages about the course, comments that were too short to do justice to what we actually said.

In particular, the issues have to do with the nature of genome mapping strategies and what they are or mean.  There seems to be a good bit of confusion in this area, perhaps because of a lack of proper explanation of what these methods do, and why and how they work.

First, nobody should be doing mapping, that is, looking for genes causally responsible for traits, unless they have some legitimate reason for believing that the trait is substantially affected by genes: that variation in the trait, or in the risk of a trait like a disease, is causally associated with variation in a particular spot in the genome.  Such a reason, at best, would be that the trait seems to segregate in families as if caused by a single Mendelian factor.  If the evidence is weaker than that, as it so often is, then mapping becomes all the more problematic.

If we don’t know the part of the genome that affects the trait, then we use many measured variable sites, called markers, that span the genome, with the idea that wherever the causal site is, it will be near one of our markers.  Essentially, that is, we are searching for statistically significant associations between a marker and the trait, judged by some basically subjectively chosen criterion, like a p-value cutoff, in samples that we believe are appropriate for detecting causal effects.
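To make the idea of a marker-trait association test concrete, here is a minimal sketch in Python of one common choice, a Pearson chi-square test on allele counts in cases and controls.  The function name and the counts are made up for illustration, and a real scan would repeat a test like this at every marker, which is part of why the significance cutoff matters so much.

```python
import math

def allelic_chi_square(case_g, case_t, control_g, control_t):
    """Pearson chi-square (1 degree of freedom) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are the G and T alleles at one marker.
    Returns (chi-square statistic, two-sided p-value).
    """
    n = case_g + case_t + control_g + control_t
    row_cases = case_g + case_t
    row_controls = control_g + control_t
    col_g = case_g + control_g
    col_t = case_t + control_t
    if 0 in (row_cases, row_controls, col_g, col_t):
        raise ValueError("each row and column needs at least one count")
    chi2 = n * (case_g * control_t - case_t * control_g) ** 2 / (
        row_cases * row_controls * col_g * col_t
    )
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability can be written with erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: G is somewhat more common on case chromosomes.
chi2, p = allelic_chi_square(case_g=120, case_t=80, control_g=90, control_t=110)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Whether a p-value like the one printed here counts as 'significant' depends entirely on the threshold one has chosen, and on how many markers were tested.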

What is perhaps not widely appreciated is the nearly essential way that such searches rely on evolutionary assumptions.  We say ‘nearly’ because if one happens by huge luck to genotype the causal site itself, the test for association may be a bit more direct, as we’ll try to explain.

Mapping is based on evolutionary history
Evolution, or population history, generates the variation that causes the trait effect, and the variation we use as markers.  Mutational events generating these variants occur when they occur, and we choose markers based on the idea that they vary in our chosen type of sample, and that the instances of a given marker allele (variant) are descendant copies of some original mutation.  These instances of the same allele are said to be identical by descent (IBD) from that common ancestral copy.  Sets of instances of the marker also mark nearby chromosomal regions that have been passed down the same chain of descent.  That shared region is called a haplotype, and it gradually shortens over the post-mutation generations by a process called recombination.

If at some later time in the history of the haplotype ‘tagged’ by the marker variant another mutation occurs in a gene and alters that gene’s effects to generate the trait we are interested in, then the marker variant will be present in subsequent descendant copies of that twice-hit haplotype, and the causal signal will be associated with the presence of the marker variant.  This is called linkage disequilibrium (LD), and is the reason that mapping works.  That is, mapping works because of shared evolutionary (population) history of the marker and causal variants.
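For readers who like to see numbers, LD between a marker and a causal site is usually summarized by D, the excess of the G-with-D haplotype over what independence would predict, and by r-squared, a version scaled to run from 0 to 1.  The sketch below computes both from haplotype counts.  The function name and the counts are invented for illustration, and it assumes we could observe haplotypes directly, which in practice usually requires phasing.

```python
def ld_stats(n_g_d, n_g_nod, n_t_d, n_t_nod):
    """Return (D, r_squared) for a two-allele marker and a causal site.

    Arguments are counts of the four haplotypes:
    G with D, G without D, T with D, T without D.
    """
    total = n_g_d + n_g_nod + n_t_d + n_t_nod
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_gd = n_g_d / total              # frequency of the G-D haplotype
    p_g = (n_g_d + n_g_nod) / total   # frequency of G at the marker
    p_d = (n_g_d + n_t_d) / total     # frequency of D at the causal site
    d = p_gd - p_g * p_d              # zero when the two sites are independent
    denom = p_g * (1 - p_g) * p_d * (1 - p_d)
    if denom == 0:
        raise ValueError("both sites must be variable in the sample")
    return d, d * d / denom

# D arose on a G haplotype and never recombined away,
# but only half of the G haplotypes carry it.
print(ld_stats(30, 30, 0, 40))

# Every G haplotype carries D and no T haplotype does.
print(ld_stats(60, 0, 0, 40))
```

In the first call the marker tags the causal variant only partly, even though no T haplotype carries D.  In the second the two sites are interchangeable, which is the best case for mapping.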

A hypothetical, simple example
[I’m continuing this post a couple of days from when I started it on the train to Innsbruck, and now finishing it in a nice hotel in Old Town, overlooking the Inn river.  Beautiful!]

Let’s say that we have a marker at which some people have a G nucleotide and others a T.  And let’s say the disease causal site, D, is near the G/T site, and that the D mutation, wherever it is on the chromosome, occurred on a copy of the chromosome that has the G at the marker site.  Then, what we hope is that the disease will be associated with the G: that enough more people with the disease will have the G than people without the disease.  This is the kind of association between trait-cause and marker that mapping is looking for.  But what can make it happen?

If we’re lucky everyone with the D allele at the causal site will have the trait (the ‘D’ mutation is fully penetrant, as we’d say).  And if there has been no recombination, and no other way to get the trait, then nobody with a T at the marker will also have the D variant—none of the T-bearers will have the disease.  Cases will have the G, controls the T.

This sort of perfect association depends on when the D-mutation, wherever it is on the chromosome, occurred relative to the mutation that produced the T at the marker.  We usually pick marker sites because we know that the variation (here, G vs T) is common in the population, and that means that the mutation is rather old.  Enough generations have passed for there to be a substantial fraction of T-bearing, and G-bearing people in the population.

If the ‘D’ mutation occurred right after this G-T marker’s mutation, then all copies of the G variant at the marker will also carry the D variant.  But if the trait mutation occurred much later, then only a few of the G-bearing chromosomes will carry the trait-causing D variant.  The association, even if true, will be weak.  And there’s a trap if the D-site is far from the G-T marker site: even if the D mutation occurred long enough ago for most G-bearers also to have the trait, there will have been enough time for recombination to switch the D-site onto a T-bearing marker chromosome.  The G-D association will no longer be perfect.
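The trade-off between age and distance can be put in rough numbers.  Under the textbook idealization of a large, randomly mating population, the LD between two sites shrinks by a factor of (1 - c) each generation, where c is the recombination fraction between them.  Here is a small sketch of that decay.  The function name and the example values of c are my own choices for illustration, and real populations, with drift, selection, and admixture, will depart from this simple curve.

```python
def ld_fraction_remaining(recomb_fraction, generations):
    """Fraction of the original LD (D) expected to remain after some generations.

    Uses the standard idealized result D_t = D_0 * (1 - c) ** t, which assumes
    random mating and ignores drift, selection, and new mutation.
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be between 0 and 0.5")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - recomb_fraction) ** generations

# A close marker (c = 0.001) versus a distant one (c = 0.01), over time.
for c in (0.001, 0.01):
    for t in (10, 100, 1000):
        kept = ld_fraction_remaining(c, t)
        print(f"c = {c}, generations = {t}: {kept:.4f} of the original D remains")
```

For the distant marker almost nothing is left after a thousand generations, while the close marker still keeps a fair share of its original association.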

Likewise, if there are many different causes of the trait, then some cases will not be due to the D-variant (tagged by the G-allele at the nearby marker), even if the latter really is also a cause.  We’ll have cases with the T-marker variant, and in this case it’s not because of recombination.  The more causes of the trait, the weaker the association between the trait and any specific marker, like the G-T one.
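A back-of-envelope sketch shows how quickly this dilution bites.  Assume, purely for illustration, that every case caused by D carries the G allele, that cases with other causes carry G at the same rate as everyone else, and that controls also carry G at that background rate.  The function name and the numbers are hypothetical.

```python
def g_carrier_excess_in_cases(fraction_due_to_d, background_g_carrier_freq):
    """Expected excess of G carriers among cases relative to controls.

    Simplifying assumptions (not from any real data set):
    - every case caused by D carries G (perfect LD, no recombination);
    - all other cases, and all controls, carry G at the background frequency.
    """
    f = fraction_due_to_d
    q = background_g_carrier_freq
    if not (0.0 <= f <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("both arguments must be proportions between 0 and 1")
    case_freq = f * 1.0 + (1.0 - f) * q   # mixture of D-caused and other cases
    return case_freq - q                  # simplifies to f * (1 - q)

# As D accounts for a smaller share of cases, the marker signal shrinks.
for f in (1.0, 0.5, 0.1, 0.01):
    excess = g_carrier_excess_in_cases(f, background_g_carrier_freq=0.5)
    print(f"D explains {f:.0%} of cases: excess of G carriers = {excess:.3f}")
```

The excess falls in direct proportion to the share of cases that D explains, so a variant behind one case in a hundred leaves a signal that only a very large sample could pick out.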

Science or cold fusion?
So mapping is a multiple-edged sword.  Now, there are several ways to try to find trait-associated parts of the genome.  One is called linkage mapping, another association mapping (genomewide association, or GWAS).  And one can also think that causal sites can be found not by relying on linkage disequilibrium, but simply by looking for causal variants directly.

These various strategies have their strong and weak points, and there is equally strong disagreement as to which to apply when.  That’s why someone can, sometimes sneeringly, claim that this or that approach is ‘cold fusion’: imaginary, something that won’t or can’t work.  But mapping for complex traits is not doing very well.  As we’ve posted many times (and many others have repeatedly observed), we are usually explaining only a rather small, if not trivial, fraction of causation by mapping.  So the issues are serious, regardless of the vested interests of those contending with them.

In our next post we’ll discuss some of these issues about methods.
