Friday, May 17, 2013

Of mice and men: Genes, environment, and whatever

Is the nature/nurture question finally solved? Are we who we are as a consequence of luck?  A paper in Science, Freund et al., "Emergence of Individuality in Genetically Identical Mice," assesses the effects of the nonshared environment on neural and behavioral development in 40 inbred female mice living in an enriched environment compared with genetically identical mice in a non-enriched environment.  Does the plasticity of the brain in response to environmental stimuli go a long way toward explaining who we are? 

Freund et al. write:
...the emergence of experience-based individual differences within groups of genetically identical animals exposed to the same enriched environment has rarely been addressed. We used a large group of animals and a particularly complex environment to capture the emergence of individual differences in brain and behavior over time. We used exploration as a marker of behavioral development, and adult neurogenesis in the hippocampus as a marker for continued brain development.
  This figure from the paper gives a schematic of the experimental set-up.

Experimental setup and effects on body and brain weight.
(A) Schematic illustration of the large enrichment enclosure housing 40 mice including RFID antenna positions (shown as red rings). Positions of levels, water sources, nesting boxes, and connecting tubes are drawn to scale. (Inset) Schematic illustration of animal tracking; an RFID passive integrated transponder (PIT) is implanted in mouse’s neck. The electromagnetic field issued by the antenna induces the PIT to emit the number identifying the animal. This information is then picked up by the antenna and stored into a database together with spatial and temporal annotations. (B) Experimental time line. (C) Body weight development: weights (in grams) of CTR (blue) and ENR (red) mice at the beginning and end of the experiment. (D) Brain weights at perfusion (in grams). The difference in variance between CTR and ENR missed conventional statistical significance at P = 0.057. Source: Freund et al., 2013, Science.
After 3 months, mice in the enriched environment were heavier, with more variability in brain and body size than the control mice (though there were only 8 control mice), and more neuronal connections were made in the hippocampus of the enriched adult mice than in the brains of the controls. "This finding supports the idea that the key function of adult neurogenesis is to shape hippocampal connectivity according to individual needs and thereby to improve adaptability over the life course and to provide evolutionary advantage."

While the authors point out a number of ways in which these mice may not in fact be genetically identical, primarily, they suggest, because of epigenetic changes due to such variables as position in the uterus, maternal disease, nutrition and interactions with the mother, maternal imprinting and so on (the 40 mice were randomly picked from different litters of the same strain of inbred mice to minimize these kinds of effects), they still consider them to be "identical."

But...wait a second!  Are they identical?
Much as we like the authors' general conclusion, because it moves away from the excessive level of imputed genetic determinism of our traits, we must add a word of caution.  These mice were actually not even genetically identical. Every time a cell divides, mutation is likely to happen. Based on some estimates from various kinds of data, that can be about 150 changes per cell division.

Such mutational variation accumulates from conception to an animal's sperm or egg production, and is thus transmitted across generations.  This is true, of course, even in inbred laboratory animals.  So even in a litter of 'identical' pup embryos, there is genetic variation.  But the picture is even more complex.

With every cell division during a mouse's (or your) lifetime, mutations occur.  If we assume that the rate is roughly as above, and even a mouse has millions if not billions (and you have billions if not a trillion or so) of cells, there is a lot of genetic variation within an organism.  Once such a somatic (body cell rather than germ cell) mutation occurs, then whenever that cell divides, its daughter cells inherit the change throughout the future life of the organism.  Thus, the earlier during embryological development that a somatic mutation occurs, the larger the tree of cellular descent--the more organs or the larger the part of a developing organ--that will inherit the change.
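To put rough numbers on that last point, here is a minimal sketch in Python. It assumes an idealized lineage in which every cell divides symmetrically and none die, which real development certainly does not do; the function name is ours, for illustration only.

```python
def fraction_inheriting(rounds_before_mutation: int) -> float:
    """Share of all later cells that descend from one mutated cell.

    Assumption (ours): an idealized binary lineage. After k rounds of
    division there are 2**k cells, and the mutation arises in exactly
    one of them, so its descendants make up 1 / 2**k of the body.
    """
    if rounds_before_mutation < 0:
        raise ValueError("rounds_before_mutation must be non-negative")
    return 1.0 / (2 ** rounds_before_mutation)


if __name__ == "__main__":
    # Earlier mutations (small k) mark a much larger share of the cells.
    for k in (1, 5, 10, 20):
        share = fraction_inheriting(k)
        print(f"mutation after {k:2d} rounds of division -> share {share:.7f}")
```

Under these assumptions a mutation in one of the first two cells marks half the body, while one arising twenty rounds later marks about one cell in a million.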

Now most of these by far will occur in unimportant areas of the genome--not in any actual genes at all, or in genes that aren't used in the tissues in which the mutation has occurred.  But this cannot be assumed as a general fact and one has to consider the nearly inevitable likelihood that whatever trait is being studied, the animals, even inbred lab animals, are not genetically identical in respect to it.

The challenge here is to identify the variation and figure out if it matters to the trait.  To do that one needs to do tissue-specific, if not detailed individual cell-specific genome sequencing, and even then one needs to identify gene expression at the cell level--and maybe (probably) at different times during development or environmental exposure changes--to attempt to identify those genomic elements whose variation might be involved.   This is essentially impossible as a general rule, and certainly only under some unusual circumstances would it even be worth undertaking.  And, of course, you'd kill the animal in the process, so its behavior would be (only) in the mind of the investigator!  Even to justify the cost and effort, without this minor mortal stumbling block, one would once again have to believe that slicing and dicing the genome, cell by cell, minute by minute, will explain complex traits.

Of mice and men....
Is there really such variation in inbred animals?  Well, we used ForSim, a forward evolutionary simulation program developed in my lab by me and Brian Lambert, to simulate a mouse experiment.  Two independent lines of mice were simulated, using many genes and a lot of DNA to get enough statistical stability in the results.  After generating a normal level of variation, as seen in wild animals and people, we simulated selection of some trait in the opposite direction (small trait values in one, large in the other strain), and then inbreeding for a large number of generations (around 200) with the population kept small, roughly as inbred lines are developed.  Then, we examined these simulated animals for sequence variation and we found a substantial amount of it: roughly as many different sites were varying among the animals as had been fixed by selection and inbreeding.  Yet in the usual mapping and experimental approaches such variation is assumed not to exist.  But it does.
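For readers who want a feel for why variation persists, here is a toy sketch in Python. It is emphatically not ForSim and not the simulation described above: it is a bare-bones haploid Wright-Fisher model with no selection, and the population size, number of generations, and mutation probability are arbitrary values we picked for illustration.

```python
import random


def simulate_inbred_line(pop_size=20, generations=200,
                         mutation_prob=0.1, seed=1):
    """Toy model of a small closed line; returns (segregating, fixed).

    Assumptions (ours): haploid genomes, random choice of one parent per
    offspring, every new mutation hits a brand-new site, no selection.
    """
    rng = random.Random(seed)
    population = [frozenset() for _ in range(pop_size)]
    next_site = 0
    for _generation in range(generations):
        offspring = []
        for _child in range(pop_size):
            genome = rng.choice(population)       # inherit from one parent
            if rng.random() < mutation_prob:      # occasional new mutation
                genome = genome | {next_site}
                next_site += 1
            offspring.append(genome)
        population = offspring
    all_sites = frozenset().union(*population)
    fixed = frozenset.intersection(*population)   # carried by everyone
    return len(all_sites - fixed), len(fixed)


if __name__ == "__main__":
    segregating, fixed = simulate_inbred_line()
    print("sites still varying:", segregating, "| sites fixed:", fixed)
```

The point of the sketch is only qualitative: as long as mutation keeps happening, a small closed line is never perfectly uniform, however many generations it has been inbred.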

Hey, are we against genes or for genes, after all??
Given our predilection for criticizing what we believe are excessive claims of genetic causation, one has to think carefully and avoid oversimplifying.  The argument cuts all ways.  We must therefore also raise similar cautionary questions about excessive dismissal of genomic effects!

Today, our point is that even here, where traits vary to a surprising extent even in putatively identical animals, it cannot really all be attributed to 'chance' (unless that includes mutations), nor to learning, nor environment.  The causal mix is inextricably complex under widespread if not most conditions.

It is for reasons such as those revealed by this study, which is otherwise a clear demonstration, that we write so often to try to temper the enthusiasm for genetically deterministic thinking, and still more for gene-based predictions of individuals' futures.  But genes do vary, and they vary subtly.  There is no one crystal ball, not even for mice!

Thursday, May 16, 2013

Breast cancer, and probabilities, in the news (again)

Angelina Jolie's New York Times editorial on her decision to have bilateral mastectomies when she tested positive for BRCA1 mutations associated with high risk of breast cancer brings up a number of issues.  We have previously commented on the problem of assessing competing risks, in the context of debates about screening, detection, and risk associated with breast cancer.  This is so common, and so serious, a problem that it naturally draws a lot of attention.   Just as naturally, it involves many sorts of probabilities: does a given test detect actual cancer?  Does every cancer need to be detected, or will some go away spontaneously?  And so on.

Under these kinds of conditions, the balance among costs, risks, important detection, treatment options and the like involves probabilities.  For example, Jolie writes that she was given an 87% probability of eventually having cancer.  Projecting risks--your net future in regard to this disease--is of vital interest, and because most of the probabilities involved are very inaccurately known, it is even a problem to know whether or to what extent to believe the probability estimates we already have. That is also why the same thing seems to require study over and over and over, without clear results.

In the end, for most women, and their physicians, whether they know it or not, they and their lives and health are highly dependent on the statistical aspects of studies to estimate and assess a wealth of probabilities.  In such cases, there are important, often wildly misunderstood or misapplied statistical approaches, and they often yield probabilities whose accuracy is not high or is even unknown.  Yet the point of statistical and probabilistic analysis is to make decisions about the state of the world.  If that is important, then how we interpret the results is important, but so, to a fundamental extent, is how we come to our results and interpretations in the first place.

This is so serious and widespread an issue in science that we posted a very fine 2-page primer on statistical design and the basic nature of probabilistic inference that was written for MT by our very knowledgeable colleague, Jim Wood, here in our Department.

And yet, the Jolie story shows that under some circumstances, it is unnecessary and perhaps even wrong to worry about the details.  She decided to undergo double mastectomy as a preventive against breast cancer.  That is, she decided that in a sense she already had breast cancer, in that it was a ticking time bomb in her genome.  Prevention being better than cure, she made that awesomely serious decision.

Jolie discovered that she carried one of the known variants in the BRCA1 gene that confer very high risk of breast (and ovarian) cancer.   She said she was told that the risk she'd get breast cancer was over 85%.  Now, there are actually major uncertainties about even this risk, as different cohorts of women (that is, born in different places or times) have very different risks as estimated by retrospective studies of women known to carry, compared to those known not to carry, such variants.  For some cohorts, the risk by age 60 or so has been estimated at only about half that in other cohorts.

Yet, here it doesn't matter and there is no need to worry about statistical finery or even the specific risk estimate.  Why?  Because under all the established risk scenarios, these mutations confer extremely high, potentially lethal risk.  It matters not at all, at least to most of us, whether a risk is 90% or 50%, if the risk is avoidable and the consequences dire.

Further, the BRCA1 gene function is basically known: it relates to corrections of DNA copying mistakes.  If miscopied DNA is not repaired, the risk is that in some breast cell a mutation will arise that leads the cell to be transformed into a cancer cell, that then proliferates and spreads.  So here we have not just estimates of very high risks, whatever they are, but also a mechanism.  And we have replicated findings in different populations.  So here, that these specific variants are truly risk factors themselves, rather than just being associated with some unmeasured factor, is pretty convincing--convincing enough to bet your life on it.

The nature of epidemiological risks
The cohort dependence of the BRCA1 risk for those with the clear-cut, well-studied variants, raises a very important point, one we've mentioned before.  Risks are expressed in terms of the likelihood of future events.  But how do we know what those are?  The answer is that we generally only know them from the past.  That is, we do retrospective studies, that compare those with or without exposure to a putative risk factor (here, a genetic variant) and see what happened to them.  We estimate how much more happened to those with, compared to those without, exposure to the risk factor.  But how can we predict future risk from such data?  The answer is basically an assumption that might be called uniformitarianism (a term related to the history of geology and that led Darwin to his insight about evolution): we assume that in all relevant ways, the future will be like the past.
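As a concrete illustration of that retrospective arithmetic, here is a minimal sketch in Python. The counts are invented for the example and are not taken from any BRCA1 study; the function names are ours.

```python
def risk(cases: int, total: int) -> float:
    """Observed proportion of a group that developed the disease."""
    if total <= 0:
        raise ValueError("total must be positive")
    return cases / total


def relative_risk(carrier_cases: int, carrier_total: int,
                  noncarrier_cases: int, noncarrier_total: int) -> float:
    """How many times more often the outcome occurred among carriers."""
    baseline = risk(noncarrier_cases, noncarrier_total)
    if baseline == 0:
        raise ValueError("no cases among non-carriers; ratio is undefined")
    return risk(carrier_cases, carrier_total) / baseline


if __name__ == "__main__":
    # Two made-up cohorts: same variant, different observed histories.
    cohorts = {
        "cohort A": (80, 100, 10, 100),
        "cohort B": (40, 100, 10, 100),
    }
    for name, counts in cohorts.items():
        print(name, "carrier risk:", risk(counts[0], counts[1]),
              "relative risk:", round(relative_risk(*counts), 2))
```

Both numbers describe what already happened to people in the past. Using either one as a forecast for someone living now is exactly the uniformitarian assumption at issue.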

That means that we assume that exposure to the same risk factors in the future will have the same effects as exposure did in the past (which we discovered from our sample of cases and controls, etc.).  But this assumes that we measured all the relevant factors in the past and, much more importantly, it assumes that people with the genetic risk factor will be exposed to the same other factors in the future to an extent that justifies our uniformitarian extrapolation.

However, even if that were the case, we do not understand the many factors well enough and cannot, even in principle, know what exposures will be like in the future.  We simply do not know what our lifestyles will be like.  So, we have no way to make accurate risk  predictions.  It is, like the parts of the universe beyond which light cannot get here for us to see, literally beyond our reach.

This is why BRCA1 variants are 'lucky', in that whatever happens in regard to future lifestyles, there is no known scenario in which these variants would not also seem to confer very high risk.  The same cannot be said of the vast majority of genetic risk factors that are known today.  For them, risk estimates such as those offered by various genome-testing companies, or promoted by NIH's drive for 'personalized genomic medicine', are misleading--to an unknown extent.

It needs to be pointed out in this context that there are hundreds of other mutational variants in the BRCA1 gene (and in a handful of others, one of which is BRCA2) that are so rare, and/or were found only in patients' tumor cells, that we really cannot legitimately attribute causation to them.  Indeed, if they are too rare (as most are) we cannot apply statistical tests to even estimate risk.  All we can do is assume that the gene is relevant and therefore that the variant we find is causal.  Those variants are listed in disease-gene databases as if they are causal, but that verges on simply being circular: assume a gene is causal and then conclude that the variant in that gene is therefore a cause.  That's bad reasoning.

Again, even here there are environmental (that is, non-genetic) risk factors that are not well established that may make an even larger proportional difference in risk, so that even if these various mutations are in fact causal in some way, that may depend entirely on the environmental context.  If so, that way is highly probabilistic and much farther from certain than the known variants in these genes. Or, some may be exceedingly dangerous, but not statistically demonstrable in the sense of probability that Jim Wood posted about in his excellent primer and discussion yesterday.

In fact, even here one can ask why the same BRCA1 mutations do not cause comparably elevated risks for any and all tissues in the body.  Such associations are generally low and not well established.  So here, the mechanism that seems to be known (DNA repair) should predict--should lead to a prior expectation of--high cancer risk in any tissue in which the gene's expressed.  In a standard 'Bayesian' analysis, the lack of strong effect in other tissues could actually undermine our confidence in our causal expectation for breast cancer itself.  Perhaps explanations for this exist, but we don't know of them.

Unfortunately, for most women there is no pre-smoking gun to guide what here are preventive decisions.  So while it's very unlucky to have inherited such a variant as Jolie did, in a strange sense she was very lucky.  At least she knew.  About 9-10 percent of women in developed countries (that is, where this has been studied) are at risk for breast cancer.  A close friend of ours has just had the same kind of operation as Jolie, but after a tumor was already found, and without carrying the known risk-variants.  Fortunately, although as we noted in our earlier post on this subject (link given above), the story is not entirely rosy, at least there are treatments that can be effective, even after the cancer has already occurred.

Everyone probably knows people who have been affected by breast cancer, and for many it's in their own families.  But unlike those with the clear-cut variants, they must face the kinds of exquisitely difficult decisions, based on very poorly understood, or inaccurate to unknown extent, competing probabilities.  For them, and for most of us in regard to the various disease time bombs silently ticking away inside us, the fine points of statistical analysis really are matters of life and death.  And they are fine points that nobody really understands.....and those who claim to are being misleading.

Wednesday, May 15, 2013

Let's Abandon Significance Tests

By his own admission, this is not only the first blog post Jim Wood has ever written, it's also the first one he's ever read. We're hoping not the last of either, given this debut. Jim is a biological anthropologist and demographer in the Penn State Dept of Anthropology, a man of many interests. His statistical knowledge is deep and of course informs his every academic endeavor, from endocrinology to infectious disease to household agriculture and beyond. We think every student should print this post and carry it around with them everywhere.  Well, anyone.
------------------------

Ronald Aylmer Fisher
   (1890-1962)


By Jim Wood

It’s time we killed off NHST.
 

NHST (also derisively called “the intro to stats method”) stands for Null Hypothesis Significance Testing, sometimes known as the Neyman-Pearson (N-P) approach after its inventors, Jerzy Neyman and Egon Pearson (son of the famous Karl). There is also an earlier, looser, slightly less vexing version called the Fisherian approach (after the even more famous R. A. Fisher), but most researchers seem to adopt the N-P form of NHST, at least implicitly – or rather some strange and logically incoherent hybrid of the two approaches. Whichever you prefer, they both have very weak philosophical credentials, and a growing number of statisticians, as well as working scientists who care about epistemology, are calling – loudly and frequently – for their abandonment. Nonetheless, whenever my colleagues and I submit a manuscript or grant proposal that says we’re not going to do significance tests – and for the following principled reasons – we always get at least one reviewer or editor telling us that we’re not doing real science. The demand by scientific journals for “significant” results has led over time to a substantial publication bias in favor of Type I errors, resulting in a literature that one statistician has called a “junkyard” of unwarranted conclusions (Longford, 2005).


Jerzy Neyman
(1894-1981)
Let me start this critique by taking the N-P framework on faith. We want to test some theoretical model. To do so, we need to translate it into a statistical hypothesis, even if the model doesn’t really lend itself to hypothesis formulation (as, I would argue, is often the case in population biology, including population genetics and demography). Typically, the hypothesis says that some predictor variable of theoretical interest (the so-called “independent” variable) has an effect on some outcome variable (the “dependent” variable) of equal interest. To test this proposition we posit a null hypothesis of no effect, to which our preferred hypothesis is an alternative – sometimes the alternative, but not necessarily. We want to test the null hypothesis against some data; more precisely, we want to compute the probability that the data (or data even less consistent with the null) could have been observed in a random sample of a given size if the null hypothesis were true. (Never mind whether anyone in his or her right mind would believe in the null hypothesis in the first place or, if pressed on the matter, would argue that it was worth testing on its own merits.)   

Egon Pearson
(1895-1980)
Now we presumably have in hand a batch of data from a simple random sample drawn from a comprehensive sample frame – i.e. from a completely-known and well-characterized population (these latter stipulations are important and I return to them below). Before we do the test, we need to make two decisions that are absolutely essential to the N-P approach. First, we need to preset a so-called α value for the largest probability of making a Type I error (rejecting the null when it’s true) that we’re willing to consider consistent with a rejection of the null. Although we can set α at any value we please, we almost inevitably choose 0.05 or 0.01 or (if we’re really cautious) 0.001 or (if we’re happy-go-lucky) maybe 0.10. Why one of these values? Because we have five fingers – or ten if we count both hands. It really doesn’t go any deeper than that. Let’s face it, we choose one of these values because other people would look at us funny if we didn’t. If we choose α = 0.10, a lot of them will look at us funny anyway.

Suppose, then, we set α = 0.05, the usual crowd-pleaser. The next decision we have to make is to set a β value for the largest probability of committing a Type II error (accepting the null when it’s not true) that we can tolerate. The quantity (1 – β) is known as the power of the test, conventionally interpreted as the likelihood of rejecting a false null given the size of our sample and our preselected value of α. (By the way, don’t worry if you neglect to preset β because, heck, almost no one else bothers to – so it must not matter, right?) Now make some assumptions about how the variables are distributed in the population, e.g. that they’re normal random variates, and you’re ready to go.

So we do our test and we get p = 0.06 for the predictor variable we’re interested in. Damn. According to the iron law of α = 0.05 as laid down by Neyman and Pearson, we must accept the null hypothesis and reject any alternative, including our beloved one – which basically means that this paper is not going to get published. Or suppose we happen to get p = 0.04. Ha! We beat α, we get to reject the null, and that allows us to claim that the data support the alternative, i.e. the hypothesis we liked in the first place. We have achieved statistical significance! Why? Because God loves 0.04 and hates 0.06, two numbers that might otherwise seem to be very nearly indistinguishable from each other. So let’s go ahead and write up a triumphant manuscript for publication. 
Significance is a useful means toward personal ends in the advance of science – status and widely distributed publications, a big laboratory, a staff of research assistants, a reduction in teaching load, a better salary, the finer wines of Bordeaux…. [S]tatistical significance is the way to achieve these. Design experiment. Then calculate statistical significance. Publish articles showing “significant” results. Enjoy promotion. (Ziliak and McCloskey, 2008: 32)
This account may sound like a crude and cynical caricature, but it’s not – it’s the Neyman-Pearson version of NHST. The Fisherian version gives you a bit more leeway in how to interpret p, but it was Fisher who first suggested p ≤ 0.05 as a universal standard of truth. Either way, it is the established and largely unquestioned basis for assessing hypothesized effects. A p of 0.05 (or whatever value of α you prefer) divides the universe into truth or falsehood. It is the main – usually the sole – criterion for evaluating scientific hypotheses throughout most of the biological and behavioral sciences.

What are we to make of this logic? First and most obviously, there is the strange practice of using a fixed, inflexible, and totally arbitrary α value such as 0.05 to answer any kind of interesting scientific question. To my mind, 0.051 and 0.049 (for example) are pretty much identical – at least I have no idea how to make sense of such a tiny difference in probabilities. And yet one value leads us to accept one version of reality and the other an entirely different one.

To quote Kempthorne (1971: 490):
To turn to the case of using accept-reject rules for the evaluation of data, … it seems clear that it is not possible to choose an α beforehand. To do so, for example, at 0.05 leads to all the doubts that most scientists feel. One is led to the untenable position that one’s conclusion is of one nature if a statistic t, say, is 2.30 and one of a radically different nature if t equals 2.31. No scientist will buy this unless he has been brainwashed and it is unfortunate that one has to accept as fact that many scientists have been brainwashed.

Think Kempthorne’s being hyperbolic in that last sentence? Nelson et al. (1986) did a survey of active researchers in psychology to ascertain their confidence in non-null hypotheses based on reported p values and discovered a sharp cliff effect (an abrupt change in confidence) at p = 0.05, despite the fact that p values change continuously across their whole range (a smaller cliff was found at p = 0.10). In response, Gigerenzer (2004: 590) lamented, “If psychologists are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”

But now suppose we’ve learned our lesson and so, chastened, we abandon our arbitrary threshold α value and look instead at the exact p value associated with our predictor variable, as many writers have advocated. And let’s suppose that it is impressively low, say p = 0.00073. We conclude, correctly, that if the null hypothesis were true (which we never really believed in the first place) then the data we actually obtained in our sample would have been pretty unlikely. So, following standard practice, we conclude that the probability that the null hypothesis is true is only 0.00073. Right? Wrong. We have confused the probability of the data if you are given the hypothesis, P(Data|H0), which is p, with its inverse probability P(H0|Data), the probability of the hypothesis if you are given the data, which is something else entirely. Ironically, we can compute the inverse probability from the original probability – but only if we adopt a Bayesian approach that allows for “subjective” probabilities. That approach says that you begin the study with some prior belief (expressed as a probability) in a given hypothesis, and adjust that in light of your new data.
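A small sketch in Python may make the inversion concrete. It assumes, purely for illustration, that there is a single alternative hypothesis and that we can attach a number to how probable the data are under each hypothesis; the prior and the value under the alternative are invented, and the function name is ours.

```python
def posterior_null(prior_null: float, p_data_given_null: float,
                   p_data_given_alt: float) -> float:
    """Bayes's theorem for two competing hypotheses, H0 and H1.

    P(H0|Data) = P(Data|H0) P(H0) / [P(Data|H0) P(H0) + P(Data|H1) P(H1)]
    """
    if not 0.0 <= prior_null <= 1.0:
        raise ValueError("prior_null must be a probability")
    numerator = p_data_given_null * prior_null
    evidence = numerator + p_data_given_alt * (1.0 - prior_null)
    if evidence == 0:
        raise ValueError("the data are impossible under both hypotheses")
    return numerator / evidence


if __name__ == "__main__":
    # Illustrative numbers only: an even prior, data that are improbable
    # under the null (0.00073) and not very probable under H1 either.
    print(posterior_null(prior_null=0.5,
                         p_data_given_null=0.00073,
                         p_data_given_alt=0.01))
```

With these made-up inputs the posterior probability of the null comes out far larger than 0.00073, which is the whole point: P(H0|Data) depends on the prior and on how well the alternative explains the data, not on P(Data|H0) alone.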

Alas, the whole NHST framework is by definition frequentist (that means it interprets your results as if you could do the same study countless times and your data are but one such realization) and does not permit the inversion of probabilities, which can only be done by invoking that pesky Bayes’s theorem that drives frequentists nuts. In the frequentist worldview, the null hypothesis is either true or false, period; it cannot have an intermediate probability assigned to it. Which, of course, means that 1 – P(H0|Data), the probability that the alternative hypothesis is correct, is also undefined. In other words, if we do NHST, we have no warrant to conclude that either the null or the alternative hypothesis is true or false, or even likely or unlikely for that matter. To quote Jacob Cohen (1994), “The earth is round (p < 0.05).” Think about it.

(And what if our preferred alternative hypothesis is not the one and only possible alternative hypothesis? Then even if we could disprove the null, it would tell us nothing about the support provided by the data for our particular pet hypothesis. It would only show that some alternative is the correct one.)

But all this is moot. The calculation of p values assumes that we have drawn a simple random sample (SRS) from a population whose members are known with exactitude (i.e. from a comprehensive and non-redundant sample frame). There are corrections for certain kinds of deviations from SRS such as stratified sampling and cluster sampling, but these still assume an equal-probability random sampling method. This condition is almost never met in real-world research, including, God knows, my own. It’s not even met in experimental research – especially experiments on humans, which by moral necessity involve self-selection. In addition, the conventional interpretation of p values assumes that random sampling error associated with a finite sample size is the only source of error in our analysis, thus ignoring measurement error, various kinds of selection bias, model-specification error, etc., which together may greatly outweigh pure sampling error.

And don’t even get me started on the multiple-test problem, which can lead to completely erroneous estimates of the attained “significance” level of the test we finally decide to go with. This problem can get completely out of hand if any amount of exploratory analysis has been done. (Does anyone keep careful track of the number of preliminary analyses that are run in the course of, say, model development? I don’t.) As a result, the p values dutifully cranked out by statistical software packages are, to put it bluntly, wrong.
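To see how quickly this gets out of hand, here is a minimal sketch in Python. It assumes the tests are independent and that every null hypothesis is true, which is a simplification (exploratory analyses on one data set are usually correlated); the function name is ours.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one 'significant' result when all nulls are true.

    Assumption (ours): the n_tests are independent, each run at level
    alpha, so the chance that none of them rejects is (1 - alpha)**n_tests.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for k in (1, 5, 14, 50):
        print(f"{k:3d} tests at alpha = 0.05 -> "
              f"P(at least one false positive) = "
              f"{familywise_error(0.05, k):.3f}")
```

Under these assumptions, fourteen looks at the data are already enough to make a spurious “finding” more likely than not.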

One final technical point: I mentioned above that almost no one sets a β value for their analysis, despite the fact that β determines how large a sample you’re going to need to meet your goal of rejecting the null hypothesis before you even go out and collect your data. Does it make any difference? Well, one survey calculated that the median (unreported) power of a large number of nonexperimental studies was about 0.48 (Maxwell, 2004). In other words, when it comes to accepting or rejecting the null hypothesis you might as well flip a coin.
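For readers who have never actually done the calculation, here is a hedged sketch in Python of how β, α, effect size, and sample size tie together. It assumes the simplest possible case, a two-sided one-sample z test with known unit variance; real designs need something more elaborate, and the function names are ours.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a two-sided one-sample z test.

    effect_size is the true mean shift in standard-deviation units.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n)
    # Probability that the test statistic lands in either rejection tail.
    return (1.0 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def smallest_n_for_power(effect_size: float, target_power: float = 0.80,
                         alpha: float = 0.05) -> int:
    """Smallest sample size whose power reaches target_power."""
    if effect_size == 0:
        raise ValueError("no sample size gives extra power for a zero effect")
    if not alpha < target_power < 1.0:
        raise ValueError("target_power must lie between alpha and 1")
    n = 1
    while z_test_power(effect_size, n, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    print("n needed for 80% power, effect 0.5 SD:", smallest_n_for_power(0.5))
    print("power with n = 16, effect 0.5 SD:", round(z_test_power(0.5, 16), 3))
```

The design choice to search upward from n = 1 is deliberately naive; it keeps the link between β and sample size visible instead of hiding it in a closed-form formula.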

And one final philosophical point: what do we really mean when we say that a finding is “statistically significant”? We mean an effect appears, according to some arbitrary standard such as p < 0.05, to exist. It does not tell us how big or how important the effect is. Statistical significance most emphatically is not the same as scientific or clinical significance. So why call it “significance” at all? With due deference to R. A. Fisher, who first came up with this odd and profoundly misleading bit of jargon, I suggest that the term “statistical significance” has been so corrupted by bad usage that it ought to be banished from the scientific literature.

In fact, I believe that NHST as a whole should be banished. At best, I regard an exact p value as providing nothing more than a loose indication of the uncertainty associated with a finite sample and a finite sample alone; it does not reveal God’s truth or substitute for thinking on the researcher’s part. I’m by no means alone (or original) in coming to this conclusion. More and more professional statisticians, as well as researchers in fields as diverse as demography, epidemiology, ecology, psychology, sociology, and so forth, are now making the same argument – and have been for some years (for just a few examples, see Oakes 1986; Rothman 1998; Hoem 2008; Cumming 2012; Fidler 2013; Kline 2013).

But if we abandon NHST, do we then have to stop doing anything other than purely descriptive statistics? After all, we were taught in intro stats that NHST is a virtual synonym for inferential statistics. But it’s not. This is not the place to discuss the alternatives to NHST in any detail (see the references provided a few sentences back), but it seems to me that instead of making a categorical yes/no decision about the existence of an effect (a rather metaphysical proposition), we should be more interested in estimating effect sizes and gauging their uncertainty through some form of interval estimation. We should also be fitting theoretically-interesting models and estimating their parameters, from which effect sizes can often be computed. And I have to admit, despite having been a diehard frequentist for the last several decades, I’m increasingly drawn to Bayesian analysis (for a crystal-clear introduction to which, see Kruschke, 2011). Thinking in terms of the posterior distribution, of support for a model provided by previous research as modified by new data seems a quite natural and intuitive way to capture how scientific knowledge actually accumulates. Anyway, the current literature is full of alternatives to NHST, and we should be exploring them.
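As one small example of what estimating an effect size with its uncertainty can look like, here is a sketch in Python. The two groups are invented numbers, the interval uses a normal approximation (too narrow for samples this small, where a t-based interval would be wider), and the function name is ours.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group_a, group_b, level: float = 0.95):
    """Difference in means with an approximate confidence interval.

    Returns (difference, lower, upper). Uses the unpooled standard error
    and a normal quantile, which is only an approximation.
    """
    if len(group_a) < 2 or len(group_b) < 2:
        raise ValueError("each group needs at least two observations")
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical measurements, made up for illustration.
    treated = [5.1, 4.9, 5.6, 5.8, 5.2]
    control = [4.4, 4.7, 4.1, 4.9, 4.5]
    diff, lower, upper = mean_difference_ci(treated, control)
    print(f"estimated effect: {diff:.2f} (approx. 95% interval "
          f"{lower:.2f} to {upper:.2f})")
```

The output is a size and a range, not a verdict; whether an effect of that magnitude matters is a scientific judgment that no threshold can make for us.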

By the way, the whole anti-NHST movement is relevant to the “Mermaid’s Tale” because most published biomedical and epidemiological “discoveries” (including what’s published in press releases) amount to nothing more than the blind acceptance of p values less than 0.05. I point to Anne Buchanan’s recent critical posting here about studies supposedly showing that sunlight significantly reduces blood pressure. At the p < 0.05 level, no doubt.



REFERENCES

Cohen, J. (1994) The earth is round (p < 0.05). American Psychologist 49: 997-1003.

Cumming, G. (2012) Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

Fidler, F. (2013) From Statistical Significance to Effect Estimation: Statistical Reform in Psychology, Medicine and Ecology. New York: Routledge.

Gigerenzer, G. (2004) Mindless statistics. Journal of Socio-Economics 33: 587-606.

Hoem, J. M. (2008) The reporting of statistical significance in scientific journals: A reflexion. Demographic Research 18: 437-42.

Kempthorne, O. (1971) Discussion comment in Godambe, V. P., and Sprott, D. A. (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart, and Winston.

Kline, R. B. (2013) Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington: American Psychological Association.

Kruschke, J. K. (2011) Doing Bayesian Data Analysis. Amsterdam: Elsevier.

Longford, N. T. (2005) Model selection and efficiency: Is “which model…?” the right question? Journal of the Royal Statistical Society (Series A) 168: 469-72.

Maxwell, S. E. (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods 9: 147-63.

Nelson, N., Rosenthal, R., and Rosnow, R. L. (1986) Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist 41: 1299-1301.

Oakes, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John Wiley and Sons.

Rothman, K. J. (1998) Writing for Epidemiology. Epidemiology 9: 333-37.

Ziliak, S., and McCloskey, D. N. (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.

Tuesday, May 14, 2013

Hot dogs are good for the heart after all!

Well, when it comes to causation of biological effects, you never know.  Now, the latest off the Big News ticker is that dogs are hot when it comes to the heart.  That is, having a dog or other pet reduces one's risk of having a coronary.  Hot diggety dog!  We had begun worrying whether the furry nuzzlers were going to be the next "studies show that..." story to dampen our lust for what makes life interesting. 


Fortunately, our fears were laid to rest by the NY Times story that reassures us that our leash on life is perhaps longer than we had thought.

Now this high-powered study funded by the American Heart Association (we hope the donors to the AHA don't object to the serious purpose to which their donations are used) provides comfort that will last at least until the next story hits the headlines; hopefully, that'll be more than just a day or two, but we can't hope it'll last a dog's lifetime.

How on earth are we to evaluate causation when every day some new major cause is identified?  Do we now need to add dog-ownership to smoking, alcohol consumption, frequency of sexual intercourse, marital status, red meat, eggs, male baldness, stomach flab thickness, and who knows what else, and--you didn't think we'd forget to include it!--our genometype to be able to predict to within a range of about 80% whether we'll have a heart attack or not?

Since every study concludes not with the phrase The End but "More research is needed, the lead investigator says", we assume that the AHA now plans to do the same study again not just for parakeets, tropical fish, hamsters, and kitty cats, but also for each dog breed, and for female and male (and spayed and neutered) dogs, and whether adopted as a pup from the kennel or as a stray from the pound.  That will mean a juicy menu of probably 1,000 separate studies (4 sexual conditions, 2 adoption conditions, and around 125 breeds), which is great for the research welfare system (not to mention the range of cat, bird, fish, etc. studies).

Unfortunately, another study just about to be published (we hear) will show that investigators will not live long enough to complete such studies, and that one epidemiology team is planning a study of whether doing such endless frustrating studies itself is yet another risk factor for heart attacks.  Hot diggety dog!

Oh, well, maybe some day someone will figure out a better way to investigate complex causation.

Monday, May 13, 2013

Sexual Harassment in the field (of anthropology)

Last week there was a lively discussion here on MT of the problem of sexual harassment that occurs in the field, where anthropologists do their work.  The 'field' is sometimes a laboratory, but the particular problem discussed related to the field 'out there' more remote from the university, often far from urban areas and importantly often in other countries.

The discussion concerned many aspects of how we know the extent and diversity of the problem, and that didn't get resolved, but the real problem now at hand is what to do about it....or, more cogently, that something should be done about it.

It is not obvious how to address such a topic in a way that gains real consensus and real compliance, rather than pro-forma agreement to bureaucratic documents that can be filed away to guard against lawsuits.  Acceptable sexual behavior is subtle and culturally variable, and not all people agree on what the rules should be, though there's no disagreement that assault, including rape, is unacceptable.

So, if people really care about this subject, as it clearly seems they should, then what is needed is to try to find some way to formulate policies and procedures that might actually work.  Discussions about how awful the problem is are fine, but realistically implementable ways to constrain action in unusual, hard-to-monitor settings are what need attention.  If the ongoing discussion since the Anthropology meetings has brought that attention to this issue, great.  It clearly needs to continue.

Friday, May 10, 2013

Good news for tanning salons - sunlight and health

New research lauding the benefits of sunlight, and not because of vitamin D, brings up a general question about reductive research.  The work was reported in Edinburgh at the International Investigative Dermatology 2013 meeting, and suggests that sunlight helps reduce blood pressure, which leads to lower risk of heart attack and stroke.

Researchers found that UV rays from the sun caused their study subjects to release a compound that has this positive effect on blood pressure.  That's nitric oxide, apparently, which is released into the circulation when sunlight touches the skin.  The researchers note that hypertension and cardiovascular disease rates rise in the winter, and propose it's because of reduced exposure to sunlight.

And, Medical News Today quotes the lead author:
Richard Weller, Senior Lecturer in Dermatology, and colleagues, say the effect is such that overall, sun exposure could improve health and even prolong life, because the benefits of reducing blood pressure, cutting heart attacks and strokes, far outweigh the risk of getting skin cancer.
It's of course notable that these results are being presented at a dermatology meeting, because dermatologists have been telling us for years to reduce our exposure to sunlight, a prime cause of skin cancer.

La promenade (1875) by Claude Monet
Whether these results are valid or not is not our interest here.  Indeed, hypertension is high among African Americans and Afro-Caribbeans, though the data are equivocal for Africa itself. 

But, ok, let's assume there's something to this.  Indeed, let's assume that even a tenth of what people say about vitamin D is true, and that exposure to sunlight is good for our health for multiple reasons.  And that wouldn't be surprising, given that we've lived most of our evolutionary history exposed to sunlight, and if it were as bad for us as dermatology says it is, we'd not have made it this far.

That aside, this brings up the question of competing effects.  Sun exposure is bad because it causes cancer.  No, it's good because we need the vitamin D, and now the nitric oxide.  Red wine in moderation is good for us because it lowers heart disease risk, but it's bad because it causes breast cancer.  Eating fish is good because of antioxidants, but bad because of mercury.  Brown rice is good because it's a source of fiber and vitamins, but bad because it's loaded with arsenic. 

The list could, and does, go on and on.  In large part it's a product of reductive science, looking at single factors and determining single outcomes, ignoring complexity and context.

And life is a balancing of costs and benefits -- exercise is good for us, but running wears out knees, and bicycling brings risk of accidents.  You might decide that the benefits outweigh the costs, but how informed is that decision, really?  How do you decide whether or not to lie in the sun?  You might not in fact be at risk of hypertension, so sun exposure isn't a great benefit to you in terms of lowering your risk of heart disease, and so the potential cost of skin cancer might be greater for you than the benefits.

But, how would you know?  These results are based on population data, measuring only a subset of factors that are actually involved in the complex interactions that result in hypertension or skin cancer or breast cancer or heart disease, and certainly not measuring your personal set of factors, exposures and risk.

In the end, we probably should make these lifestyle and dietary decisions based less on this week's data -- indeed, a lot of the relevant data we don't have and don't even know we should have -- than on how much we enjoy that glass of wine, or lying in the sun.

Thursday, May 9, 2013

Research Ethics in Anthropology: Problems in/with the field


Recently there has been a lot of bioanthropology buzz about sexual harassment and assault in “the field”, the diverse, global settings in which professional anthropologists of various types, and their students, do their research.  This comes on the heels of a brief presentation at the recent American Association of Physical Anthropologists (AAPA) 2013 meeting, in which a speaker presented results of a rather informal survey in which respondents reported various degrees of sexual harassment, ranging from relatively informal intimidation to true assault.  Regardless of the details, the results show that these issues are alive and well in Anthropology.

Of course lots of people have lots of opinions about such things, and sexual attractions and behaviors have many subtle nuances and misperceptions, as do other aspects of the ethics of working in field sites with a hierarchical authority structure, international collaborators and hosts, and so on.  But what was not part of the presentation, and what seems not really to have resulted so far, is some form of realistic call to action that might be implemented.  Perhaps the action will follow the shock and anger, but that isn’t a given.  And while this topic isn’t anything new (see the post and figure here or the blog post here), it does lead one to think about a few things in new ways.

For example, my first thought when hearing about the presentation at the AAPA meetings was: Well, this type of thing also happens right here in the ivory tower – so of course it happens out there (in “the field”), too.  Abuse (not just sexual in nature) occurs in households, churches, schools, neighborhoods; it happens anywhere that there are people who are vulnerable to being abused.  Having spent the last several years around Penn State, with our Sandusky child-abuse scandal, this is painfully obvious. 

Do these types of things occur more frequently in the sometimes isolated settings of “the field”?  I don’t really know, and the types of self-reporting web-based surveys being used aren’t going to tell us that with any easily knowable precision, at least not from a statistical standpoint.  However, I do a fair share of fieldwork and I do see how things can get strange rather quickly.  Some people behave more poorly than normal when they are away from authority figures.  Furthermore, different cultures have different norms and ascribe different meanings to different body language, and people who don’t belong to those cultures can quickly get lost.  Not everyone has, much less understands, the same rules of conduct.

Perhaps, however, the University is a place where these types of things happen more frequently than in other places.  Romantic relationships between people in a fiduciary relationship, such as boss and employee, are commonly barred in work settings outside of academia.  And with good reason, I would argue.  There are lots of reasons why the boss shouldn’t be sleeping with the employees, including favoritism, exploitative relationships, and the ever present potential for things to go too far in an unprofessional work place setting.  But these rules aren’t always present in academia.  Many universities do not have rules barring professors from sleeping with their students.  Even in places where there are rules, those rules are likely to be different with regard to graduate students.  And that is where I think things get really tricky.  Of course a graduate student could develop genuine romantic feelings for a professor and vice versa.  But with the HUGE power differential in place between student and professor, I think the chances for things to quickly go wrong, most likely with untoward consequences for the student, are too great.

Now, back to “the field”, and I don’t mean the field of Anthropology, I mean...  actually, what is meant by “the field”?  Is it only the place where I collect my data?  Can I do some analysis and even write a paper there too?  Is my field someone else’s home? (For a lot of anthropologists, and their local collaborators, students, or work-hands, yes, it is).  And if my field is someone else’s home, then what does that mean with regard to all of these issues?  I’ve seen enough Americans acting ridiculously while abroad to know that ethical issues go both ways. 

It has to be said as well that in Anthropology, maybe even more than other fields(?), there are those who feel that exerting sexual pressures is part of evolution, and hence is natural, as is sexual inequality and initiative, and that it is unrealistic to expect otherwise.  This explicitly Darwinian evolutionary view may mainly be a convenient rationale for the aggressor, but it is sometimes held unapologetically.  The point is that there is by no means a consensus about what is and isn't acceptable behavior, and that may be why there is a need for some forms of rules.  As with other areas such as respect for property, language, and dress codes, regardless of one's personal ideas about behavioral ethics, one simply would have to agree to the rules and code, and its sanctions, as a condition of access to field settings. 

I’ll close by suggesting that we should be active in fixing this, at least to the extent that we can in the context of a society (our own) in which sexual harassment and abuse are all too common.  I’m not the expert, but perhaps as a group, Anthropologists and other scientists can be.  I see answers to these issues as being broadly split into two main foci: one looking at victims (those who could be or already are) and one looking at the aggressors.  I am sure others can imagine additional ways to address the problem, but here at least is a start.

From a victim standpoint, we could:
1.)    Look at ways of reducing danger when possible.  My introduction to fieldwork was something akin to baptism by fire and I have a feeling it is the same for many others.  Perhaps training prior to actually doing fieldwork should be required (but by whom?).  Real or fictional stories could be presented to students about to begin fieldwork.  We can’t delve into blaming victims here, but many times you are the most reliable part of the equation.  It’s not always possible to control your surroundings and settings, but constantly being aware could at least sometimes help.
2.)   Make it relatively easy and painless for people who are being mistreated to report it.  This can be a real obstacle.  Sometimes groups of people, even prominent institutions, make it hard on victims to speak out.  Sometimes speaking out jeopardizes an entire project or at least the reporter's involvement in it. If we care about fixing this problem, we’ve got to make this step less intimidating, and ensure that victims are a priority.  Furthermore, institutional legal help should be geared toward helping victims seek retribution rather than covering legal butts.  (I know, I know – wishful thinking.  But while I’m making a wish list…)
From the aggressor standpoint:
1.)   Sometimes bad situations can arise simply out of poorly managed operations.  We should start with the assumption that people don’t want to be jerks, and perhaps give faculty who are leading field teams the training they need in order to avoid harmful situations for them and their students.  Perhaps a code of ethics that addresses these sorts of issues should be a mandatory aspect of NSF and NIH grant proposals.  Perhaps our universities and professional affiliations should take an active role too. 
2.)   We need ways to harshly punish perpetrators of abuse, and the teeth must be large, sharp and jagged.  Grants should be at stake, as should professional reputations, academic positions, tenure, and pay.
3.)  Furthermore, not all perpetrators are affiliates of our universities.  Some are residents of our “field” sites, etc.  It is perhaps even more difficult to find ways of policing their action, since they are likely to be under different laws and institutional rules.  However, I would argue that grant money, or more accurately, the threat of losing grant money, can go a long way.  
It’s true that these issues aren’t confined to Anthropology, or to field sites, but that doesn’t mean that the field shouldn’t recognize that there is a problem, and work on ways to address it.  You might think that students, being dependent on mentors and pretty much powerless in the academic hierarchy, are exactly the wrong people to address this.  But we’re also the ones with a lot to lose if we don’t.


----------------------------------
This post was contributed to by Dan, Anne and Ken.  Also, thanks to several others for thoughtful insights and conversations, including faculty in the Department of Anthropology (PSU) and Jessica Westin.