There is a striking degree of correlation between a student’s performance on the midterm and on the final. It is highly unusual for a student to score below average on the midterm and then near the top of the class on the final.
Statistics cannot prove anything with certainty. Instead, the power of statistical inference derives from observing some pattern or outcome and then using probability to determine the most likely explanation for that outcome.
Of course, the most likely explanation is not always the right explanation. Extremely rare things happen.
Suppose your classmate offers you a wager: He wins $1,000 if he rolls a six with a single die; you win $500 if he rolls anything else—a pretty good bet from your standpoint. He then proceeds to roll ten sixes in a row, taking $10,000 from you.
Still, we should appreciate that statistical inference uses data to address important questions.
Is a new drug effective in treating heart disease? Do cell phones cause brain cancer?
Statistics cannot answer these kinds of questions unequivocally; instead, inference tells us what is likely and what is unlikely.
Example:
Researchers cannot prove that a new drug is effective in treating heart disease, even when they have data from a carefully controlled clinical trial. After all, it is entirely possible that there will be random variation in the outcomes of patients in the treatment and control groups that is unrelated to the new drug. If 53 out of 100 patients taking the new heart disease medication showed marked improvement compared with 49 patients out of 100 receiving a placebo, we would not immediately conclude that the new medication is effective. This is an outcome that can easily be explained by chance variation between the two groups rather than by the new drug.
But suppose instead that 91 out of 100 patients receiving the new drug show marked improvement, compared with 49 out of 100 patients in the control group. It is still possible that this impressive result is unrelated to the new drug; the patients in the treatment group may be particularly lucky or resilient. But that is now a much less likely explanation.
In the formal language of statistical inference, researchers would likely conclude the following: (1) If the experimental drug has no effect, we would rarely see this amount of variation in outcomes between those who are receiving the drug and those who are taking the placebo. (2) It is therefore highly improbable that the drug has no positive effect. (3) The alternative—and more likely—explanation for the pattern of data observed is that the experimental drug has a positive effect.
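To make the arithmetic concrete, here is a minimal sketch in Python (my illustration, not the study's actual analysis) of a one-sided two-proportion z-test applied to the hypothetical trial numbers above; the function name and the rounded outputs are assumptions for illustration.

```python
# A minimal sketch: how likely is each observed gap if the drug does nothing?
from math import sqrt, erfc

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """One-sided p-value for H0: both groups share the same improvement rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)          # common rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error of the gap
    z = (p_a - p_b) / se
    return 0.5 * erfc(z / sqrt(2))  # P(Z >= z) for a standard normal

# 53/100 improve on the drug vs. 49/100 on the placebo: easily chance.
print(two_proportion_p_value(53, 100, 49, 100))  # ~0.29
# 91/100 vs. 49/100: wildly unlikely if the drug does nothing.
print(two_proportion_p_value(91, 100, 49, 100))  # ~5e-11
```

The first result says that chance alone produces a gap at least that large roughly 29 percent of the time; the second says essentially never, which is exactly the reasoning in steps (1) through (3).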
Hypothesis Testing
As noted above, statistics alone cannot prove anything; instead, we use statistical inference to accept or reject explanations on the basis of their relative likelihood. To be more precise, any statistical inference begins with an implicit or explicit null hypothesis.
If we reject the null hypothesis, then we typically accept some alternative hypothesis that is more consistent with the data observed.
For example, in a court of law the starting assumption, or null hypothesis, is that the defendant is innocent. The job of the prosecution is to persuade the judge or jury to reject that assumption and accept the alternative hypothesis, which is that the defendant is guilty.
As a matter of logic, the alternative hypothesis is a conclusion that must be true if we can reject the null hypothesis. Consider some examples.
Example:
Null hypothesis: This new experimental drug is no more effective at preventing malaria than a placebo.
Alternative hypothesis: This new experimental drug can help to prevent malaria.
The data: One group is randomly chosen to receive the new experimental drug, and a control group receives a placebo. At the end of some period of time, the group receiving the experimental drug has far fewer cases of malaria than the control group.
This would be an extremely unlikely outcome if the experimental drug had no medical impact. As a result, we reject the null hypothesis that the new drug has no impact (beyond that of a placebo), and we accept the logical alternative, which is our alternative hypothesis: This new experimental drug can help to prevent malaria.
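One way to see why such an outcome would be extremely unlikely under the null hypothesis is a permutation test. The sketch below uses made-up counts, since the text gives none (10 of 500 treated subjects infected versus 50 of 500 controls), and asks how often randomly re-dealing the same outcomes produces a gap at least as large as the one observed.

```python
# A hedged sketch of the "how unlikely is this?" logic, with invented counts.
import random

treated_cases, control_cases, n = 10, 50, 500   # hypothetical trial results
observed_gap = control_cases - treated_cases

# Pool all 1,000 outcomes: 1 = contracted malaria, 0 = did not.
outcomes = [1] * (treated_cases + control_cases) + \
           [0] * (2 * n - treated_cases - control_cases)

trials, extreme = 10_000, 0
for _ in range(trials):
    random.shuffle(outcomes)             # under H0, the group labels are arbitrary
    fake_treated = sum(outcomes[:n])     # cases landing in the "treatment" group
    fake_control = sum(outcomes[n:])
    if fake_control - fake_treated >= observed_gap:
        extreme += 1

print(extreme / trials)  # effectively 0: chance alone almost never does this
```

With these numbers the shuffled gap essentially never reaches the observed one, which is what licenses rejecting the null hypothesis.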
Example:
Null hypothesis: Substance abuse treatment for prisoners does not reduce their rearrest rate after leaving prison.
Alternative hypothesis: Substance abuse treatment for prisoners will make them less likely to be rearrested after they are released.
The (hypothetical) data: Prisoners were randomly assigned into two groups; the “treatment” group received substance abuse treatment and the control group did not.
At the end of five years, both groups have similar rearrest rates. In this case, we cannot reject the null hypothesis.
Researchers typically ask, If the null hypothesis is true, how likely is it that we would observe this pattern of data by chance?
One of the most common thresholds that researchers use for rejecting a null hypothesis is 5 percent, which is often written in decimal form: .05. This probability is known as a significance level, and it represents the upper bound for the likelihood of observing some pattern of data if the null hypothesis were true.
If the .05 significance level seems somewhat arbitrary, that's because it is. There is no single standardized statistical threshold for rejecting a null hypothesis. Both .01 and .1 are also reasonably common thresholds for the kind of analysis described above.
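In code, the choice of threshold is nothing more than a comparison. A trivial sketch (the p-value of .03 is invented for illustration) shows how the same result can clear one conventional threshold and miss another:

```python
# The same evidence, judged against three common significance levels.
p_value = 0.03  # invented for illustration, e.g. from a test like the one above

for alpha in (0.01, 0.05, 0.1):
    verdict = "reject" if p_value < alpha else "fail to reject"
    print(f"significance level {alpha}: {verdict} the null hypothesis")
# 0.01: fail to reject; 0.05: reject; 0.1: reject
```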
Example: When you read in the newspaper that people who eat twenty bran muffins a day have lower rates of colon cancer than people who don’t eat prodigious amounts of bran, the underlying academic research probably looked something like this:
In some large data set, researchers determined that individuals who ate at least twenty bran muffins a day had a lower incidence of colon cancer than individuals who did not report eating much bran.
The researchers’ null hypothesis was that eating bran muffins has no impact on colon cancer.
The disparity in colon cancer outcomes between those who ate lots of bran muffins and those who didn’t could not easily be explained by chance alone.
The academic paper probably contains a conclusion that says something along these lines: “We find a statistically significant association between daily consumption of twenty or more bran muffins and a reduced incidence of colon cancer. These results are significant at the .05 level.”
Example: An article in the Wall Street Journal in May of 2011 carried the headline “Link in Autism, Brain Size.” This is an important breakthrough, as the causes of autism spectrum disorder remain elusive. The first sentence of the Wall Street Journal story, which summarized a paper published in the Archives of General Psychiatry, reported, “Children with autism have larger brains than children without the disorder, and the growth appears to occur before age 2, according to a new study released on Monday.”
On the basis of brain imaging conducted on 59 children with autism and 38 children without autism, researchers at the University of North Carolina reported that children with autism have brains that are up to 10 percent larger than those of children of the same age without autism.
Here is the relevant medical question: Is there a physiological difference in the brains of young children who have autism spectrum disorder?
If so, this insight might lead to a better understanding of what causes the disorder and how it can be treated or prevented.
Here is the relevant statistical question: Can researchers make sweeping inferences about autism spectrum disorder in general that are based on a study of a seemingly small group of children with autism (59) and an even smaller control group (38)—a mere 97 subjects in all?
The answer is yes. The researchers concluded that the probability of observing the differences in total brain size that they found in their two samples would be a mere 2 in 1,000 (p = .002) if there is in fact no real difference in brain size between children with and without autism spectrum disorder in the overall population.
As a result, we can infer from our sample that 95 times out of 100 the interval 1310.4 ± 26 cubic centimeters (two standard errors) will contain the average brain volume for all children with autism spectrum disorder. This expression is called a confidence interval. We can say with 95 percent confidence that the range 1284.4 to 1336.4 cubic centimeters contains the average total brain volume for children in the general population with autism spectrum disorder.
Using the same methodology, we can say with 95 percent confidence that the interval of 1238.8 ± 36, or between 1202.8 and 1274.8 cubic centimeters, will include the average brain volume for children in the general population who do not have autism spectrum disorder.
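As a sketch of the arithmetic, both intervals are just the sample mean plus or minus two standard errors. The standard errors below (13 and 18 cubic centimeters) are back-calculated from the half-widths quoted in the text, not taken from the paper itself.

```python
# Mean ± 2 standard errors: the recipe behind both reported intervals.
def confidence_interval_95(sample_mean, standard_error):
    margin = 2 * standard_error  # ~1.96 rounded to 2, as in the text
    return sample_mean - margin, sample_mean + margin

print(confidence_interval_95(1310.4, 13.0))  # autism group: (1284.4, 1336.4)
print(confidence_interval_95(1238.8, 18.0))  # control group: (1202.8, 1274.8)
```

Note that the two intervals do not overlap, which is another way of seeing why the p-value for the difference is so small.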
For all the wonders of statistical inference, there are some significant pitfalls. They derive from the same logic as the opening example of the suspicious professor: the most likely explanation is not always the right one, and occasionally a student who scores below average on the midterm really does earn a top score on the final.
The powerful process of statistical inference is based on probability, not on some kind of cosmic certainty. As a result, we have a fundamental dilemma when it comes to any kind of hypothesis testing.
Claims that defy almost every law of science are by definition extraordinary and thus require extraordinary evidence. Neglecting to take this into account—as conventional social science analyses do—makes many findings look far more significant than they really are.
One answer to this kind of nonsense would appear to be a more rigorous threshold for defining statistical significance, such as .001.
Yet the more stringent we make the threshold for rejecting the null hypothesis, the more likely it is that we will fail to reject a null hypothesis that ought to be rejected.
Choosing an appropriate significance level involves an inherent trade-off.
If we adopt a .001 significance level in the clinical trials for all new cancer drugs, then we will indeed minimize the approval of ineffective drugs. (Wrongly rejecting the true null hypothesis that a drug is no more effective than a placebo is a Type I error, or false positive; the .001 threshold caps its probability at 1 in 1,000.) Yet now we introduce the risk of not approving many effective drugs because we have set the bar for approval so high. This is known as a Type II error, or false negative.
Which kind of error is worse? That depends on the circumstances. The most important point is that you recognize the trade-off. There is no statistical free lunch.
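A quick simulation makes the trade-off vivid. The sketch below invents a drug that truly works (60 percent of patients improve versus 50 percent on a placebo, with 100 patients per arm; these rates are assumptions, not real trial data) and counts how often it wins approval at each threshold.

```python
# Tightening alpha cuts false positives but also sinks genuinely good drugs.
import random
from math import sqrt, erfc

def p_value(a, b, n=100):
    """One-sided two-proportion z-test p-value for H0: equal improvement rates."""
    pooled = (a + b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.5 * erfc(((a - b) / n) / se / sqrt(2))

random.seed(1)  # reproducible illustration
for alpha in (0.05, 0.001):
    approvals = sum(
        p_value(sum(random.random() < 0.60 for _ in range(100)),   # drug arm
                sum(random.random() < 0.50 for _ in range(100)))   # placebo arm
        < alpha
        for _ in range(2_000)
    )
    print(f"alpha = {alpha}: effective drug approved in {approvals / 2000:.0%} of trials")
# Typically around 40% of simulated trials at .05, only a few percent at .001.
```

The stricter threshold almost never approves a useless drug, but it also rejects this genuinely effective one most of the time: that is the Type II side of the ledger.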
Spam filters: The null hypothesis is that any particular e-mail message is not spam. Your spam filter looks for clues that can be used to reject that null hypothesis for any particular e-mail, such as huge distribution lists or phrases like “penis enlargement.”
Screening for cancer: The null hypothesis for anyone undergoing this kind of screening is that no cancer is present. The screening is used to reject this null hypothesis if the results are suspicious.
Capturing terrorists: Neither a Type I nor a Type II error is acceptable in this situation. A false positive means detaining an innocent person; a false negative means letting a dangerous terrorist go free.
Statistical inference is not magic, nor is it infallible, but it is an extraordinary tool for making sense of the world. We can gain great insight into many of life's phenomena just by determining the most likely explanation. Most of us do this all the time. Statistical inference merely formalizes the process.