4. Correlation —How does Netflix know what movies I like?

Correlation measures the degree to which two phenomena are related to one another. For example, there is a correlation between summer temperatures and ice cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as the relationship between height and weight. Taller people weigh more (on average); shorter people weigh less. A correlation is negative if a positive change in one variable is associated with a negative change in the other, such as the relationship between exercise and weight.
The tricky thing about these kinds of associations is that not every observation fits the pattern. Sometimes short people weigh more than tall people. Sometimes people who don’t exercise are skinnier than people who exercise all the time. Still, there is a meaningful relationship between height and weight, and between exercise and weight.

library(HistData)

plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight)

cor(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight)

## [1] 0.3209499

The correlation coefficient has two fabulously attractive characteristics. First, for math reasons that have been relegated to the appendix, it is a single number ranging from –1 to 1. A correlation of 1, often described as perfect correlation, means that every change in one variable is associated with an equivalent change in the other variable in the same direction.
The closer the correlation is to 1 or –1, the stronger the association. A correlation of 0 (or close to it) means that the variables have no meaningful association with one another, such as the relationship between shoe size and SAT scores.
The second attractive feature of the correlation coefficient is that it has no units attached to it.

Example: The SAT Reasoning Test, formerly known as the Scholastic Aptitude Test, is a standardized exam made up of three sections: math, reading, and writing. Why is a four-hour test so important when college admissions officers have access to four years of high school grades?

The purpose of the test is to measure academic ability and predict college performance.

The correlation between high school grade point average and first-year college grade point average is .56.
The correlation between the SAT composite score (critical reading, math, and writing) and first-year college GPA is also .56.
In fact, the best predictor of all is a combination of SAT scores and high school GPA, which has a correlation of .64 with first-year college grades.

One crucial point in this general discussion is that correlation does not imply causation; a positive or negative association between two variables does not necessarily mean that a change in one of the variables is causing the change in the other.

Example: A likely positive correlation between a student’s SAT scores and the number of televisions that his family owns. This does not mean that overeager parents can boost their children’s test scores by buying an extra five televisions for the house. Nor does it likely mean that watching lots of television is good for academic achievement.

The most logical explanation for such a correlation would be that highly educated parents can afford a lot of televisions and tend to have children who test better than average. Both the televisions and the test scores are likely caused by a third variable, which is parental education.

5. Basic Probability and Expectation

Probability is the study of events and outcomes involving an element of uncertainty.

Investing in the stock market involves uncertainty.
So does flipping a coin, which may come up heads or tails.
Flipping a coin four times in a row involves additional layers of uncertainty, because each of the four flips can result in a head or a tail.
If you flip a coin four times in a row, I cannot know the outcome in advance with certainty (nor can you). Yet I can determine in advance that some outcomes (two heads, two tails) are more likely than others (four heads).
Many events have known probabilities.
Some events have probabilities that can be inferred on the basis of past data.

Example: The Australian Transport Safety Board published a report quantifying the fatality risks for different modes of transport. Despite widespread fear of flying, the risks associated with commercial air travel are tiny. Australia hasn’t had a commercial air fatality since the 1960s, so the fatality rate per 100 million kilometers traveled is essentially zero. The rate for drivers is .5 fatalities per 100 million kilometers traveled. The really impressive number is for motorcycles—if you aspire to be an organ donor. The fatality rate is thirty-five times higher for motorcycles than for cars.

Probability can also sometimes tell us after the fact what likely happened and what likely did not happen.

Example: Humans share similarities in their DNA, just as we share other similarities: shoe size, height, eye color. (More than 99 percent of all DNA is identical among all humans.) If researchers have access to only a small sample of DNA on which only a few loci can be tested, it’s possible that thousands or even millions of individuals may share that genetic fragment. Therefore, the more loci that can be tested, and the more natural genetic variation there is in each of those loci, the more certain the match becomes. Or, to put it a bit differently, the less likely it becomes that the DNA sample will match more than one person.

Often it is extremely valuable to know the likelihood of multiple events’ happening.

The probability of two independent events’ both happening is the product of their respective probabilities. In other words, the probability of Event A happening and Event B happening is the probability of Event A multiplied by the probability of Event B.
intuitive. If the probability of flipping heads with a fair coin is ½ , then the probability of flipping heads twice in a row is ½ × ½ , or ¼ . The probability of flipping three heads in a row is ⅛ , the probability of four heads in a row is 1/16, and so on.
This explains why the system administrator at your school or office is constantly on your case to improve the “quality” of your password. If you have a six-digit password using only numerical digits, we can calculate the number of possible passwords: 10 × 10 × 10 × 10 × 10 × 10, which equals 10 6 , or 1,000,000. That sounds like a lot of possibilities, but a computer could blow through all 1,000,000 possible combinations in a fraction of a second.
There is one crucial distinction here. This formula is applicable only if the events are independent, meaning that the outcome of one has no effect on the outcome of another.
Suppose you are interested in the probability that one event happens or another event happens: outcome A or outcome B (again assuming that they are independent). In this case, the probability of getting A or B consists of the sum of their individual probabilities: the probability of A plus the probability of B.

Expecation

Probability also enables us to calculate what might be the most useful tool in all of managerial decision making, particularly finance: expected value.
clearer. Suppose you are invited to play a game in which you roll a single die. The payoff to this game is $1 if you roll a 1; $2 if you roll a 2; $3 if you roll a 3; and so on. What is the expected value for a single roll of the die? Each possible outcome has a image (Image) probability, so the expected value is:

1/6($1) + 1/6 ($2) + 1/6 ($3) + 1/6 ($4) + 1/6 ($5) + 1/6 ($6) =21/6 , or $3.50.

Suppose you have the chance to play the above game for $3 a throw. Does it make sense to play? Yes, because the expected value of the outcome ($3.50) is higher than the cost of playing ($3.00).
The same basic analysis can illustrate why you should never buy a lottery ticket. In Illinois, the probabilities associated with the various possible payoffs for the game are printed on the back of each ticket. I purchased a $1 instant ticket. (Note to self: Is this tax deductible?) On the back—in tiny, tiny print—are the chances of winning different cash prizes, or a free new ticket: 1 in 10 (free ticket); 1 in 15 ($2); 1 in 42.86 ($4); 1 in 75 ($5); and so on up to the 1 in 40,000 chance of winning $1,000. I calculated the expected payout for my instant ticket by adding up each possible cash prize weighted by its probability. * It turns out that my $1 lottery ticket has an expected payout of roughly $.56, making it an absolutely miserable way to spend $1.
The law of large numbers explains why casinos always make money in the long run. The probabilities associated with all casino games favor the house (assuming that the casino can successfully prevent blackjack players from counting cards). If enough bets are wagered over a long enough time, the casino will be certain to win more than it loses.
Expected value can also help us untangle complex decisions that involve many contingencies at different points in time.

Example: The same basic process can be used to explain a seemingly counterintuitive phenomenon. Sometimes it does not make sense to screen the entire population for a rare but serious disease, such as HIV/AIDS. Suppose we can test for some rare disease with a high degree of accuracy. For the sake of example, let’s assume the disease affects 1 of every 100,000 adults and the test is 99.9999 percent accurate. The test never generates a false negative (meaning that it never misses someone who has the disease); however, roughly 1 in 10,000 tests conducted on a healthy person will generate a false positive, meaning that the person tests positive but does not actually have the disease. The striking outcome here is that despite the impressive accuracy of the test, most of the people who test positive will not have the disease. . This will generate enormous anxiety among those who falsely test positive; it can also waste finite health care resources on follow-up tests and treatment.

Only 1,750 adults have the disease. They all test positive. Over 174 million adults do not have the disease. Of this healthy group who are tested, 99.9999 get the correct result that they do not have the disease. Only .0001 get a false positive.
But .0001 of 174 million is still a big number. In fact, 17,500 people will, on average, get false positives.
Let’s look at what that means. A total of 19,250 people are notified that they have the disease; only 9 percent of them are actually sick!

Sometimes probability helps us by flagging suspicious patterns.

Example: Security. The Securities and Exchange Commission (SEC), the government agency responsible for enforcing federal laws related to securities trading, uses a similar methodology for catching inside traders. (Inside trading involves illegally using private information, such as a law firm’s knowledge of an impending corporate acquisition, to trade stock or other securities in the affected companies.) The SEC uses powerful computers to scrutinize hundreds of millions of stock trades and look for suspicious activity, such as a big purchase of shares in a company just before a takeover is announced, or the dumping of shares just before a company announces disappointing earnings. The SEC will also investigate investment managers with unusually high returns over long periods of time. (Both economic theory and historical data suggest that it is extremely difficult for a single investor to get above-average returns year after year.)

Probability is not deterministic. No, you shouldn’t buy a lottery ticket—but you still might win money if you do. And yes, probability can help us catch cheaters and criminals—but when used inappropriately it can also send innocent people to jail.

6. The Monty Hall Problem

The “Monty Hall problem” is a famous probability-related conundrum faced by participants on the game show Let’s Make a Deal , which premiered in the United States in 1963 and is still running in some markets around the world.
At the end of each day’s show a contestant was invited to stand with host Monty Hall facing three big doors: Door no. 1, Door no. 2, and Door no. 3. Monty explained to the contestant that there was a highly desirable prize behind one of the doors and a goat behind the other two doors. The player chose one of the three doors and would get as a prize whatever was behind it.
After the contestant chose a door, Monty would open one of the two doors that the contestant had not picked , always revealing a goat. At that point, Monty would ask the contestant if he would like to change his pick—to switch from the closed door that he had picked originally to the other remaining closed door.
Should the contestant switch? Yes
This answer seems entirely unintuitive at first. It would appear that the contestant has a one-third chance of winning no matter what he does. There are three closed doors.
The answer lies in the fact that Monty Hall knows what is behind each door. If the contestant picks Door no. 1 and there is a car behind it, then Monty can open either no. 2 or no. 3 to display a goat.
If the contestant picks Door no. 1 and the car is behind no. 2, then Monty opens no. 3.
If the contestant picks Door no. 1 and the car is behind no. 3, then Monty opens no. 2.
By switching after a door is opened, the contestant gets the benefit of choosing two doors rather than one.
Assume that Monty Hall offers you a choice from among 100 doors rather than just three. After you pick your door, say, no. 47, he opens 98 other doors with goats behind them. Now there are only two doors that remain closed, no. 47 (your original choice) and one other, say, no. 61. Should you switch? Of course you should. There is a 99 percent chance that the car was behind one of the doors that you did not originally choose.

7. Problems with Probability—How overconfident math geeks nearly destroyed the global financial system

Statistics cannot be any smarter than the people who use them. And in some cases, they can make smart people do dumb things. One of the most irresponsible uses of statistics in recent memory involved the mechanism for gauging risk on Wall Street prior to the 2008 financial crisis.
At that time, firms throughout the financial industry used a common barometer of risk, the Value at Risk model, or VaR. In theory, VaR combined the elegance of an indicator (collapsing lots of information into a single number) with the power of probability (attaching an expected gain or loss to each of the firm’s assets or trading positions).
Prior to the financial crisis of 2008, firms trusted the VaR model to quantify their overall risk. The formula even took into account the correlations among different positions. For example, if two investments had expected returns that were negatively correlated, a loss in one would likely have been offset by a gain in the other, making the two investments together less risky than either one separately.
Then, even better, the aggregate risk for the firm could be calculated at any point in time by taking the same basic process one step further. The underlying mathematical mechanics are obviously fabulously complicated, as firms had a dizzying array of investments in different currencies, with different amounts of leverage (the amount of money that was borrowed to make the investment), trading in markets with different degrees of liquidity, and so on.
The primary critique of VaR is that the underlying risks associated with financial markets are not as predictable as a coin flip.
The false precision embedded in the models created a false sense of security. The VaR was like a faulty speedometer, which is arguably worse than no speedometer at all. If you place too much faith in the broken speedometer, you will be oblivious to other signs that your speed is unsafe. In contrast, if there is no speedometer at all, you have no choice but to look around for clues as to how fast you are really going.
Unfortunately, there were two huge problems with the risk profiles encapsulated by the VaR models. First, the underlying probabilities on which the models were built were based on past market movements; however, in financial markets, the future does not necessarily look like the past.
Second, even if the underlying data could accurately predict future risk, the 99 percent assurance offered by the VaR model was dangerously useless, because it’s the 1 percent that is going to really mess you up. In fact, the models had nothing to say about how bad that 1 percent scenario might turn out to be. Very little attention was devoted to the “tail risk,” the small risk (named for the tail of the distribution) of some catastrophic outcome.

Probability offers a powerful and useful set of tools—many of which can be employed correctly to understand the world or incorrectly to wreak havoc on it.

Assuming events are independent when they are not.

Now that you are armed with this powerful knowledge, let’s assume that you have been promoted to head of risk management at a major airline. Your assistant informs you that the probability of a jet engine’s failing for any reason during a transatlantic flight is 1 in 100,000. Given the number of transatlantic flights, this is not an acceptable risk. Fortunately each jet making such a trip has at least two engines. Your assistant has calculated that the risk of both engines’ shutting down over the Atlantic must be $(1/100,000)^2$ , or 1 in 10 billion, which is a reasonable safety risk.

The two engine failures are not independent events. If a plane flies through a flock of geese while taking off, both engines are likely to be compromised in a similar way. The same would be true of many other factors that affect the performance of a jet engine, from weather to improper maintenance. If one engine fails, the probability that the second engine fails is going to be significantly higher than 1 in 100,000.

Not understanding when events ARE independent.

If you flip a fair coin 1,000,000 times and get 1,000,000 heads in a row, the probability of getting tails on the next flip is still ½ . The very definition of statistical independence between two events is that the outcome of one has no effect on the outcome of the other. Even if you don’t find the statistics persuasive, you might ask yourself about the physics: How can flipping a series of tails in a row make it more likely that the coin will turn up heads on the next flip ? “the gambler’s fallcy”.

Cluster happen

You’ve probably read the story in the newspaper, or perhaps seen the news exposé: Some statistically unlikely number of people in a particular area have contracted a rare form of cancer. It must be the water, or the local power plant, or the cell phone tower. Of course, any one of those things might really be causing adverse health outcomes. (Later chapters will explore how statistics can identify such causal relationships.) But this cluster of cases may also be the product of pure chance, even when the number of cases appears highly improbable.

Yes, the probability that five people in the same school or church or workplace will contract the same rare form of leukemia may be one in a million, but there are millions of schools and churches and workplaces. It’s not highly improbable that five people might get the same rare form of leukemia in one of those places.

Reversin to the mean(or regression to the mean

The same phenomenon can explain why students who do much better than they normally do on some kind of test will, on average, do slightly worse on a retest, and students who have done worse than usual will tend to do slightly better when retested.

Statistical discriminaton

When is it okay to act on the basis of what probability tells us is likely to happen, and when is it not okay?

In 2003, Anna Diamantopoulou, the European commissioner for employment and social affairs, proposed a directive declaring that insurance companies may not charge different rates to men and women, because it violates the European Union’s principle of equal treatment. 8 To insurers, however, gender-based premiums aren’t discrimination; they’re just statistics.

Men typically pay more for auto insurance because they crash more. Women pay more for annuities (a financial product that pays a fixed monthly or yearly sum until death) because they live longer. Obviously many women crash more than many men, and many men live longer than many women.

8. The impportant data–“Gabage in, garbage out

Data are to statistics what a good offensive line is to a star quarterback. In front of every star quarterback is a good group of blockers. They usually don’t get much credit. But without them, you won’t ever see a star quarterback. Assume that you are using good data, just as a cookbook assumes that you are not buying rancid meat and rotten vegetables. But even the finest recipe isn’t going to salvage a meal that begins with spoiled ingredients. So it is with statistics; no amount of fancy analysis can make up for fundamentally flawed data. Hence the expression “garbage in, garbage out.” Data deserve respect, just like offensive linemen.

We generally ask our data to do one of three things.

First, we may demand a data sample that is representative of some larger group or population. One of the most powerful findings in statistics is that inferences made from reasonably large, properly drawn samples can be every bit as accurate as attempting to elicit the same information from the entire population.
- The easiest way to gather a representative sample of a larger population is to select some subset of that population randomly. (Shockingly, this is known as a simple random sample.) The key to this methodology is that each observation in the relevant population must have an equal chance of being included in the sample.
- A representative sample is a fabulously important thing, for it opens the door to some of the most powerful tools that statistics has to offer.
- Getting a good sample is harder than it looks.
- Many of the most egregious statistical assertions are caused by good statistical methods applied to bad samples, not the opposite.
- Size matters, and bigger is better. The details will be explained in the coming chapters, but it should be intuitive that a larger sample will help to smooth away any freak variation.
The second thing we often ask of data is that they provide some source of comparison. Is a new medicine more effective than the current treatment? Are ex-convicts who receive job training less likely to return to prison than ex-convicts who do not receive such training? Do students who attend charter schools perform better than similar students who attend regular public schools?
- In these cases, the goal is to find two groups of subjects who are broadly similar except for the application of whatever “treatment” we care about.
- In the physical and biological sciences, creating treatment and control groups is relatively straightforward.
- One recurring research challenge with human subjects is creating treatment and control groups that differ only in that one group is getting the treatment and the other is not.
For this reason, the “gold standard” of research is randomization, a process by which human subjects (or schools, or hospitals, or whatever we’re studying) are randomly assigned to either the treatment or the control group. We do not assume that all the experimental subjects are identical. Instead, we assume that randomization will evenly divide all relevant characteristics between the two groups—both the characteristics we can observe, like race or income, but also confounding characteristics that we cannot measure or had not considered, such as perseverance or faith.
We sometimes have no specific idea what we will do with the information—but we suspect it will come in handy at some point. This is similar to a crime scene detective who demands that all possible evidence be captured so that it can be sorted later for clues. Some of this evidence will prove useful, some will not. If we knew exactly what would be useful, we probably would not need to be doing the investigation in the first place.

Behind every important study there are good data that made the analysis possible. And behind every bad study . . . well, read on. People often speak about “lying with statistics.” In fact, some of the most egregious statistical mistakes involve lying with data ; the statistical analysis is fine, but the data on which the calculations are performed are bogus or inappropriate. Here are some common examples of “garbage in, garbage out.”

Slection Bias
- One of the most famous statistical blunders of all time, the notorious Literary Digest poll of 1936, was caused by a biased sample. In that year, Kansas governor Alf Landon, a Republican, was running for president against incumbent Franklin Roosevelt, a Democrat. Literary Digest, an influential weekly news magazine at the time, mailed a poll to its subscribers and to automobile and telephone owners whose addresses could be culled from public records. All told, the Literary Digest poll included 10 million prospective voters, which is an astronomically large sample. As polls with good samples get larger, they get better, since the margin of error shrinks. As polls with bad samples get larger, the pile of garbage just gets bigger and smellier. Literary Digest predicted that Landon would beat Roosevelt with 57 percent of the popular vote. In fact, Roosevelt won in a landslide, with 60 percent of the popular vote and forty-six of forty-eight states in the electoral college. The Literary Digest sample was “garbage in”: the magazine’s subscribers were wealthier than average Americans, and therefore more likely to vote Republican, as were households with telephones and cars in 1936.

Publication Bias
- Here is the first sentence of a New York Times article on the publication bias surrounding drugs for treating depression: “The makers of antidepressants like Prozac and Paxil never published the results of about a third of the drug trials that they conducted to win government approval, misleading doctors and consumers about the drugs’ true effectiveness.” 4 It turns out that 94 percent of studies with positive findings on the effectiveness of these drugs were published, while only 14 percent of the studies with nonpositive results were published. For patients dealing with depression, this is a big deal. When all the studies are included, the antidepressants are better than a placebo by only “a modest margin.”

Recall Bias
- The New York Times Magazine described the insidious nature of this recall bias: The diagnosis of breast cancer had not just changed a woman’s present and the future; it had altered her past. Women with breast cancer had (unconsciously) decided that a higher-fat diet was a likely predisposition for their disease and (unconsciously) recalled a high-fat diet. It was a pattern poignantly familiar to anyone who knows the history of this stigmatized illness: these women, like thousands of women before them, had searched their own memories for a cause and then summoned that cause into memory.

Survivorship bias
- What is a traditional mutual fund company to do? Bogus data to the rescue! Here is how they can “beat the market” without beating the market. A large mutual company will open many new actively managed funds (meaning that experts are picking the stocks, often with a particular focus or strategy). For the sake of example, let’s assume that a mutual fund company opens twenty new funds, each of which has roughly a 50 percent chance of beating the S&P 500 in a given year. (This assumption is consistent with long-term data.) Now, basic probability suggests that only ten of the firm’s new funds will beat the S&P 500 the first year; five funds will beat it two years in a row; and two or three will beat it three years in a row.
- Here comes the clever part. At that point, the new mutual funds with unimpressive returns relative to the S&P 500 are quietly closed. (Their assets are folded into other existing funds.) The company can then heavily advertise the two or three new funds that have “consistently outperformed the S&P 500”—even if that performance is the stock-picking equivalent of flipping three heads in a row. The subsequent performance of these funds is likely to revert to the mean, albeit after investors have piled in. The number of mutual funds or investment gurus who have consistently beaten the S&P 500 over a long period is shockingly small.

9. The Central Limit Theorem

A t times, statistics seems almost like magic. We are able to draw sweeping and powerful conclusions from relatively little data. Somehow we can gain meaningful insight into a presidential election by calling a mere one thousand American voters. We can test a hundred chicken breasts for salmonella at a poultry processing plant and conclude from that sample alone that the entire plant is safe or unsafe. Where does this extraordinary power to generalize come from?

— Much of it comes from the central limit theorem.

The core principle underlying the central limit theorem is that a large, properly drawn sample will resemble the population from which it is drawn. Obviously there will be variation from sample to sample, but the probability that any sample will deviate massively from the underlying population is very low.
That’s the basic intuition behind the central limit theorem. When we add some statistical bells and whistles, we can quantify the likelihood that you will be right or wrong. For example, we might calculate that in a marathon field of 10,000 runners with a mean weight of 155 pounds, there is less than a 1 in 100 chance that a random sample of 60 of those runners (our lost bus) would have a mean weight of 220 pounds or more.
This kind of analysis all stems from the central limit theorem, which, from a statistical standpoint, has Lebron James–like power and elegance. According to the central limit theorem, the sample means for any population will be distributed roughly as a normal distribution around the population mean. Hang on for a moment as we unpack that statement.

Suppose we have a population, like our marathon field, and we are interested in the weights of its members. Any sample of runners, such as each bus of sixty runners, will have a mean.
If we take repeated samples, such as picking random groups of sixty runners from the field over and over, then each of those samples will have its own mean weight. These are the sample means.
Most of the sample means will be very close to the population mean. Some will be a little higher. Some will be a little lower. Just as a matter of chance, a very few will be significantly higher than the population mean, and a very few will be significantly lower.
The central limit theorem tells us that the sample means will be distributed roughly as a normal distribution around the population mean. The normal distribution, as you may remember from Chapter 2, is the bell-shaped distribution (e.g., adult men’s heights) in which 68 percent of the observations lie within one standard deviation of the mean, 95 percent lie within two standard deviations, and so on.
All of this will be true no matter what the distribution of the underlying population looks like. The population from which the samples are being drawn does not have to have a normal distribution in order for the sample means to be distributed normally.

The household income distribution in the United States. Household income is not distributed normally in America; instead, it tends to be skewed to the right. No household can earn less than $0 in a given year, so that must be the lower bound for the distribution. Meanwhile, a small group of households can earn staggeringly large annual incomes—hundreds of millions or even billions of dollars in some cases.
The median household income in the United States is roughly $51,900; the mean household income is $70,900.
Now suppose we take a random sample of 1,000 U.S. households and gather information on annual household income. On the basis of the information above, and the central limit theorem, what can we infer about this sample?
Quite a lot, it turns out. First of all, our best guess for what the mean of any sample will be is the mean of the population from which it’s drawn. The whole point of a representative sample is that it looks like the underlying population. A properly drawn sample will, on average, look like America. There will be hedge fund managers and homeless people and police officers and everyone else—all roughly in proportion to their frequency in the population. Therefore, we would expect the mean household income for a representative sample of 1,000 American households to be about $70,900. Will it be exactly that? No. But it shouldn’t be wildly different either.
If we took multiple samples of 1,000 households, we would expect the different sample means to cluster around the population mean, $70,900. We would expect some means to be higher, and some to be lower. Might we get a sample of 1,000 households with a mean household income of $427,000? Sure, that’s possible—but highly unlikely.

Simulation Examples

# Sampling data from normal population with mean 2 and standard deviation 4

n <- 200

Time <- 1000

sample_mean <- rep(0,Time)

sample_sd <-  rep(0,Time)

for(i in 1:Time){

 samples <- rnorm(n,mean=2, sd=4)
 
 sample_mean[i] <-mean(samples)
 
 sample_sd[i] <- sd(samples)

}

MEAN_mean <- mean(sample_mean)

MEAN_mean

## [1] 2.015055

MEAN_mean_sd <-sd(sample_mean)

MEAN_mean_sd

## [1] 0.2820093

4/sqrt(n)

## [1] 0.2828427

hist(sample_mean, breaks=20, freq=FALSE)

lines(seq(min(sample_mean), max(sample_mean), by=0.01),dnorm(seq(min(sample_mean), max(sample_mean), by=0.01), mean=2, sd=(4/sqrt(n) )), type="l")

MEAN_sd <- mean(sample_sd)

MEAN_sd

## [1] 4.006302

# Sampling data from Multinomial population with 1,2,3,4,5,6 and prob. 1/6,1/6,1/6,1/6,1/6,1/6  

n <- 200

Time <- 1000

sample_mean <- rep(0,Time)

sample_sd <-  rep(0,Time)

for(i in 1:Time){

 samples <-  rmultinom(n, size=1, prob=c(1/6,1/6,1/6,1/6,1/6,1/6))
 
 return_samples <- matrix(c(1,2,3,4,5,6), ncol=6)%*%samples
 
 sample_mean[i] <-mean(return_samples)
 
 sample_sd[i] <- sd(return_samples)

}

MEAN_mean <- mean(sample_mean)

MEAN_mean

## [1] 3.497865

MEAN_mean_sd <-sd(sample_mean)

MEAN_mean_sd

## [1] 0.1182652

sqrt(sum((c(1:6)-3.5)^2)/6/(200))

## [1] 0.1207615

hist(sample_mean, breaks=20, freq=FALSE)

lines(seq(min(sample_mean), max(sample_mean), by=0.01),dnorm(seq(min(sample_mean), max(sample_mean), by=0.01), mean=3.5, sd=(sqrt(sum((c(1:6)-3.5)^2)/6/(200)))), type="l")

sqrt(sum((c(1:6)-3.5)^2)/6)

## [1] 1.707825

MEAN_sd <- mean(sample_sd)

MEAN_sd

## [1] 1.705266

# Sampling data from chi-square population with degree freedom 4 

n <- 200

Time <- 1000

sample_mean <- rep(0,Time)

sample_sd <-  rep(0,Time)

samples <- rchisq(n, 4)

mean(samples)

## [1] 3.964792

sd(samples)

## [1] 2.310256

hist(samples, breaks=20, freq=FALSE)

lines(seq(min(samples), max(samples), by=0.01),dchisq(seq(min(samples), max(samples), by=0.01), 4), type="l")

for(i in 1:Time){

 samples <- rchisq(n, 4)
 
 sample_mean[i] <-mean(samples)
 
 sample_sd[i] <- sd(samples)

}

MEAN_mean <- mean(sample_mean)

MEAN_mean

## [1] 4.00817

MEAN_mean_sd <-sd(sample_mean)

MEAN_mean_sd

## [1] 0.1929887

sqrt(8)/sqrt(n)

## [1] 0.2

hist(sample_mean, breaks=20, freq=FALSE)

lines(seq(min(sample_mean), max(sample_mean), by=0.01),dnorm(seq(min(sample_mean), max(sample_mean), by=0.01), mean=4, sd=(sqrt(8)/sqrt(n) )), type="l")

MEAN_sd <- mean(sample_sd)

sqrt(8)

## [1] 2.828427

MEAN_sd

## [1] 2.82563

Two different measures of dispersion: the standard deviation and the standard error.
- The standard deviation measures dispersion in the underlying population.
- The standard error measures the dispersion of the sample means.
- Here is what ties the two concepts together: The standard error is the standard deviation of the sample means!
The “big picture” here is simple and massively powerful:
- If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean (regardless of what the distribution of the underlying population looks like).
- Most sample means will lie reasonably close to the population mean; the standard error is what defines “reasonably close.”
- The central limit theorem tells us the probability that a sample mean will lie within a cert ain distance of the population mean. It is relatively unlikely that a sample mean will lie more than two standard errors from the population mean, and extremely unlikely that it will lie three or more standard errors from the population mean.
- The less likely it is that an outcome has been observed by chance, the more confident we can be in surmising that some other factor is in play.
That’s pretty much what statistical inference is about. The central limit theorem is what makes most of it possible.

Naked Statisics II – Basic Ideas Behind Statistics

PENG Heng

2022-10-24