Data Learning II – Regression

Can stress on the job kill you? Yes

There is compelling evidence that rigors on the job can lead to premature death, particularly of heart disease. But it’s not the kind of stress you are probably imagining.
CEO, routinely make massively important decisions that determine the fate of their company.
Secretaries, dutifully answer the phone and perform other tasks as instructed.
It turns out that athe most dangerous kind of job stress stems from having low control over one’s responsibilites.
Several studies of thousands of British civil servants (the Whitehall studies) have found that workers who have little control over their jobs—meaning they have minimal say over what tasks are performed or how those tasks are carried out—have a significantly higher mortality rate than other workers in the civil service with more decision-making authority.
According to the above research, it is not the stress associated with major responsibilities that will kill you; it is the stress associated with being told what to do while having little say in how or when it gets done.
How researchers can possibly come to such a conclusion ?
- Clearly this cannot be a randomized experiment.
- Instead, Instead, researchers have collected detailed longitudinal data on thousands of individuals in the British civil service; these data can be analyzed to identify meaningful associations, such as the connection between “low control” jobs and coronary heart disease.
- A simple association is not enough to conclude that certain kinds of jobs are bad for your health.
- If we merely observe that low-ranking workers in the British civil service hierarchy have higher rates of heart disease, our results would be confounded by other factors, i.e. Low level workers vs less education, smoke, low salary vs limite their access to health care, and so on.
- The point is that any study simply comparing health outcomes across a large group of British workers—or across any other large group—will not really tell us much. Other sources of variation in the data are likely to obscure the relationship that we care about. Is “low job control” really causing heart disease? Or is it some combination of other factors that happen to be shared by people with low job control, in which case we may be completely missing the real public health threat.

Regression analysis is the statistical tool that helps us deal with this challenge.
- It allows us to quantify the relationship between a particular variable and an outcome that we care about while controlling for other factors.
- Given adequate data and access to a personal computer, the problem is that the mechanics of regression analysis are not the hard part; the hard part is determining which variables ought to be considered in the analysis and how that can best be done.
- The second immportant phrase is “help us estimate” or how to interpretation of the results of regression.
- Regression analysis is similar to polling. The good news is that if we have a large representative sample and solid methodology, the relationship we observe for our sample data is not likely to deviate wildly from the true relationship for the whole population.
- The bad news is that we are not proving definitively that exercise prevents heart disease. We are instead rejecting the null hypothesis that exercise has no association with heart disease, on the basis of some statistical threshold that was chosen before the study was conducted.
- Causality could go the other direction. Could having a healthy heart “cause” exercise? Yes. Individuals who are infirm, particularly those who have some incipient form of heart disease, will find it much harder to exercise. They will certainly be less likely to play squash regularly.
- Regression analysis has the amazing capacity to isolate a statistical relationship that we care about, such as that between job control and heart disease, while taking into account other factors that might confuse the relationship.
- How can we discern which part of their poor cardiovascular health is due to their low-level jobs, and which part is due to the smoking? These two factors seem inextricably intertwined. Regression analysis (done properly!) can untangle them.

Example: Simple Regression

The Boston data set records medv (median house value) for 506 census tracts in Boston. We will seek to predict medv using 12 predictors such as rmvar (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).

library(MASS)
library(ISLR2)

## 
## Attaching package: 'ISLR2'

## The following object is masked from 'package:MASS':
## 
##     Boston

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

attach(Boston)

lm.fit <- lm(medv ~ lstat)

lm.fit

## 
## Call:
## lm(formula = medv ~ lstat)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95

summary(lm.fit)

## 
## Call:
## lm(formula = medv ~ lstat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

names(lm.fit)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

coef(lm.fit)

## (Intercept)       lstat 
##  34.5538409  -0.9500494

confint(lm.fit)

##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),
    interval = "confidence")

##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461

plot(lstat, medv)

abline(lm.fit, lwd=3, col="red")

Hence \[ \mbox{medv} \approx 34.5538 -0.9500\times \mbox{lstat}\]

par(mfrow = c(2, 2))
plot(lm.fit)

par(mfrow = c(1, 1))
plot(predict(lm.fit), residuals(lm.fit))

# On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues() function.

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))

## 375 
## 375

Example: Multiple Linear Regression

lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)

## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

# The Boston data set contains 12 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:

lm.fit <- lm(medv ~ ., data = Boston)

summary(lm.fit)

## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
## crim         -0.121389   0.033000  -3.678 0.000261 ***
## zn            0.046963   0.013879   3.384 0.000772 ***
## indus         0.013468   0.062145   0.217 0.828520    
## chas          2.839993   0.870007   3.264 0.001173 ** 
## nox         -18.758022   3.851355  -4.870 1.50e-06 ***
## rm            3.658119   0.420246   8.705  < 2e-16 ***
## age           0.003611   0.013329   0.271 0.786595    
## dis          -1.490754   0.201623  -7.394 6.17e-13 ***
## rad           0.289405   0.066908   4.325 1.84e-05 ***
## tax          -0.012682   0.003801  -3.337 0.000912 ***
## ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
## lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16

names(summary(lm.fit))

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

summary(lm.fit)$r.sq

## [1] 0.734307

summary(lm.fit)$sigma

## [1] 4.798034

# What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.

lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)

## 
## Call:
## lm(formula = medv ~ . - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1851  -2.7330  -0.6116   1.8555  26.3838 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.525128   4.919684   8.441 3.52e-16 ***
## crim         -0.121426   0.032969  -3.683 0.000256 ***
## zn            0.046512   0.013766   3.379 0.000785 ***
## indus         0.013451   0.062086   0.217 0.828577    
## chas          2.852773   0.867912   3.287 0.001085 ** 
## nox         -18.485070   3.713714  -4.978 8.91e-07 ***
## rm            3.681070   0.411230   8.951  < 2e-16 ***
## dis          -1.506777   0.192570  -7.825 3.12e-14 ***
## rad           0.287940   0.066627   4.322 1.87e-05 ***
## tax          -0.012653   0.003796  -3.333 0.000923 ***
## ptratio      -0.934649   0.131653  -7.099 4.39e-12 ***
## lstat        -0.547409   0.047669 -11.483  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.794 on 494 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7284 
## F-statistic: 124.1 on 11 and 494 DF,  p-value: < 2.2e-16

Example: Interaction Terms

It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat * age simultaneously includes lstat, age, and the interaction term lstat×age as predictors; it is a shorthand for lstat + age + lstat:age. %We can also pass in transformed versions of the predictors.

summary(lm(medv ~ lstat * age, data = Boston))

## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

Hence we could have

\[ \mbox{medv} \approx 36.0885 -1.3921 \mbox{lstat}-0.0007 \mbox{age}+0.0042 \mbox{lstat}\times \mbox{age}\]

Example: Non-linear Transformation of the Preditors

The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor $X$, we can create a predictor $X^2$ using $I(X^2)$. The function I() is needed since the ^ has a special meaning in a formula object; wrapping as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat^2.

lm.fit2 <- lm(medv ~ lstat + I(lstat^2))

summary(lm.fit2)

## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

So \[ \mbox{medv} \approx 42.8620 -2.3328 \mbox{lstat}+0.0435\mbox{lstat}^2 \]

par(mfrow = c(2, 2))
plot(lm.fit2)

In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:

lm.fit5 <- lm(medv ~ poly(lstat, 5))
summary(lm.fit5)

## 
## Call:
## lm(formula = medv ~ poly(lstat, 5))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
## poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
## poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
## poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
## poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
## poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16

# of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.

summary(lm(medv ~ log(rm), data = Boston))

## 
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.487  -2.875  -0.104   2.837  39.816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -76.488      5.028  -15.21   <2e-16 ***
## log(rm)       54.055      2.739   19.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared:  0.4358, Adjusted R-squared:  0.4347 
## F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16

Example: Qualitative Predictors

We will now examine the Carseats data, which is part of the ISLR2 library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

The Carseats data includes qualitative predictors such as shelveloc, an indicator of the quality of the shelving location—that is, the space within a store in which the car seat is displayed—at each location. The predictor shelveloc takes on three possible values: Bad, Medium, and Good. Given a qualitative variable such as shelveloc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, 
    data = Carseats)
summary(lm.fit)

## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

# The contrasts() function returns the coding that R uses for the dummy variables.

attach(Carseats)
contrasts(ShelveLoc)

##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1

R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location.

**For any regression coefficient, you will generally be interested in three things: sign, size, and significance.**

Sign . The sign (positive or negative) on the coefficient for an independent variable tells us the direction of its association with the dependent variable (the outcome we are trying to explain).
Size. How big is the observed effect between the independent variable and the dependent variable? Is it of a magnitude that matter?
- Statistical Significant and socially insignificants
For example, suppose that we are examining determinants of income. Why do some people make more money than others? The explanatory variables are likely to be things like education, years of work experience, and so on. In a large data set, researchers might also find that people with whiter teeth earn $86 more per year than other workers, ceteris paribus.
Significance Is the observed result an aberration based on a quirky sample of data, or does it reflect a meaningful association that is likely to be observed for the population as a whole?
- Our basic regression analysis produces one other statistic of note: the $R^2$ which is a measure of the total amount of variation explained by the regression equation.

When we include multiple variables in the regression equation, the analysis gives us an estimate of the linear association between each explanatory variable and the dependent variable while holding other dependent variables constant, or “controlling for” these other factors.

Example:

Weight ~ Height + Age + sex +Education + poverty + race + exercise+etc

We can use regression analysis to separate out the independent effect of each of the potential explanatory factors described above. For example, we can isolate the association between race and weight, holding constant other socioeconomic factors like educational background and poverty. Among people who are high school graduates and eligible for food stamps, what is the statistical association between weight and being black?

Example:

Gender discrimination in the workplace: The curious thing about discrimination is that it’s hard to observe directly. No employer ever states explicitly that someone is being paid less because of his or her race or gender or that someone has not been hired for discriminatory reasons (which would presumably leave the person in a different job with a lower salary).
Instead, what we observe are gaps in pay by race and gender that may be the result of discrimination: whites earn more than blacks; men earn more than women; and so on.
The methodological challenge is that these observed gaps may also be the result of underlying differences in workers that have nothing to do with workplace discrimination, such as the fact that women tend to choose more part-time work. How much of the wage gap is due to factors associated with productivity on the job, and how much of the gap, if any, is due to labor force discrimination? No one can claim that this is a trivial question.
Regression analysis can help us answer it. However, our methodology will be slightly more roundabout than it was with our analysis explaining weight. Since we cannot measure discrimination directly, we will examine other factors that traditionally explain wages, such as education, experience, occupational field, and so on.
The case for discrimination is circumstantial: If a significant wage gap remains after controlling for other factors that typically explain wages, then discrimination is a likely culprit. The larger the unexplained portion of any wage gap, the more suspicious we should be.
According to the study, discrimination is not a likely explanation for most of the gap. The gender wage gap fades away as the authors add more explanatory variables to the analysis. For example, men take more finance classes in the MBA program and graduate with higher grade point averages. When these data are included as control variables in the regression equation, the unexplained portion of the gap in male-female earnings drops to 19 percent.
When variables are added to the equation to account for post-MBA work experience, particularly spells out of the labor force, the unexplained portion of the male-female wage gap drops to 9 percent. And when explanatory variables are added for other work characteristics, such as employer type and hours worked, the unexplained portion of the gender wage gap falls to under 4 percent.
For workers who have been in the labor force more than ten years, the authors can ultimately explain all but 1 percent of the gender wage gap with factors unrelated to discrimination on the job.
Conclusion: “We identify three proximate reasons for the large and rising gender gap in earnings: differences in training prior to MBA graduation; differences in career interruptions; and differences in weekly hours. These three determinants can explain the bulk of gender differences across the years following MBA completion.”

Example:

The connection between stress on the job and coronary heart disease.

The Whitehall studies of British civil servants sought to measure the association between grade of employment and death from coronary heart disease over the ensuing years.
One of the early studies followed 17,530 civil servants for seven and a half years.
The authors concluded, “Men in the lower employment grades were shorter, heavier for their height, had higher blood pressure, higher plasma glucose, smoked more, and reported less leisure-time physical activity than men in the higher grades. Yet when allowance was made for the influence on mortality of all of these factors plus plasma cholesterol, the inverse association between grade of employment and [coronary heart disease] mortality was still strong.”
The study demonstrates that holding other health factors constant (including height, which is a decent proxy for early childhood health and nutrition), working in a “low grade” job can literally kill you.
“low-control” jobs are bad for your health. That may or may not be synonymous with being low on the administrative totem pole.
A follow-up study using a second sample of 10,308 British civil servants sought to drill down on this distinction. The workers were once again divided into administrative grades—high, intermediate, and low—only this time the participants were also given a fifteen-item questionnaire that evaluated their level of “decision latitude or control.” These included questions such as “Do you have a choice in deciding how you do your job?” and categorical responses (ranging from “never” to “often”) to statements such as “I can decide when to take a break.”
The researchers found that the “low-control” workers were at significantly higher risk of developing coronary heart disease over the course of the study than “high-control” workers. Yet researchers also found that workers with rigorous job demands were at no greater risk of developing heart disease, nor were workers who reported low levels of social support on the job. Lack of control seems to be the killer, literally.
The Whitehall studies have two characteristics typically associated with strong research.
- First, the results have been replicated elsewhere. In the public health literature, the “low-control” idea evolved into a term known as “job strain,” which characterizes jobs with “high psychological workload demands” and “low decision latitude.” Between 1981 and 1993, thirty-six studies were published on the subject; most found a significant positive association between job strain and heart disease.
- Second, researchers sought and found corroborating biological evidence to explain the mechanism by which this particular kind of stress on the job causes poor health. Work conditions that involve rigorous demands but low control can cause physiological responses (such as the release of stress-related hormones) that increase the risk of heart disease over the long run. Even animal research plays a role; low-status monkeys and baboons (who bear some resemblance to civil servants at the bottom of the authority chain) have physiological differences from their high-status peers that put them at greater cardiovascular risk.

Common Regression Mistakes

“Do not kill people with your research.” Because some very smart people have inadvertently violated that rule.

Beginning in the 1990s, the medical establishment coalesced around the idea that older women should take estrogen supplements to protect against heart disease, osteoporosis, and other conditions associated with menopause.
By 2001, some 15 million women were being prescribed estrogen in the belief that it would make them healthier.
In particular, a longitudinal study of 122,000 women (the Nurses’ Health Study) found a negative association between estrogen supplements and heart attacks. Women taking estrogen had one-third as many heart attacks as women who were not taking estrogen.
This was not a couple of teenagers using dad’s computer to check out pornography and run regression equations. The Nurses’ Health Study is run by the Harvard Medical School and the Harvard School of Public Health.
Meanwhile, scientists and physicians offered a medical theory for why hormone supplements might be beneficial for female health. A woman’s ovaries produce less estrogen as she ages; if estrogen is important to the body, then making up for this deficit in old age could be protective of a woman’s long-term health. Hence the name of the treatment: hormone replacement therapy. Some researchers even began to suggest that older men should receive an estrogen boost.
While millions of women were being prescribed hormone replacement therapy, estrogen was subjected to the most rigorous form of scientific scrutiny: clinical trials.
Health Study for statistical associations that may or may not be causal, a clinical trial consists of a controlled experiment. One sample is given a treatment, such as hormone replacement; another sample is given a placebo. Clinical trials showed that women taking estrogen had a higher incidence of heart disease, stroke, blood clots, breast cancer, and other adverse health outcomes. Estrogen supplements did have some benefits, but those benefits were far outweighed by other risks.
Beginning in 2002, doctors were advised not to prescribe estrogen for their aging female patients.
The New York Times Magazine asked a delicate but socially significant question: How many women died prematurely or suffered strokes or breast cancer because they were taking a pill that their doctors had prescribed to keep them healthy?
- The answer: “A reasonable estimate would be tens of thousands.”

Regression analysis is the hydrogen bomb of the statistics arsenal. What could possibly go wrong?

Regression analysis provides precise answers to complicated questions. These answers may or may not be accurate.
In the wrong hands, regression analysis will yield results that are misleading or just plain wrong. And, as the estrogen example illustrates, even in the right hands this powerful statistical tool can send us speeding dangerously in the wrong direction.
As with all other kinds of statistical analysis, clever people can knowingly exploit these methodological points to nefarious ends.
We should not use explanatory variables that might be affected by the outcome that we are trying to explain, or else the results will become hopelessly tangled.

Here is a “Top Seven” list of the most common abuses of an otherwise extraordinary tool.

1. Using regression to analyze a nonlinear relationship

2. Correlation does not equal causation

If we were to include annual per capita income in China as an explanatory variable, we would almost certainly find a positive and statistically significant association between rising incomes in China and rising autism rates in the U.S. over the past twenty years.
Why? Because they both have been rising sharply over the same period. Yet I highly doubt that a sharp recession in China would reduce the autism rate in the United States.
To be fair, if we observed a strong relationship between rapid economic growth in China and autism rates in China alone, we might begin to search for some environmental factor related to economic growth, such as industrial pollution, that might explain the association.

3. Reverse causality

A statistical association between A and B does not prove that A causes B. In fact, it’s entirely plausible that B is causing A.
Causality may go in both directions. Education spending could boost economic growth, which makes possible additional education spending—the causality could be going in both ways.

4. Omitted variable bias

Regression results will be misleading and inaccurate if the regression equation leaves out an important explanatory variable, particularly if other variables in the equation “pick up” that effect.
“Golfers More Prone to Heart Disease, Cancer, and Arthritis!”, it is not suprprised.
Golf is probably good for your health because it provides socialization and modest exercise.
How can those two statements be reconciled?
Very easily. Any study that attempts to measure the effects of playing golf on health must control properly for age.
In general, people play more golf when they get older, particularly in retirement. Any analysis that leaves out age as an explanatory variable is going to miss the fact that golfers, on average, will be older than nongolfers. Golf isn’t killing people; old age is killing people, and they happen to enjoy playing golf while it does.

5. Highly correlated explanatory variables (multicollinearity)

If a regression equation includes two or more explanatory variables that are highly correlated with each other, the analysis will not necessarily be able to discern the true relationship between each of those variables and the outcome that we are trying to explain.

6. Extrapolating beyond the data

Regression analysis, like all forms of statistical inference, is designed to offer us insights into the world around us. We seek patterns that will hold true for the larger population. However, our results are valid only for a population that is similar to the sample on which the analysis has been done

7. Data mining too many variables

Your results can be compromised if you include too many variables, particularly extraneous explanatory variables with no theoretical justification.
Since we don’t know what causes autism, we should put as many potential explanatory variables as possible in the regression equation just to see what might turn up as statistically significant; then maybe we’ll get some answers.
If you put enough junk variables in a regression equation, one of them is bound to meet the threshold for statistical significance just by chance. The further danger is that junk variables are not always easily recognized as such.
In a class of forty students or so, let each student flip a coin. Any student who flips tails is eliminated; the rest flip again. In the second round, those who flip tails are once again eliminated. continuing the rounds of flipping until one student has flipped five or six heads in a row. You may recall some of the silly follow-up questions: “What’s your secret? Is it in the wrist? Can you teach us to flip heads all the time? Maybe it’s that Harvard sweatshirt you’re wearing.”
Obviously the string of heads is just luck; the students have all watched it happen. However, that is not necessarily how the result could or would be interpreted in a scientific context.

Example: Data from The Fifth National Bank of Springfield for gender discrimination suit

BANK1 <-read.csv("BANK1.csv") # laoding the data set  into R

names(BANK1) # Check the variables collected by the data

##  [1] "X"        "Employee" "EducLev"  "JobGrade" "YrHired"  "YrBorn"  
##  [7] "Gender"   "YrsPrior" "PCJob"    "Salary"

# Cleaning data

BANK1 <- BANK1[,-1] # removing unnecesary variables.

BANK1$YrsExp <- 95 -BANK1$YrHired # transforming data to be more significant 

BANK1$Age <- 95 - BANK1$YrBorn # Calculate the age of employ ee

BANK1_new <-BANK1[(BANK1$Age<=60)&(BANK1$YrsExp<=30), ] # clearning the observations which do not appear typical. 

dim(BANK1_new)

## [1] 199  11

BANK1_new$JobGrade <-as.factor(BANK1_new$JobGrade) # Define the categorical data as data type factor 

BANK1_new$EducLev <-as.factor(BANK1_new$EducLev)

Results <-lm(Salary~.-Employee-YrBorn-YrHired, data=BANK1_new) # Regression to check the dependence of the salary with other variables 

summary(Results)

## 
## Call:
## lm(formula = Salary ~ . - Employee - YrBorn - YrHired, data = BANK1_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9977 -2.1509 -0.2543  1.7764 11.1935 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.87729    1.57539  18.330  < 2e-16 ***
## EducLev2    -0.89115    0.89698  -0.994  0.32177    
## EducLev3    -0.29770    0.86872  -0.343  0.73222    
## EducLev4     0.33981    1.51557   0.224  0.82284    
## EducLev5     1.92200    1.06187   1.810  0.07193 .  
## JobGrade2    1.60927    0.74640   2.156  0.03238 *  
## JobGrade3    5.34225    0.79817   6.693 2.56e-10 ***
## JobGrade4    9.77401    0.95723  10.211  < 2e-16 ***
## JobGrade5   15.33462    1.20896  12.684  < 2e-16 ***
## JobGrade6   22.40778    1.80388  12.422  < 2e-16 ***
## GenderMale   0.67645    0.64632   1.047  0.29665    
## YrsPrior     0.14592    0.09045   1.613  0.10840    
## PCJobYes     4.18315    0.92580   4.518 1.11e-05 ***
## YrsExp       0.25604    0.07478   3.424  0.00076 ***
## Age          0.02409    0.03905   0.617  0.53806    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.538 on 184 degrees of freedom
## Multiple R-squared:  0.8232, Adjusted R-squared:  0.8098 
## F-statistic:  61.2 on 14 and 184 DF,  p-value: < 2.2e-16

Data Learning II – Regression

PENG Heng

2022-11-14

**For any regression coefficient, you will generally be interested in three things: sign, size, and significance.**

Common Regression Mistakes

Example: Data from The Fifth National Bank of Springfield for gender discrimination suit