Math 50 Fall 2017, Homework #6

NOTE: For your homework download and use the template (https://math.dartmouth.edu/~m50f17/HW6.Rmd)

Read the green comments in the rmd file to see where your answers should go.

Lets first look at the scatter plot for the windmill data, and visually check the straight line fit.

windmill <- read.table("https://math.dartmouth.edu/~m50f17/windmill.csv", header=T)
plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current")
fit <- lm(DC~velocity, data = windmill)
abline(fit$coefficients, col="red")

The summary statistics are below, \(R^2\) is about 0.87. The residual plot below suggests that the relation might be non-linear. When you look at the above scatter diagram one might think the straight line model seems OK, however the residual plot below amplifies the nonlinearity. Why? Can we also see this by carefully looking at the scatter plot above?

summary(fit)

## 
## Call:
## lm(formula = DC ~ velocity, data = windmill)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59869 -0.14099  0.06059  0.17262  0.32184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.13088    0.12599   1.039     0.31    
## velocity     0.24115    0.01905  12.659 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.869 
## F-statistic: 160.3 on 1 and 23 DF,  p-value: 7.546e-12

plot(fitted.values(fit), rstudent(fit), xlab = "y", ylab = "R-Student residuals", main = "Windmill - Residual Plot")
abline(c(0,0), col="red")

Also note that it looks like that there is a potential outlier and however this might change when we fix the model. It seems consistent with the rest (visually). Start with fitting a quadratic model.

fit2 <- lm(DC~poly(velocity, degree = 2), data = windmill)
summary(fit2)

## 
## Call:
## lm(formula = DC ~ poly(velocity, degree = 2), data = windmill)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26347 -0.02537  0.01264  0.03908  0.19903 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.60960    0.02453  65.605  < 2e-16 ***
## poly(velocity, degree = 2)1  2.98825    0.12267  24.359  < 2e-16 ***
## poly(velocity, degree = 2)2 -0.97493    0.12267  -7.947 6.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1227 on 22 degrees of freedom
## Multiple R-squared:  0.9676, Adjusted R-squared:  0.9646 
## F-statistic: 328.3 on 2 and 22 DF,  p-value: < 2.2e-16

plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current")
 
lines(sort(windmill$velocity), fitted(fit2)[order(windmill$velocity)], col='red')

This seems to fix the curved nature of the data, however the application domain suggests to use a model of the form \[ y = \beta_0 + \beta_1 \frac{1}{x} + \varepsilon \] Note that there doesn’t seem a potential outlier in the new model.

velRep = 1/windmill$velocity
DC <- windmill$DC

plot(velRep, windmill$DC, xlab = "1/velocity", ylab = "DC current")
fit3 <- lm(DC~velRep)
abline(fit3$coefficients, col="red")

summary(fit3)

## 
## Call:
## lm(formula = DC ~ velRep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20547 -0.04940  0.01100  0.08352  0.12204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9789     0.0449   66.34   <2e-16 ***
## velRep       -6.9345     0.2064  -33.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09417 on 23 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9792 
## F-statistic:  1128 on 1 and 23 DF,  p-value: < 2.2e-16

plot(fitted.values(fit), rstudent(fit3), xlab = "fitted values", ylab = "Studentized residuals", main = "Residuals - reciprocal model")
abline(c(0,0), col="red")

Question-1

Recall the phytoplankton population data is given at : https://math.dartmouth.edu/~m50f17/phytoplankton.csv

where headers are

pop : population of phytoplankton (\(y\))
subs2 : concentration of substance-2 (\(x\))

Plot the scatter diagram for pop ~ subs2. Do you think a straight line model is adequate? Fit a straight line model and support your argument with summary statistics.
Do you suggest to use Box-Cox method? If not explain, if so apply the method and demonstrate the improvement.
An analyst suggests to use the following model \[ y = \beta_0 + \beta_1 (x-4.5)^2 \] Using transformations, fit a simple linear regression model. Plot the scatter diagram and fitted curve (Note: it is not a straight line in this case). Compare \(MS_{res}\), \(R^2\) and the R-student residual plots with the model in part a.
Construct the probability plot for part (c). Is there a problem with the normality assumption? If so determine the problem (heavy tailed, light tailed, or something else)

Answer:

pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2


fitted=lm(pop~subs2)
plot (subs2, pop)
abline(fitted$coefficients)

qqnorm(rstudent(fitted), datax = TRUE ,
       main="Normal Probability Plot") 
qqline(rstudent(fitted), datax = TRUE )

summary(fitted)

## 
## Call:
## lm(formula = pop ~ subs2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.645 -12.577   3.176  16.593  33.879 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  201.250      8.672  23.206  < 2e-16 ***
## subs2        -13.712      1.759  -7.793 1.06e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.14 on 148 degrees of freedom
## Multiple R-squared:  0.291,  Adjusted R-squared:  0.2862 
## F-statistic: 60.74 on 1 and 148 DF,  p-value: 1.063e-12

cat("Part a.
     Straight line model doesn't seem to be adequate which can be seen visually, also verly low R2.")

## Part a.
##      Straight line model doesn't seem to be adequate which can be seen visually, also verly low R2.

library(MASS)
bc <- boxcox(pop ~ subs2, lambda = seq(1,6,0.1))

lambda <- bc$x[which.max(bc$y)]
fittedBC=lm(pop^lambda ~subs2)
summary(fittedBC)

## 
## Call:
## lm(formula = pop^lambda ~ subs2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212157  -79978   -1759   79359  208261 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   610951      43847  13.934  < 2e-16 ***
## subs2         -62380       8896  -7.013 7.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101800 on 148 degrees of freedom
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2443 
## F-statistic: 49.18 on 1 and 148 DF,  p-value: 7.755e-11

plot (subs2, pop^lambda)
abline(fittedBC$coefficients)

plot(predict(fitted), rstudent(fitted) )

plot(predict(fittedBC), rstudent(fittedBC) )

cat("Part b.
     Both normality and constant variance seems violated from the plots, however these might be due to fitting a straight line model to a nonlinear relation. After trying Box-Cox, even though R2 improved significantly we can't see improvement in normality, also constant variance assumption seems worse after boxcox, and still not helpful for the nonlinear relation.")

## Part b.
##      Both normality and constant variance seems violated from the plots, however these might be due to fitting a straight line model to a nonlinear relation. After trying Box-Cox, even though R2 improved significantly we can't see improvement in normality, also constant variance assumption seems worse after boxcox, and still not helpful for the nonlinear relation.

s2Rep = (subs2-4.5)^2
popT <- pop
fit2 <- lm ( popT ~ s2Rep^2 )
summary(fit2)

## 
## Call:
## lm(formula = popT ~ s2Rep^2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.484  -5.751   2.395   7.951  20.161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 154.7877     1.3857  111.70   <2e-16 ***
## s2Rep       -20.1264     0.9774  -20.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.16 on 148 degrees of freedom
## Multiple R-squared:  0.7413, Adjusted R-squared:  0.7395 
## F-statistic: 424.1 on 1 and 148 DF,  p-value: < 2.2e-16

plot (subs2, popT)
lines(sort(subs2), fitted(fit2)[order(subs2)], col='red')

plot(fitted.values(fit2), rstudent(fit2), xlab = "y", ylab = "R-Student residuals")
abline(c(0,0), col="red")

summary(fit2)

## 
## Call:
## lm(formula = popT ~ s2Rep^2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.484  -5.751   2.395   7.951  20.161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 154.7877     1.3857  111.70   <2e-16 ***
## s2Rep       -20.1264     0.9774  -20.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.16 on 148 degrees of freedom
## Multiple R-squared:  0.7413, Adjusted R-squared:  0.7395 
## F-statistic: 424.1 on 1 and 148 DF,  p-value: < 2.2e-16

cat("Part c.
     With the new model, plots suggests improvement in the fitting, significant increase in R2 and improvement in constant variance issues. However normality is still a problem. See below")

## Part c.
##      With the new model, plots suggests improvement in the fitting, significant increase in R2 and improvement in constant variance issues. However normality is still a problem. See below

qqnorm(rstudent(fit2), datax = TRUE ,
       main="Normal Probability Plot") 
qqline(rstudent(fit2), datax = TRUE )

cat("Part d.
    Normality assumption seems to be violated. It is heavy tailed on the left and light tailed on the right. ")

## Part d.
##     Normality assumption seems to be violated. It is heavy tailed on the left and light tailed on the right.

Question-2

Chapter 5, Problem 2 all parts.

Answer:

library(MPV)

## 
## Attaching package: 'MPV'

## The following object is masked from 'package:MASS':
## 
##     cement

## The following object is masked from 'package:datasets':
## 
##     stackloss

data(p5.2)
pressure <-p5.2$vapor
temperature <- p5.2$temp
plot(temperature, pressure)

cat("Part a.
    A Clear nonlinear relations.")

## Part a.
##     A Clear nonlinear relations.

fitT <- lm( pressure ~ temperature)
plot(temperature,pressure)
abline(fitT$coefficients)

plot( predict(fitT), rstudent(fitT), main = "R Student Residuals vs y")
abline(c(0,0), col="red")

summary(fitT)

## 
## Call:
## lm(formula = pressure ~ temperature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -120.63  -92.10  -37.66   64.33  222.55 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1956.258    363.807  -5.377 0.000446 ***
## temperature     6.686      1.121   5.964 0.000212 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.6 on 9 degrees of freedom
## Multiple R-squared:  0.7981, Adjusted R-squared:  0.7756 
## F-statistic: 35.57 on 1 and 9 DF,  p-value: 0.0002117

cat("Part b.
     We can see nonlinear relation in the residuals also, normality assumption doesn't seem to be valid. Straight line model is inadequate overall.")

## Part b.
##      We can see nonlinear relation in the residuals also, normality assumption doesn't seem to be valid. Straight line model is inadequate overall.

tInv <- (1/temperature)
logPressure <- log(pressure)
fitTInv <- lm( logPressure ~ tInv) 
plot(tInv,logPressure)
abline(fitTInv$coefficients)

plot( predict(fitTInv), rstudent(fitTInv), main = "R Student Residuals vs y (Temp Inv)")
abline(c(0,0), col="red")

summary(fitTInv)

## 
## Call:
## lm(formula = logPressure ~ tInv)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.030999 -0.013107  0.004863  0.016735  0.021260 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.061e+01  6.325e-02   325.8   <2e-16 ***
## tInv        -5.201e+03  2.014e+01  -258.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02067 on 9 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 6.672e+04 on 1 and 9 DF,  p-value: < 2.2e-16

cat("Part c.
     Fitting seems better with this model, significant improvement in R2 and overall fit visually. Non linear relation is now seems to be tranformed into a linear relation. However residuals doesn't seem to be normal and independently distributed, seems correllated (however this might be result of the nonlinear relation even after the transformation). ")

## Part c.
##      Fitting seems better with this model, significant improvement in R2 and overall fit visually. Non linear relation is now seems to be tranformed into a linear relation. However residuals doesn't seem to be normal and independently distributed, seems correllated (however this might be result of the nonlinear relation even after the transformation).

Question-3

Chapter 5, Problem 3 all parts. Note: In part (c) consider natural log of the ~~minutes~~ number of bacteria.

Answer:

data(p5.3)
numBacteria <- p5.3$bact
minutes <- p5.3$min

plot(minutes, numBacteria)

cat("Part a.
     There might be a slightly nonlinear relation between minutes and number of bacteria. Which will be more apparent below. ")

## Part a.
##      There might be a slightly nonlinear relation between minutes and number of bacteria. Which will be more apparent below.

fit <- lm( numBacteria ~ minutes )
plot(minutes , numBacteria)
abline(fit$coefficients)

plot( predict(fit), rstudent(fit), main = "R Student Residuals vs y")
abline(c(0,0), col="red")

summary(fit)

## 
## Call:
## lm(formula = numBacteria ~ minutes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.323  -9.890  -7.323   2.463  45.282 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   142.20      11.26  12.627 1.81e-07 ***
## minutes       -12.48       1.53  -8.155 9.94e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.3 on 10 degrees of freedom
## Multiple R-squared:  0.8693, Adjusted R-squared:  0.8562 
## F-statistic: 66.51 on 1 and 10 DF,  p-value: 9.944e-06

qqnorm(rstudent(fit), datax = TRUE ,
       main="Normal Probability Plot") 
qqline(rstudent(fit), datax = TRUE )

cat("Part b.
     There is a clear outlier. Normality assumption seems violated, nonlinear pattern more visible in residual plot. ")

## Part b.
##      There is a clear outlier. Normality assumption seems violated, nonlinear pattern more visible in residual plot.

fitted <- lm( log(numBacteria) ~ (minutes) )
plot( predict(fitted), rstudent(fitted), main = "R Student Residuals vs y")
abline(c(0,0), col="red")

summary(fitted)

## 
## Call:
## lm(formula = log(numBacteria) ~ (minutes))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.184303 -0.083994  0.001453  0.072825  0.206246 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.33878    0.07409   72.05 6.47e-15 ***
## minutes     -0.23617    0.01007  -23.46 4.49e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1204 on 10 degrees of freedom
## Multiple R-squared:  0.9822, Adjusted R-squared:  0.9804 
## F-statistic: 550.3 on 1 and 10 DF,  p-value: 4.489e-10

plot((minutes), log(numBacteria))
abline(fitted$coefficients)

qqnorm(rstudent(fitted), datax = TRUE ,
       main="Normal Probability Plot") 
qqline(rstudent(fitted), datax = TRUE )

cat("Part c.
     The plots and summary shows much better fitting with high R2. Residual plot seems fine and fitting is satisfactory. Normality assumption also seems fine.")

## Part c.
##      The plots and summary shows much better fitting with high R2. Residual plot seems fine and fitting is satisfactory. Normality assumption also seems fine.