NOTE: For your homework, download and use the template (https://math.dartmouth.edu/~m50f17/HW5.Rmd).

Read the green comments in the Rmd file to see where your answers should go.




Normal probability plots

You can use the qqnorm and qqline functions to draw normal probability plots of residuals. The functions rstandard and rstudent compute the standardized residuals and the R-student residuals, respectively.

prop = read.table("https://math.dartmouth.edu/~m50f17/propellant.csv", header=T, sep=",")
age <- prop$Age
shearS <- prop$ShearS

fitted = lm(shearS ~ age)

stdRes = rstandard(fitted)
rStuRes = rstudent(fitted)

qqnorm(rStuRes, main="Normal Probability Plot (residuals on vertical axis)") 
qqline(rStuRes)
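As a complementary numeric check of normality (a sketch, not part of the book's workflow; output not shown), a Shapiro-Wilk test can be run on the residuals:

shapiro.test(rStuRes)   # a small p-value would indicate departure from normality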

However, note that the book puts the residuals on the x-axis rather than the y-axis. To obtain that orientation, use the parameter datax as shown below. In the graph below, the x-axis shows the R-student residuals and the y-axis the theoretical quantiles (in the book the y-axis is probability rather than quantiles).

qqnorm(rStuRes, datax = TRUE ,
       main="Normal Probability Plot") 
qqline(rStuRes, datax = TRUE )

Residual vs predicted response plot

yHat <- predict(fitted)

plot (yHat, rStuRes)
abline(0,0)
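To see which observations stand out in this plot, the points can be labeled with their observation numbers (a minimal sketch using base graphics):

text(yHat, rStuRes, labels = seq_along(rStuRes), pos = 4, cex = 0.8)   # label points by index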

Observation points 5 and 6 look like potential outliers, so next we delete those points and compare the model fitted to the reduced data with the model fitted to the full data.

plot(age, shearS, xlim=c(0,30), ylim=c(1600,2700))
abline(fitted$coef, lwd = 2, col = "blue")

# remove observations 5 and 6
ageRem <- age[-c(5, 6)]
shearSRem <- shearS[-c(5, 6)]

fitted2 = lm(shearSRem ~ ageRem)
abline(fitted2$coef, lwd = 2, col = "red")
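A legend makes the comparison easier to read (a minimal sketch; the colors match the abline calls above):

legend("bottomleft", legend = c("full data", "points 5 and 6 removed"),
       col = c("blue", "red"), lwd = 2)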

A note

There is a dedicated package:

MPV: Data Sets from Montgomery, Peck and Vining's Book

that provides an easy way to load the tables from the book. To install it, type:

install.packages("MPV")

Below is an example of how to use this package. Check https://cran.r-project.org/web/packages/MPV/MPV.pdf for the table names.

library(MPV)
## 
## Attaching package: 'MPV'
## The following object is masked from 'package:datasets':
## 
##     stackloss
data(table.b1)

y <- table.b1$y
x1 <- table.b1$x1
x3 <- table.b1$x3
x8 <- table.b1$x8

y.lm <- lm(y ~ x1 + x3 + x8)
summary(y.lm)
## 
## Call:
## lm(formula = y ~ x1 + x3 + x8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6835 -1.5609 -0.1448  1.7188  4.0386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 12.136744   9.145612   1.327  0.19698   
## x1           0.001123   0.001938   0.580  0.56762   
## x3           0.159582   0.296318   0.539  0.59516   
## x8          -0.006496   0.002088  -3.112  0.00475 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.42 on 24 degrees of freedom
## Multiple R-squared:  0.5702, Adjusted R-squared:  0.5164 
## F-statistic: 10.61 on 3 and 24 DF,  p-value: 0.0001244



Question-1

Solve parts (a), (b), and (c) of Problem 4.1. In addition, answer the following.

  1. Is it possible to perform a lack-of-fit test using steps (4.20) to (4.24)?

Answer:

library("MPV")

fitted = lm(y~x8, data=table.b1)

cat("Part (a). 
    There are some minor deviations but in general it seems OK and we can say no problem with the normality assumptions.")
## Part (a). 
##     There are some minor deviations but in general it seems OK and we can say no problem with the normality assumptions.
rStuRes <- rstudent(fitted)
qqnorm(rStuRes, datax = TRUE, main = "Normal Probability Plot")
qqline(rStuRes, datax = TRUE  )

cat("Part (b). 
    Plot seems in a band and it doesn't suggest any violations of our assumptions.")
## Part (b). 
##     Plot seems in a band and it doesn't suggest any violations of our assumptions.
yHat <- predict(fitted)
plot(yHat, rStuRes)
abline(0, 0) 

cat("Part (c). 
     Plot shows a structure (trend in different parts) so we conclude that adding x2 might contribute to the model.")
## Part (c). 
##      Plot shows a structure (trend in different parts) so we conclude that adding x2 might contribute to the model.
plot(table.b1$x2, rStuRes)
abline(0, 0)

cat("Part (d). 
     The lack of fit test explained in steps (4.20) to (4.24) requires repeated observations, therefore not possible to do in this case.")
## Part (d). 
##      The lack of fit test explained in steps (4.20) to (4.24) requires repeated observations, therefore not possible to do in this case.
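The claim above can be checked directly: if the following returns 0, there are no replicated x8 values in table.b1, so the pure-error sum of squares cannot be formed (a quick sketch; output not shown):

sum(duplicated(table.b1$x8))   # 0 means no repeated observations at any x8 value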



Question-2

Chapter 4, Problem 2 all parts.

Answer:

cat("Part (a). 
    There are some deviations from the normality in the first part. First half above and second half seems to be below the normal distribution line (qq-line).")
## Part (a). 
##     There are some deviations from the normality in the first part. First half above and second half seems to be below the normal distribution line (qq-line).
fitted <- lm(y ~ x2 + x7 + x8, data=table.b1)
rStuRes <- rstudent(fitted)
qqnorm(rStuRes, datax = TRUE, main = "Normal Probability Plot")
qqline(rStuRes, datax = TRUE)

cat("Part (b). 
 This also seems in a band with no apparent violations in our assumptions from this plot.")
## Part (b). 
##  This also seems in a band with no apparent violations in our assumptions from this plot.
yhat <- predict(fitted)
plot(yhat, rStuRes)
abline(0, 0)

cat("Part (c).
    The plots of residuals of x7 shows non-constant variance (variance increasing in the first half). Plot against x2 has also some narrowing in the middle and suggests non-constant variance. On the other hand plot against x8 is a little better, although it shows some narrowing towards ends there are too few points (only two points in each end). ")
## Part (c).
##     The plots of residuals of x7 shows non-constant variance (variance increasing in the first half). Plot against x2 has also some narrowing in the middle and suggests non-constant variance. On the other hand plot against x8 is a little better, although it shows some narrowing towards ends there are too few points (only two points in each end).
plot(table.b1$x2, rStuRes)
abline(0, 0)

plot(table.b1$x7, rStuRes)
abline(0, 0)

plot(table.b1$x8, rStuRes)
abline(0, 0)

cat("Part (d).
    We can see the linear relation between y and the regressors x2 and x8. For the regressor x7 this is less apparent. ")
## Part (d).
##     We can see the linear relation between y and the regressors x2 and x8. For the regressor x7 this is less apparent.
fit.y_x2deleted <- lm(y ~ x7 + x8, data=table.b1)
res.y_x2deleted <- y - predict(fit.y_x2deleted)
fit.x2 <- lm(x2 ~ x7 + x8, data=table.b1)
res.x2 <- table.b1$x2 - predict(fit.x2)
# added-variable plot: regressor residuals on the x-axis, as is conventional
plot(res.x2, res.y_x2deleted)

fit.y_x7deleted <- lm(y ~ x2 + x8, data=table.b1)
res.y_x7deleted <- y - predict(fit.y_x7deleted)
fit.x7 <- lm(x7 ~ x2 + x8, data=table.b1)
res.x7 <- table.b1$x7 - predict(fit.x7)
plot(res.x7, res.y_x7deleted)

fit.y_x8deleted <- lm(y ~ x2 + x7, data=table.b1)
res.y_x8deleted <- y - predict(fit.y_x8deleted)
fit.x8 <- lm(x8 ~ x2 + x7, data=table.b1)
res.x8 <- table.b1$x8 - predict(fit.x8)
plot(res.x8, res.y_x8deleted)
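For comparison, the car package can draw the same added-variable (partial regression) plots in one call, assuming it is installed (output not shown):

library(car)      # assumes the car package is installed
avPlots(fitted)   # one added-variable plot per regressor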

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:MPV':
## 
##     cement
stuRes <- studres(fitted)

cat("Part (e).
    No R-Student is above our threshold 3, the observation point with 2.54 R-student residual can be considered as an outlier.  The difference between the studentized and R-student residuals is small (see below). This shows Si^2 and MSres are not significantly different than from each other and this suggest that there is no influential point with a strong  influence on variance estimation.")
## Part (e).
##     No R-Student is above our threshold 3, the observation point with 2.54 R-student residual can be considered as an outlier.  The difference between the studentized and R-student residuals is small (see below). This shows Si^2 and MSres are not significantly different than from each other and this suggest that there is no influential point with a strong  influence on variance estimation.
summary(rStuRes)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.980521 -0.444398 -0.124731  0.003299  0.680575  2.454354
summary(rStuRes-stuRes)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -4.441e-16 -3.469e-18  0.000e+00  2.776e-17  6.939e-17  4.441e-16
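A caveat on this comparison: MASS::studres computes the externally studentized (R-student) residuals, the same quantity rstudent returns, which is why the differences above are zero to machine precision. To genuinely contrast the internally and externally studentized residuals, compare rstandard with rstudent instead (a quick sketch; output not shown):

stdRes <- rstandard(fitted)   # internally studentized residuals
summary(rStuRes - stdRes)     # small differences indicate S_i^2 stays close to MSres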



Question-3

Chapter 4, Problem 19 all parts. In addition answer the following.

  1. Find the point with the largest (in absolute value) R-student residual as a potential outlier. Repeat the regression analysis after deleting that point from the observation data. Construct the probability plot and the residual vs. predicted response plot. Calculate the differences (deleted vs. full data) in the fitted coefficients, \(MS_{res}\) and \(R^2\). Comment on the differences in the plots and the values. Do you think it is an influential point? Do the differences imply any improvement?

Answer:

data(p4.19)
fitted <- lm( y ~ x1+x2+x3, data=p4.19)
stdRes <- rstandard(fitted)
rStuRes <- rstudent(fitted)

cat("Part a.
    The normality assumtions seems to be valid : ")
## Part a.
##     The normality assumtions seems to be valid :
qqnorm(rStuRes, datax = TRUE , main="Normal Probability Plot") 
qqline(rStuRes, datax = TRUE )

cat("However in the below plot we see a violation of the linearity assumption, (as the behaviour in the middle suggests that). Similar to above one of the observations even though not exactly above our cut-off value 3, it is separated from the rest (absolute value of its R-Student) and therefore can be considered as an outlier. ")
## However in the below plot we see a violation of the linearity assumption, (as the behaviour in the middle suggests that). Similar to above one of the observations even though not exactly above our cut-off value 3, it is separated from the rest (absolute value of its R-Student) and therefore can be considered as an outlier.
yHat <- predict(fitted)
plot(yHat, rStuRes)
abline(0, 0)
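We can also locate this potential outlier programmatically (a quick sketch; idx is an illustrative name); the point is deleted in part (c) below:

idx <- which.max(abs(rStuRes))   # index of the largest |R-student| residual
rStuRes[idx]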

# Only 6 repeated observations, all at x = 0,
# so we can compute the pure-error sum of squares directly.
repeatedY_atX0 <- c(133, 133, 140, 142, 145, 142)
SS_PE <- sum((repeatedY_atX0 - mean(repeatedY_atX0))^2)
SSres <- sum((p4.19$y - yHat)^2)
SS_LOF <- SSres - SS_PE

m <- 9              # number of distinct design points
p <- 4              # number of model parameters
n <- length(yHat)   # number of observations
F_lof <- (SS_LOF/(m - p))/(SS_PE/(n - m))
fQuant <- qf(0.95, m - p, n - m)

cat ("Part b.
      Fo value is ",F_lof ,"  and this is slighlt greater than the critical quantile value (at significance level 0.05)",fQuant, "
       Thus we conclude that lack-of-fit test suggest that linear model is not adequate.") 
## Part b.
##       Fo value is  11.80688   and this is slighlt greater than the critical quantile value (at significance level 0.05) 5.050329 
##        Thus we conclude that lack-of-fit test suggest that linear model is not adequate.
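As a cross-check, the same lack-of-fit F statistic can be obtained by comparing the linear fit with a model that fits one mean per distinct design point (a sketch; grp and pureFit are illustrative names, output not shown):

grp <- factor(paste(p4.19$x1, p4.19$x2, p4.19$x3))   # one level per distinct design point
pureFit <- lm(y ~ grp, data = p4.19)                 # saturated-in-groups model
anova(fitted, pureFit)                               # lack-of-fit F test with (m-p, n-m) df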
deletedData <- p4.19[-which.max(abs(rStuRes)), ]   # drop the point with the largest |R-student| residual
delFit <- lm(y~x1+x2+x3, data=deletedData)
rStuResD = rstudent(delFit)

qqnorm(rStuResD,  datax = TRUE , main="Normal Probability Plot") 
qqline(rStuResD,  datax = TRUE )

yHatD <- predict(delFit)
plot(yHatD, rStuResD)
abline(0,0)

MSres <- sum( (p4.19$y - yHat)^2)/(length(yHat)-4)
MSresD <- sum( (deletedData$y-yHatD)^2)/(length(yHatD)-4)
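The differences requested in the extra part of the question can be computed directly (a quick sketch; output not shown):

coef(fitted) - coef(delFit)                             # differences in fitted coefficients
summary(fitted)$r.squared - summary(delFit)$r.squared   # difference in R^2
MSres - MSresD                                          # difference in MSres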

summary(fitted)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = p4.19)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.518  -8.768  -1.143   7.857  20.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  134.143      3.406  39.382 2.66e-12 ***
## x1            16.875      4.506   3.745  0.00382 ** 
## x2            16.125      4.506   3.579  0.00502 ** 
## x3            10.625      4.506   2.358  0.04009 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.74 on 10 degrees of freedom
## Multiple R-squared:  0.7641, Adjusted R-squared:  0.6933 
## F-statistic:  10.8 on 3 and 10 DF,  p-value: 0.001772
summary(delFit)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = deletedData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.726  -6.339   1.387   8.468  13.468 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  131.532      2.816  46.711 4.73e-12 ***
## x1            12.306      3.881   3.171   0.0114 *  
## x2            11.556      3.881   2.977   0.0155 *  
## x3             6.056      3.881   1.560   0.1531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.916 on 9 degrees of freedom
## Multiple R-squared:  0.6452, Adjusted R-squared:  0.527 
## F-statistic: 5.456 on 3 and 9 DF,  p-value: 0.02055
cat("Part c.
     As we see from summary above R2 values dropped. MSres before was ", MSres, " 
     and after deletion it is  ", MSresD, "
    , There is a considerable drop in MSres. 
     We can say removing this observation point have some improvement in MSres, 
     but this is expected since it was a potential outlier. Howver in
     x-space removed point is not well separated from the rest (see table), thus not an influential point.")
## Part c.
##      As we see from summary above R2 values dropped. MSres before was  162.4339  
##      and after deletion it is   98.32079 
##     , There is a considerable drop in MSres. 
##      We can say removing this observation point have some improvement in MSres, 
##      but this is expected since it was a potential outlier. Howver in
##      x-space removed point is not well separated from the rest (see table), thus not an influential point.