NOTE: For your homework, download and use the template (https://math.dartmouth.edu/~m50f17/HW5.Rmd).
Read the green comments in the Rmd file to see where your answers should go.
You can use qqnorm and qqline functions to plot probability plots of residuals. The functions rstandard and rstudent calculate the standardized residuals and R-student residuals, respectively.
prop = read.table("https://math.dartmouth.edu/~m50f17/propellant.csv", header=T, sep=",")
age <- prop$Age
shearS <- prop$ShearS
fitted = lm(shearS ~ age)
stdRes = rstandard(fitted)
rStuRes = rstudent(fitted)
qqnorm(rStuRes, main="Normal Probability Plot (residuals on vertical axis)")
qqline(rStuRes)
Note, however, that the book places the residuals on the x-axis rather than the y-axis. To obtain that orientation, use the datax parameter as shown below. In the graph below, the x-axis shows the R-student residuals and the y-axis the theoretical quantiles (in the book the y-axis shows probability instead of quantiles).
qqnorm(rStuRes, datax = TRUE ,
main="Normal Probability Plot")
qqline(rStuRes, datax = TRUE )
yHat <- predict(fitted)
plot (yHat, rStuRes)
abline(0,0)
Observing that points 5 and 6 look like potential outliers, we next delete them and compare the model fitted to the reduced data with the model fitted to the full data.
plot(age, shearS, xlim=c(0,30), ylim=c(1600,2700))
abline(fitted$coef, lwd = 2, col = "blue")
ageRem <- age[-c(5, 6)]
shearSRem <- shearS[-c(5, 6)]
fitted2 = lm(shearSRem ~ ageRem)
abline(fitted2$coef, lwd = 2, col = "red")
There is a dedicated library,
MPV: Data Sets from Montgomery, Peck and Vining's Book,
that provides an easy way to load tables from the book. To install the library, type:
install.packages("MPV")
Below is an example of how to use this library. Check https://cran.r-project.org/web/packages/MPV/MPV.pdf for table names.
library(MPV)
##
## Attaching package: 'MPV'
## The following object is masked from 'package:datasets':
##
## stackloss
data(table.b1)
y <- table.b1$y
x1 <- table.b1$x1
x3 <- table.b1$x3
x8 <- table.b1$x8
y.lm <- lm(y ~ x1 + x3 + x8)
summary(y.lm)
##
## Call:
## lm(formula = y ~ x1 + x3 + x8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6835 -1.5609 -0.1448 1.7188 4.0386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.136744 9.145612 1.327 0.19698
## x1 0.001123 0.001938 0.580 0.56762
## x3 0.159582 0.296318 0.539 0.59516
## x8 -0.006496 0.002088 -3.112 0.00475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.42 on 24 degrees of freedom
## Multiple R-squared: 0.5702, Adjusted R-squared: 0.5164
## F-statistic: 10.61 on 3 and 24 DF, p-value: 0.0001244
Solve parts (a), (b), and (c) of Problem 4.1. In addition, answer the following.
library("MPV")
fitted = lm(y~x8, data=table.b1)
cat("Part (a).
There are some minor deviations but in general it seems OK and we can say no problem with the normality assumptions.")
## Part (a).
## There are some minor deviations but in general it seems OK and we can say no problem with the normality assumptions.
rStuRes <- rstudent(fitted)
qqnorm(rStuRes, datax = TRUE, main = "Normal Probability Plot")
qqline(rStuRes, datax = TRUE )
cat("Part (b).
Plot seems in a band and it doesn't suggest any violations of our assumptions.")
## Part (b).
## Plot seems in a band and it doesn't suggest any violations of our assumptions.
yHat <- predict(fitted)
plot(yHat, rStuRes)
abline(0, 0)
cat("Part (c).
Plot shows a structure (trend in different parts) so we conclude that adding x2 might contribute to the model.")
## Part (c).
## Plot shows a structure (trend in different parts) so we conclude that adding x2 might contribute to the model.
plot(table.b1$x2, rStuRes)
abline(0, 0)
cat("Part (d).
The lack of fit test explained in steps (4.20) to (4.24) requires repeated observations, therefore not possible to do in this case.")
## Part (d).
## The lack of fit test explained in steps (4.20) to (4.24) requires repeated observations, therefore not possible to do in this case.
Chapter 4, Problem 2, all parts.
cat("Part (a).
There are some deviations from the normality in the first part. First half above and second half seems to be below the normal distribution line (qq-line).")
## Part (a).
## There are some deviations from the normality in the first part. First half above and second half seems to be below the normal distribution line (qq-line).
fitted <- lm(y ~ x2 + x7 + x8, data=table.b1)
rStuRes <- rstudent(fitted)
qqnorm(rStuRes, datax = TRUE, main = "Normal Probability Plot")
qqline(rStuRes, datax = TRUE)
cat("Part (b).
This also seems in a band with no apparent violations in our assumptions from this plot.")
## Part (b).
## This also seems in a band with no apparent violations in our assumptions from this plot.
yhat <- predict(fitted)
plot(yhat, rStuRes)
abline(0, 0)
cat("Part (c).
The plots of residuals of x7 shows non-constant variance (variance increasing in the first half). Plot against x2 has also some narrowing in the middle and suggests non-constant variance. On the other hand plot against x8 is a little better, although it shows some narrowing towards ends there are too few points (only two points in each end). ")
## Part (c).
## The plots of residuals of x7 shows non-constant variance (variance increasing in the first half). Plot against x2 has also some narrowing in the middle and suggests non-constant variance. On the other hand plot against x8 is a little better, although it shows some narrowing towards ends there are too few points (only two points in each end).
plot(table.b1$x2, rStuRes)
abline(0, 0)
plot(table.b1$x7, rStuRes)
abline(0, 0)
plot(table.b1$x8, rStuRes)
abline(0, 0)
cat("Part (d).
We can see the linear relation between y and the regressors x2 and x8. For the regressor x7 this is less apparent. ")
## Part (d).
## We can see the linear relation between y and the regressors x2 and x8. For the regressor x7 this is less apparent.
fit.y_x2deleted <- lm(y ~ x7 + x8, data=table.b1)
res.y_x2deleted <- y - predict(fit.y_x2deleted)
fit.x2 <- lm(x2 ~ x7 + x8, data=table.b1)
res.x2 <- table.b1$x2 - predict(fit.x2)
plot(res.y_x2deleted, res.x2)
fit.y_x7deleted <- lm(y ~ x2 + x8, data=table.b1)
res.y_x7deleted <- y - predict(fit.y_x7deleted)
fit.x7 <- lm(x7 ~ x2 + x8, data=table.b1)
res.x7 <- table.b1$x7 - predict(fit.x7)
plot(res.y_x7deleted, res.x7)
fit.y_x8deleted <- lm(y ~ x2 + x7, data=table.b1)
res.y_x8deleted <- y - predict(fit.y_x8deleted)
fit.x8 <- lm(x8 ~ x2 + x7, data=table.b1)
res.x8 <- table.b1$x8 - predict(fit.x8)
plot(res.y_x8deleted, res.x8 )
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:MPV':
##
## cement
stuRes <- studres(fitted)
cat("Part (e).
No R-Student is above our threshold 3, the observation point with 2.54 R-student residual can be considered as an outlier. The difference between the studentized and R-student residuals is small (see below). This shows Si^2 and MSres are not significantly different than from each other and this suggest that there is no influential point with a strong influence on variance estimation.")
## Part (e).
## No R-Student is above our threshold 3, the observation point with 2.54 R-student residual can be considered as an outlier. The difference between the studentized and R-student residuals is small (see below). This shows Si^2 and MSres are not significantly different than from each other and this suggest that there is no influential point with a strong influence on variance estimation.
summary(rStuRes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.980521 -0.444398 -0.124731 0.003299 0.680575 2.454354
summary(rStuRes-stuRes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.441e-16 -3.469e-18 0.000e+00 2.776e-17 6.939e-17 4.441e-16
Chapter 4, Problem 19, all parts. In addition, answer the following.
data(p4.19)
fitted <- lm( y ~ x1+x2+x3, data=p4.19)
stdRes <- rstandard(fitted)
rStuRes <- rstudent(fitted)
cat("Part a.
The normality assumtions seems to be valid : ")
## Part a.
## The normality assumtions seems to be valid :
qqnorm(rStuRes, datax = TRUE , main="Normal Probability Plot")
qqline(rStuRes, datax = TRUE )
cat("However in the below plot we see a violation of the linearity assumption, (as the behaviour in the middle suggests that). Similar to above one of the observations even though not exactly above our cut-off value 3, it is separated from the rest (absolute value of its R-Student) and therefore can be considered as an outlier. ")
## However in the below plot we see a violation of the linearity assumption, (as the behaviour in the middle suggests that). Similar to above one of the observations even though not exactly above our cut-off value 3, it is separated from the rest (absolute value of its R-Student) and therefore can be considered as an outlier.
yHat <- predict(fitted)
plot(yHat, rStuRes)
abline(0, 0)
# Only 6 observations are repeated, all at the center point x1 = x2 = x3 = 0,
# so we can compute the pure-error sum of squares directly.
repeatedY_atX0 <- c(133, 133, 140, 142, 145, 142)
SS_PE <- sum((repeatedY_atX0 - mean(repeatedY_atX0))^2)
SSres <- sum((p4.19$y - yHat)^2)
SS_LOF <- SSres - SS_PE
m <- 9                  # number of distinct design points
p <- 4                  # number of model parameters
n <- length(yHat)
F_lof <- (SS_LOF / (m - p)) / (SS_PE / (n - m))
fQuant <- qf(0.95, m - p, n - m)
cat ("Part b.
Fo value is ",F_lof ," and this is slighlt greater than the critical quantile value (at significance level 0.05)",fQuant, "
Thus we conclude that lack-of-fit test suggest that linear model is not adequate.")
## Part b.
## Fo value is 11.80688 and this is slighlt greater than the critical quantile value (at significance level 0.05) 5.050329
## Thus we conclude that lack-of-fit test suggest that linear model is not adequate.
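The manual computation above can be cross-checked without hand-computing SS_PE: comparing the linear model against a model that fits one mean per distinct design point yields the same lack-of-fit F test via anova(). A sketch on a toy dataset (the data here are made up for illustration):

```r
# Lack-of-fit via model comparison: the "saturated" model absorbs all
# lack of fit, leaving only pure error, so the anova F is the LOF test.
x <- c(1, 1, 2, 2, 3, 3)                 # replicates at each x level
y <- c(1.1, 0.9, 2.5, 2.3, 2.6, 2.8)
linear    <- lm(y ~ x)
saturated <- lm(y ~ factor(x))           # one mean per distinct x
anova(linear, saturated)                 # F test for lack of fit
```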
deletedData <- p4.19[-which.max(rStuRes), ]
delFit <- lm(y~x1+x2+x3, data=deletedData)
rStuResD = rstudent(delFit)
qqnorm(rStuResD, datax = TRUE , main="Normal Probability Plot")
qqline(rStuResD, datax = TRUE )
yHatD <- predict(delFit)
plot(yHatD, rStuResD)
abline(0,0)
MSres <- sum( (p4.19$y - yHat)^2)/(length(yHat)-4)
MSresD <- sum( (deletedData$y-yHatD)^2)/(length(yHatD)-4)
summary(fitted)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = p4.19)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.518 -8.768 -1.143 7.857 20.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.143 3.406 39.382 2.66e-12 ***
## x1 16.875 4.506 3.745 0.00382 **
## x2 16.125 4.506 3.579 0.00502 **
## x3 10.625 4.506 2.358 0.04009 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.74 on 10 degrees of freedom
## Multiple R-squared: 0.7641, Adjusted R-squared: 0.6933
## F-statistic: 10.8 on 3 and 10 DF, p-value: 0.001772
summary(delFit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = deletedData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.726 -6.339 1.387 8.468 13.468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131.532 2.816 46.711 4.73e-12 ***
## x1 12.306 3.881 3.171 0.0114 *
## x2 11.556 3.881 2.977 0.0155 *
## x3 6.056 3.881 1.560 0.1531
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.916 on 9 degrees of freedom
## Multiple R-squared: 0.6452, Adjusted R-squared: 0.527
## F-statistic: 5.456 on 3 and 9 DF, p-value: 0.02055
cat("Part c.
As we see from summary above R2 values dropped. MSres before was ", MSres, "
and after deletion it is ", MSresD, "
, There is a considerable drop in MSres.
We can say removing this observation point have some improvement in MSres,
but this is expected since it was a potential outlier. Howver in
x-space removed point is not well separated from the rest (see table), thus not an influential point.")
## Part c.
## As we see from summary above R2 values dropped. MSres before was 162.4339
## and after deletion it is 98.32079
## , There is a considerable drop in MSres.
## We can say removing this observation point have some improvement in MSres,
## but this is expected since it was a potential outlier. Howver in
## x-space removed point is not well separated from the rest (see table), thus not an influential point.
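The "not well separated in x-space" argument can be quantified with leverage: hatvalues() returns the diagonal h_ii of the hat matrix, and a common rule of thumb flags points with h_ii > 2p/n as leverage points. A sketch on simulated data (the dataset is made up; for the homework, apply hatvalues to the model fitted on p4.19):

```r
# Leverage measures how far an observation sits from the rest in x-space.
set.seed(1)
x <- c(rnorm(10), 10)            # last observation placed far out in x-space
y <- 2 * x + rnorm(11)
h <- hatvalues(lm(y ~ x))
which(h > 2 * 2 / 11)            # rule of thumb h_ii > 2p/n flags obs 11
```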