NOTE: For your homework download and use the template (https://math.dartmouth.edu/~m50f17/HW6.Rmd)
Read the green comments in the rmd file to see where your answers should go.
windmill <- read.table("https://math.dartmouth.edu/~m50f17/windmill.csv", header=T)
plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current")
fit <- lm(DC~velocity, data = windmill)
abline(fit$coefficients, col="red")
summary(fit)
##
## Call:
## lm(formula = DC ~ velocity, data = windmill)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59869 -0.14099 0.06059 0.17262 0.32184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13088 0.12599 1.039 0.31
## velocity 0.24115 0.01905 12.659 7.55e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.869
## F-statistic: 160.3 on 1 and 23 DF, p-value: 7.546e-12
plot(fitted.values(fit), rstudent(fit), xlab = "y", ylab = "R-Student residuals", main = "Windmill - Residual Plot")
abline(c(0,0), col="red")
fit2 <- lm(DC~poly(velocity, degree = 2), data = windmill)
summary(fit2)
##
## Call:
## lm(formula = DC ~ poly(velocity, degree = 2), data = windmill)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26347 -0.02537 0.01264 0.03908 0.19903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60960 0.02453 65.605 < 2e-16 ***
## poly(velocity, degree = 2)1 2.98825 0.12267 24.359 < 2e-16 ***
## poly(velocity, degree = 2)2 -0.97493 0.12267 -7.947 6.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1227 on 22 degrees of freedom
## Multiple R-squared: 0.9676, Adjusted R-squared: 0.9646
## F-statistic: 328.3 on 2 and 22 DF, p-value: < 2.2e-16
plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current")
lines(sort(windmill$velocity), fitted(fit2)[order(windmill$velocity)], col='red')
velRep = 1/windmill$velocity
DC <- windmill$DC
plot(velRep, windmill$DC, xlab = "1/velocity", ylab = "DC current")
fit3 <- lm(DC~velRep)
abline(fit3$coefficients, col="red")
summary(fit3)
##
## Call:
## lm(formula = DC ~ velRep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20547 -0.04940 0.01100 0.08352 0.12204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9789 0.0449 66.34 <2e-16 ***
## velRep -6.9345 0.2064 -33.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09417 on 23 degrees of freedom
## Multiple R-squared: 0.98, Adjusted R-squared: 0.9792
## F-statistic: 1128 on 1 and 23 DF, p-value: < 2.2e-16
plot(fitted.values(fit), rstudent(fit3), xlab = "fitted values", ylab = "Studentized residuals", main = "Residuals - reciprocal model")
abline(c(0,0), col="red")
Recall the phytoplankton population data is given at : https://math.dartmouth.edu/~m50f17/phytoplankton.csv
where headers are
Plot the scatter diagram for pop ~ subs2. Do you think a straight line model is adequate? Fit a straight line model and support your argument with summary statistics.
Do you suggest to use Box-Cox method? If not explain, if so apply the method and demonstrate the improvement.
An analyst suggests to use the following model \[ y = \beta_0 + \beta_1 (x-4.5)^2 \] Using transformations, fit a simple linear regression model. Plot the scatter diagram and fitted curve (Note: it is not a straight line in this case). Compare \(MS_{res}\), \(R^2\) and the R-student residual plots with the model in part a.
Construct the probability plot for part (c). Is there a problem with the normality assumption? If so determine the problem (heavy tailed, light tailed, or something else)
pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2
fitted=lm(pop~subs2)
plot (subs2, pop)
abline(fitted$coefficients)
qqnorm(rstudent(fitted), datax = TRUE ,
main="Normal Probability Plot")
qqline(rstudent(fitted), datax = TRUE )
summary(fitted)
##
## Call:
## lm(formula = pop ~ subs2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.645 -12.577 3.176 16.593 33.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201.250 8.672 23.206 < 2e-16 ***
## subs2 -13.712 1.759 -7.793 1.06e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.14 on 148 degrees of freedom
## Multiple R-squared: 0.291, Adjusted R-squared: 0.2862
## F-statistic: 60.74 on 1 and 148 DF, p-value: 1.063e-12
cat("Part a.
Straight line model doesn't seem to be adequate which can be seen visually, also verly low R2.")
## Part a.
## Straight line model doesn't seem to be adequate which can be seen visually, also verly low R2.
library(MASS)
bc <- boxcox(pop ~ subs2, lambda = seq(1,6,0.1))
lambda <- bc$x[which.max(bc$y)]
fittedBC=lm(pop^lambda ~subs2)
summary(fittedBC)
##
## Call:
## lm(formula = pop^lambda ~ subs2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212157 -79978 -1759 79359 208261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 610951 43847 13.934 < 2e-16 ***
## subs2 -62380 8896 -7.013 7.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101800 on 148 degrees of freedom
## Multiple R-squared: 0.2494, Adjusted R-squared: 0.2443
## F-statistic: 49.18 on 1 and 148 DF, p-value: 7.755e-11
plot (subs2, pop^lambda)
abline(fittedBC$coefficients)
plot(predict(fitted), rstudent(fitted) )
plot(predict(fittedBC), rstudent(fittedBC) )
cat("Part b.
Both normality and constant variance seems violated from the plots, however these might be due to fitting a straight line model to a nonlinear relation. After trying Box-Cox, even though R2 improved significantly we can't see improvement in normality, also constant variance assumption seems worse after boxcox, and still not helpful for the nonlinear relation.")
## Part b.
## Both normality and constant variance seems violated from the plots, however these might be due to fitting a straight line model to a nonlinear relation. After trying Box-Cox, even though R2 improved significantly we can't see improvement in normality, also constant variance assumption seems worse after boxcox, and still not helpful for the nonlinear relation.
s2Rep = (subs2-4.5)^2
popT <- pop
fit2 <- lm ( popT ~ s2Rep^2 )
summary(fit2)
##
## Call:
## lm(formula = popT ~ s2Rep^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.484 -5.751 2.395 7.951 20.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154.7877 1.3857 111.70 <2e-16 ***
## s2Rep -20.1264 0.9774 -20.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.16 on 148 degrees of freedom
## Multiple R-squared: 0.7413, Adjusted R-squared: 0.7395
## F-statistic: 424.1 on 1 and 148 DF, p-value: < 2.2e-16
plot (subs2, popT)
lines(sort(subs2), fitted(fit2)[order(subs2)], col='red')
plot(fitted.values(fit2), rstudent(fit2), xlab = "y", ylab = "R-Student residuals")
abline(c(0,0), col="red")
summary(fit2)
##
## Call:
## lm(formula = popT ~ s2Rep^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.484 -5.751 2.395 7.951 20.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154.7877 1.3857 111.70 <2e-16 ***
## s2Rep -20.1264 0.9774 -20.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.16 on 148 degrees of freedom
## Multiple R-squared: 0.7413, Adjusted R-squared: 0.7395
## F-statistic: 424.1 on 1 and 148 DF, p-value: < 2.2e-16
cat("Part c.
With the new model, plots suggests improvement in the fitting, significant increase in R2 and improvement in constant variance issues. However normality is still a problem. See below")
## Part c.
## With the new model, plots suggests improvement in the fitting, significant increase in R2 and improvement in constant variance issues. However normality is still a problem. See below
qqnorm(rstudent(fit2), datax = TRUE ,
main="Normal Probability Plot")
qqline(rstudent(fit2), datax = TRUE )
cat("Part d.
Normality assumption seems to be violated. It is heavy tailed on the left and light tailed on the right. ")
## Part d.
## Normality assumption seems to be violated. It is heavy tailed on the left and light tailed on the right.
Chapter 5, Problem 2 all parts.
library(MPV)
##
## Attaching package: 'MPV'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## stackloss
data(p5.2)
pressure <-p5.2$vapor
temperature <- p5.2$temp
plot(temperature, pressure)
cat("Part a.
A Clear nonlinear relations.")
## Part a.
## A Clear nonlinear relations.
fitT <- lm( pressure ~ temperature)
plot(temperature,pressure)
abline(fitT$coefficients)
plot( predict(fitT), rstudent(fitT), main = "R Student Residuals vs y")
abline(c(0,0), col="red")
summary(fitT)
##
## Call:
## lm(formula = pressure ~ temperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.63 -92.10 -37.66 64.33 222.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1956.258 363.807 -5.377 0.000446 ***
## temperature 6.686 1.121 5.964 0.000212 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117.6 on 9 degrees of freedom
## Multiple R-squared: 0.7981, Adjusted R-squared: 0.7756
## F-statistic: 35.57 on 1 and 9 DF, p-value: 0.0002117
cat("Part b.
We can see nonlinear relation in the residuals also, normality assumption doesn't seem to be valid. Straight line model is inadequate overall.")
## Part b.
## We can see nonlinear relation in the residuals also, normality assumption doesn't seem to be valid. Straight line model is inadequate overall.
tInv <- (1/temperature)
logPressure <- log(pressure)
fitTInv <- lm( logPressure ~ tInv)
plot(tInv,logPressure)
abline(fitTInv$coefficients)
plot( predict(fitTInv), rstudent(fitTInv), main = "R Student Residuals vs y (Temp Inv)")
abline(c(0,0), col="red")
summary(fitTInv)
##
## Call:
## lm(formula = logPressure ~ tInv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.030999 -0.013107 0.004863 0.016735 0.021260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.061e+01 6.325e-02 325.8 <2e-16 ***
## tInv -5.201e+03 2.014e+01 -258.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02067 on 9 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 6.672e+04 on 1 and 9 DF, p-value: < 2.2e-16
cat("Part c.
Fitting seems better with this model, significant improvement in R2 and overall fit visually. Non linear relation is now seems to be tranformed into a linear relation. However residuals doesn't seem to be normal and independently distributed, seems correllated (however this might be result of the nonlinear relation even after the transformation). ")
## Part c.
## Fitting seems better with this model, significant improvement in R2 and overall fit visually. Non linear relation is now seems to be tranformed into a linear relation. However residuals doesn't seem to be normal and independently distributed, seems correllated (however this might be result of the nonlinear relation even after the transformation).
Chapter 5, Problem 3 all parts. Note: In part (c) consider natural log of the minutes number of bacteria.
data(p5.3)
numBacteria <- p5.3$bact
minutes <- p5.3$min
plot(minutes, numBacteria)
cat("Part a.
There might be a slightly nonlinear relation between minutes and number of bacteria. Which will be more apparent below. ")
## Part a.
## There might be a slightly nonlinear relation between minutes and number of bacteria. Which will be more apparent below.
fit <- lm( numBacteria ~ minutes )
plot(minutes , numBacteria)
abline(fit$coefficients)
plot( predict(fit), rstudent(fit), main = "R Student Residuals vs y")
abline(c(0,0), col="red")
summary(fit)
##
## Call:
## lm(formula = numBacteria ~ minutes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.323 -9.890 -7.323 2.463 45.282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142.20 11.26 12.627 1.81e-07 ***
## minutes -12.48 1.53 -8.155 9.94e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.3 on 10 degrees of freedom
## Multiple R-squared: 0.8693, Adjusted R-squared: 0.8562
## F-statistic: 66.51 on 1 and 10 DF, p-value: 9.944e-06
qqnorm(rstudent(fit), datax = TRUE ,
main="Normal Probability Plot")
qqline(rstudent(fit), datax = TRUE )
cat("Part b.
There is a clear outlier. Normality assumption seems violated, nonlinear pattern more visible in residual plot. ")
## Part b.
## There is a clear outlier. Normality assumption seems violated, nonlinear pattern more visible in residual plot.
fitted <- lm( log(numBacteria) ~ (minutes) )
plot( predict(fitted), rstudent(fitted), main = "R Student Residuals vs y")
abline(c(0,0), col="red")
summary(fitted)
##
## Call:
## lm(formula = log(numBacteria) ~ (minutes))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.184303 -0.083994 0.001453 0.072825 0.206246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.33878 0.07409 72.05 6.47e-15 ***
## minutes -0.23617 0.01007 -23.46 4.49e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1204 on 10 degrees of freedom
## Multiple R-squared: 0.9822, Adjusted R-squared: 0.9804
## F-statistic: 550.3 on 1 and 10 DF, p-value: 4.489e-10
plot((minutes), log(numBacteria))
abline(fitted$coefficients)
qqnorm(rstudent(fitted), datax = TRUE ,
main="Normal Probability Plot")
qqline(rstudent(fitted), datax = TRUE )
cat("Part c.
The plots and summary shows much better fitting with high R2. Residual plot seems fine and fitting is satisfactory. Normality assumption also seems fine.")
## Part c.
## The plots and summary shows much better fitting with high R2. Residual plot seems fine and fitting is satisfactory. Normality assumption also seems fine.