NOTE: For your homework, download and use the template (https://math.dartmouth.edu/~m50f17/HW6.Rmd).

Read the green comments in the rmd file to see where your answers should go.




Let's first look at the scatter plot for the windmill data and visually check the straight-line fit.

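# Read the windmill data and plot DC output against wind velocity
windmill <- read.table("https://math.dartmouth.edu/~m50f17/windmill.csv", header=TRUE)
plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current")
# Fit the straight-line model and overlay it on the scatter plot
fit <- lm(DC~velocity, data = windmill)
abline(fit$coefficients, col="red")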

The summary statistics are below; \(R^2\) is about 0.87. The residual plot that follows suggests that the relation might be nonlinear. Looking at the scatter diagram above, one might think the straight-line model is adequate; however, the residual plot amplifies the nonlinearity. Why? Can we also see this by looking carefully at the scatter plot above?

summary(fit)
## 
## Call:
## lm(formula = DC ~ velocity, data = windmill)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59869 -0.14099  0.06059  0.17262  0.32184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.13088    0.12599   1.039     0.31    
## velocity     0.24115    0.01905  12.659 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.869 
## F-statistic: 160.3 on 1 and 23 DF,  p-value: 7.546e-12
plot(fitted.values(fit), rstudent(fit), xlab = "fitted values", ylab = "R-Student residuals", main = "Windmill - Residual Plot")
abline(c(0,0), col="red")
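
One way to see why a residual plot makes curvature easier to spot: the fit removes the linear trend, so the residuals are drawn on a much smaller vertical scale than the raw response, and what remains is mostly the systematic departure from the line. The sketch below illustrates this on simulated data. The curve 3 - 7/x, the noise level, and all variable names are assumptions chosen for illustration; these are not the windmill measurements.

```r
# Simulated concave data for illustration only; NOT the windmill measurements.
set.seed(1)
x <- seq(2, 10, length.out = 25)
y <- 3 - 7 / x + rnorm(length(x), sd = 0.05)

line_fit <- lm(y ~ x)
res <- resid(line_fit)

# The residuals span a much narrower range than the response, so the same
# curvature fills far more of the plotting window in a residual plot.
spread_y   <- diff(range(y))
spread_res <- diff(range(res))
print(c(response = spread_y, residuals = spread_res))

# For a concave relation the fitted line lies above the data at both ends
# and below the data in the middle of the x range.
ends_and_middle <- res[c(1, 13, 25)]
print(ends_and_middle)
```

The same two quantities can be computed for the windmill fit with diff(range(windmill$DC)) and diff(range(resid(fit))).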

Also note that there appears to be a potential outlier; however, this might change once we fix the model. Visually, the point seems consistent with the rest of the data. We start by fitting a quadratic model.

fit2 <- lm(DC~poly(velocity, degree = 2), data = windmill)
summary(fit2)
## 
## Call:
## lm(formula = DC ~ poly(velocity, degree = 2), data = windmill)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26347 -0.02537  0.01264  0.03908  0.19903 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.60960    0.02453  65.605  < 2e-16 ***
## poly(velocity, degree = 2)1  2.98825    0.12267  24.359  < 2e-16 ***
## poly(velocity, degree = 2)2 -0.97493    0.12267  -7.947 6.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1227 on 22 degrees of freedom
## Multiple R-squared:  0.9676, Adjusted R-squared:  0.9646 
## F-statistic: 328.3 on 2 and 22 DF,  p-value: < 2.2e-16
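
Note that poly() uses an orthogonal polynomial basis by default, so the two slope estimates in the summary above are not the coefficients of velocity and velocity^2. Passing raw = TRUE gives the coefficients on the raw powers, while the fitted values are the same either way. A minimal sketch on simulated data (the data and the names x, y, fit_orth, and fit_raw are assumptions for illustration):

```r
# Simulated data for illustration only; NOT the windmill measurements.
set.seed(2)
x <- seq(2, 10, length.out = 25)
y <- 0.1 + 0.5 * x - 0.03 * x^2 + rnorm(length(x), sd = 0.1)

fit_orth <- lm(y ~ poly(x, degree = 2))              # orthogonal basis (default)
fit_raw  <- lm(y ~ poly(x, degree = 2, raw = TRUE))  # raw powers x and x^2

# Different parameterizations, so the coefficient estimates differ ...
print(coef(fit_orth))
print(coef(fit_raw))

# ... but both describe the same fitted quadratic.
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(same_fit)
```

The orthogonal basis is usually better conditioned numerically; the raw form is easier to read when the individual coefficients need to be interpreted.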
plot(windmill$velocity, windmill$DC, xlab = "wind velocity", ylab = "DC current")
 
lines(sort(windmill$velocity), fitted(fit2)[order(windmill$velocity)], col='red') 

This seems to capture the curvature in the data; however, the application domain suggests using a model of the form \[ y = \beta_0 + \beta_1 \frac{1}{x} + \varepsilon \] Note that there no longer appears to be a potential outlier in the new model.

velRep <- 1/windmill$velocity  # reciprocal of wind velocity
DC <- windmill$DC

plot(velRep, windmill$DC, xlab = "1/velocity", ylab = "DC current")
fit3 <- lm(DC~velRep)
abline(fit3$coefficients, col="red")

summary(fit3)
## 
## Call:
## lm(formula = DC ~ velRep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20547 -0.04940  0.01100  0.08352  0.12204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9789     0.0449   66.34   <2e-16 ***
## velRep       -6.9345     0.2064  -33.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09417 on 23 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9792 
## F-statistic:  1128 on 1 and 23 DF,  p-value: < 2.2e-16
plot(fitted.values(fit3), rstudent(fit3), xlab = "fitted values", ylab = "Studentized residuals", main = "Residuals - reciprocal model")
abline(c(0,0), col="red")
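
The three fits can be compared side by side by extracting the residual standard error and \(R^2\) from each summary. The helper compare_fits() below is hypothetical (it is not part of the course code), and the data are simulated for illustration. With the objects defined above, it could be called as compare_fits(list(linear = fit, quadratic = fit2, reciprocal = fit3)).

```r
# Hypothetical helper: tabulate residual standard error and R^2 for several lm fits.
compare_fits <- function(fits) {
  data.frame(
    model     = names(fits),
    sigma     = sapply(fits, function(f) summary(f)$sigma),
    r_squared = sapply(fits, function(f) summary(f)$r.squared),
    row.names = NULL
  )
}

# Simulated data for illustration only; NOT the windmill measurements.
set.seed(3)
x <- seq(2, 10, length.out = 25)
y <- 3 - 7 / x + rnorm(length(x), sd = 0.1)

fits <- list(
  linear     = lm(y ~ x),
  quadratic  = lm(y ~ poly(x, degree = 2)),
  reciprocal = lm(y ~ I(1 / x))
)
comparison <- compare_fits(fits)
print(comparison)
```

Comparing \(R^2\) across these models is reasonable here because all of them use the same untransformed response; only the regressor changes.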




Question-1

Recall that the phytoplankton population data are available at: https://math.dartmouth.edu/~m50f17/phytoplankton.csv

where the headers include pop, subs1, and subs2.

  1. Plot the scatter diagram for pop ~ subs2. Do you think a straight line model is adequate? Fit a straight line model and support your argument with summary statistics.

  2. Would you suggest using the Box-Cox method? If not, explain why; if so, apply the method and demonstrate the improvement.

  3. An analyst suggests using the following model \[ y = \beta_0 + \beta_1 (x-4.5)^2 \] Using transformations, fit a simple linear regression model. Plot the scatter diagram and the fitted curve (Note: it is not a straight line in this case). Compare \(MS_{res}\), \(R^2\), and the R-student residual plots with those of the model in part 1.

  4. Construct the normal probability plot for part 3. Is there a problem with the normality assumption? If so, determine the problem (heavy-tailed, light-tailed, or something else).
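
A model such as the one in item 3 is linear in the parameters once the regressor is transformed: with \(z = (x-4.5)^2\) it becomes \(y = \beta_0 + \beta_1 z\), which lm() can fit directly. The sketch below shows only the mechanics, on noise-free simulated data. The values of b0 and b1, the grid, and all variable names are assumptions for illustration; it does not use the phytoplankton data.

```r
# Simulated, noise-free data for illustration only; NOT the phytoplankton data.
b0 <- 2.0    # assumed intercept
b1 <- -0.3   # assumed coefficient of (x - 4.5)^2
x <- seq(1, 8, by = 0.5)
y <- b0 + b1 * (x - 4.5)^2

# Transform the regressor, then fit an ordinary straight-line model in z.
z <- (x - 4.5)^2
fit_z <- lm(y ~ z)
print(coef(fit_z))

# The fit is a straight line in z but a parabola when drawn against x,
# so the curve is evaluated on a fine grid of x values.
x_grid <- seq(min(x), max(x), length.out = 200)
y_grid <- predict(fit_z, newdata = data.frame(z = (x_grid - 4.5)^2))
plot(x, y, xlab = "x", ylab = "y")
lines(x_grid, y_grid, col = "red")
```

Predicting on a grid avoids having to sort the observations before calling lines().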

Answer:

pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2


# Straight-line fit of pop on subs2 (named popFit to avoid masking the fitted() function)
popFit <- lm(pop~subs2)
plot(subs2, pop)
abline(popFit$coefficients)