Math 50 Fall 2017, Homework #3

NOTE: For your homework download and use the template (https://math.dartmouth.edu/~m50f17/HW3.Rmd)

Read the green comments in the rmd file to see where your answers should go.

Question-1

Is maximum-likelihood estimator \(\tilde{\sigma}^2\) of \(\sigma^2\) an unbiased estimator ? Verify your answer. Comment on the change of the value of \[\mathbb{E}(\tilde{\sigma}^2) - {\sigma}^2\] as \(n\) goes to infinity.

Answer:

We know that

\[ \tilde{\sigma}^2 = \frac{ \sum_{i=1}^{n}(y_i - \tilde\beta_0 - \tilde\beta_1 x )^2}{n}\] and, moreover \(\tilde\beta_0=\hat\beta_0\) and \(\tilde\beta_1=\hat\beta_1\) since LSE and MLE has same coefficients estimations.

Combining these yield \[ \begin{aligned} \mathbb{E} [\tilde{\sigma}^2] & = \frac{ \mathbb{E} [ \sum_{i=1}^{n}(y_i - \tilde\beta_0 - \tilde\beta_1 x )^2 ] }{n} \\ &= \frac{ \mathbb{E} [ \sum_{i=1}^{n}(y_i - \hat\beta_0 - \hat\beta_1 x )^2 ] }{n} \\ &= \frac{ \mathbb{E} [ SS_{Res} ] }{n} \\ &= \frac{ \mathbb{E} [ (n-2) \hat\sigma^2 ] }{n} \\ \end{aligned} \] since \(\hat\sigma^2 = \frac{SS_{Res}}{n-2}\). We can now see

\[ \mathbb{E} [\tilde{\sigma}^2] = \frac{ (n-2) }{n}\mathbb{E} [ \hat\sigma^2 ] = \frac{ (n-2) }{n} \sigma^2 \]

since \(\hat\sigma^2\) is an unbiased estimator of \(\sigma^2\). We conclude that \(\tilde{\sigma}^2\) is a biased estimator with

\[\mathbb{E}(\tilde{\sigma}^2) - {\sigma}^2 = \left(\frac{ n-2 }{n} -1 \right) \sigma^2 = \frac{ -2 }{n} \sigma^2 \]

which goes to \(0\) as \(n \rightarrow\infty\). Therefore bias is negligible for large \(n\).

Question-2

Consider the shear strength vs age relation using the propellant data.

Recalculate the coefficients of the fitted linear regression model using the vector equations we obtained.
Suppose that the expectation of the initial shear strength is known to be 2400. Write the corresponding model (should involve only one parameter \(\beta_1\)). Calculate 95% CI on \(\beta_1\).

Answer:

# Computation part of the answer : 

prop<-read.table("https://math.dartmouth.edu/~m50f17/propellant.csv", header=T, sep=",")
shearS<-prop$ShearS 
age<-prop$Age

X <- cbind(1,age)
betaHat <- solve( t(X) %*% X ) %*% t(X) %*% shearS

# part (a)
cat ( " Part(a)
 Fitted Coefficients =  " , betaHat   , " \n\n")

##  Part(a)
##  Fitted Coefficients =   2627.822 -37.15359

# part (b)
# We can use a vertical shift and use the no-intercept model 
# Note that vertical shifting will not change the beta1 and its statistics. 
# Therefore after we shift down by 2400, we simply work on a no-intercept model

shearS_Shifted <- shearS - 2400

beta1Hat = sum( age * shearS_Shifted) / sum(age^2)
yHat <- beta1Hat * age 

# We use CI for no-intercept model
dof = 19 

MSres = sum((yHat - shearS_Shifted)^2) / dof
seBeta1 = sqrt(MSres / sum(age^2) )

tQ = qt( .975, dof)
lowerCI = beta1Hat - tQ * seBeta1
upperCI = beta1Hat + tQ * seBeta1

cat ( " Part (b) 
 Lower and upper bound of CI are :  " ,  lowerCI , ",  " , upperCI , " 
 \n Note that these numbers make sense. Since we forced intercept to be 2400 (which is 
 lower than the estimate of the intercept-model) the regression analysis response is 
 to raise the other end, and therefore increase the slope. "
 )

##  Part (b) 
##  Lower and upper bound of CI are :   -28.64285 ,   -19.63201  
##  
##  Note that these numbers make sense. Since we forced intercept to be 2400 (which is 
##  lower than the estimate of the intercept-model) the regression analysis response is 
##  to raise the other end, and therefore increase the slope.

fitted = lm(shearS ~ age)
plot(age, shearS, main="Shear Strength vs. Age", pch=16, xlim=c(0,30), ylim=c(1600,2700))
abline(fitted$coef, lwd = 2, col = "blue")
abline( c(2400,beta1Hat), lwd = 2 , col = 'red')

Example : Phytoplankton Population

A scientist is trying to model the relation between phytoplankton population in the city public water supply and concentration of two substances. The sample data is at : https://math.dartmouth.edu/~m50f17/phytoplankton.csv

where headers are

pop : population of phytoplankton (\(y\))
subs1 : concentration of substance-1 (\(x_1\))
subs2 : concentration of substance-2 (\(x_2\))

Lets consider a guessed model
\[ y = 200 + 10x_1 -15x_2 \] Below is the corresponding code to plot the scatter diagram and the above plane.

# Note: Run the following in R console if you get errors in plotting or library loading : 
#  install.packages("scatterplot3d") 
#  install.packages("plot3D") 

library("plot3D")
library("scatterplot3d")

# Loading data 
pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2

# Create a mesh
meshP <- mesh( seq(min(subs1),max(subs1),0.03)  , seq(min(subs2),max(subs2),0.03) )
x1Mesh <- meshP$x 
x2Mesh <- meshP$y 

myModel <- 200 + 10*x1Mesh  - 15 *x2Mesh   

# Below is the code to plot the scatter diagram with red markers and your model
# You need to set two variables before calling : 
# myModel : your model  
# PLOTTING # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 15 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

# You can also change the view angle to visually test the model 
# PLOTTING # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 125 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

Question-3

For the phytoplankton population data use regression model of the form \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon . \] Fit a multiple linear regression model (using (3.13) and hat matrix H). Plot scatter diagram and fitted plane. Print predicted coefficients. Calculate \(R^2\) and adjusted \(R^2\).

Answer:

# Computation part of the answer : 

X <- cbind(1,subs1, subs2)
betaHat <- solve( t(X) %*% X ) %*% t(X) %*% pop 
H <- X %*% solve( t(X) %*% X ) %*% t(X)
yHat <- H %*% pop 

myModel <- betaHat[1] + betaHat[2]*x1Mesh + betaHat[3]*x2Mesh

sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 15 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

SSt <- sum( (pop - mean(pop))^2 )
SSres <- sum( (pop - yHat)^2 );

n = length(pop)  
k = 2 

R2 <- 1 - (SSres / SSt)
R2Adj <-  1 - ( (SSres/(n -k-1)) / (SSt/(n-1)) )


cat ( " Fitted Coefficients =  " , betaHat , " 
 R2 = ", R2 , "
 R2 Adjusted : " , R2Adj)

##  Fitted Coefficients =   182.3174 11.92368 -12.64275  
##  R2 =  0.3609532 
##  R2 Adjusted :  0.3522587

Question-4

This time use a more general model for the phytoplankton population data ;
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_{1}^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon. \] Fit a multiple linear regression model (using (3.13) and hat matrix H). Plot scatter diagram and fitted surface. Print predicted coefficients. Calculate \(R^2\) and adjusted \(R^2\). Compare with the previous model.

Answer:

# Computation part of the answer : 

# We add following regressors to our model : 
x3 = subs1^2 
x4 = subs2^2 
x5 = subs1 * subs2 


X <- cbind(1,subs1, subs2, x3, x4, x5)
betaHat <- solve( t(X) %*% X ) %*% t(X) %*% pop 
H <- X %*% solve( t(X) %*% X ) %*% t(X)
yHat <- H %*% pop 

myModel <- betaHat[1] + betaHat[2]*x1Mesh + betaHat[3]*x2Mesh + betaHat[4]* (x1Mesh^2) + betaHat[5]* (x2Mesh^2) + betaHat[6]*(x1Mesh*x2Mesh)

sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 15 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

SSt <- sum( (pop - mean(pop))^2 )
SSres <- sum( (pop - yHat)^2 );

n = length(pop)  
k = 5

R2 <- 1 - (SSres / SSt)
R2Adj <-  1 - ( (SSres/(n -k-1)) / (SSt/(n-1)) )


cat ( " We observe visually that this is a much better fit. This is also confirmed by the significan improvement in R2 and adjusted R2 values.  
 Fitted Coefficients =  " , betaHat , " 
 R2 = ", R2 , "
 R2 Adjusted : " , R2Adj)

##  We observe visually that this is a much better fit. This is also confirmed by the significan improvement in R2 and adjusted R2 values.  
##  Fitted Coefficients =   -262.3596 62.80095 170.4656 -30.65678 -19.38499 3.688058  
##  R2 =  0.9431728 
##  R2 Adjusted :  0.9411997