# Solutions for Homework sheet2¶

## Problem 1¶

### Part (a)¶

In [1]:
# read the data
# Age is X
X=data[:,1]
# Muscle mass is Y
Y=data[:,0];

In [2]:
# import the packages required
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt

barX=np.mean(X); barY=np.mean(Y)
XminusbarX=X-barX; YminusbarY=Y-barY
b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0=barY-b1*barX
Yhat=b0+b1*X
e_i=Y-Yhat # Residuals
sum_of_residuals = np.sum(e_i)
sum_of_squares_of_residuals = np.sum(e_i**2)

print 'check all the six properties for the fitted line'

print 'sum of residuals = %4.4f' % sum_of_residuals, ' ---property 1'
print 'sum of squares of residuals %4.4f' %sum_of_squares_of_residuals, '  ---property 2'
print 'sum of Y_i %4.4f' % np.sum(Y), '  ---property 3'
print 'sum of Yhat %4.4f' % np.sum(Yhat), '  ---property 3'
print 'sum of X_ie_i %4.4f' % np.sum(X*e_i), '  ---property 4'
print 'sum of Yhat e_i %4.4f' % np.sum(Yhat*e_i), '   ---property 5'

%matplotlib inline
plt.figure(figsize=(20,10))
# scatter plot
plt.scatter(X,Y,c='g',s=220,marker='s')
#make matplotlib inline
# xlabel of the scatter plot
plt.xlabel('Age',fontsize=25)
# ylabel of the scatter plot
plt.ylabel('Muscle Mass',fontsize=25)
# title of the scatter plot
plt.title('Age Vs. Muscle Mass',fontsize=25)
# regression line
# point barX and barY
plt.plot(barX,barY,marker='*',markersize=70,color='b')
#regression line
plt.plot(X,Yhat,'r-',linewidth=2)
print 'Blue point repersents the point xbar and ybar -- property 6'

check all the six properties for the fitted line
sum of residuals = 0.0000  ---property 1
sum of squares of residuals 3874.4475   ---property 2
sum of Y_i 5098.0000   ---property 3
sum of Yhat 5098.0000   ---property 3
sum of X_ie_i 0.0000   ---property 4
sum of Yhat e_i 0.0000    ---property 5
Blue point repersents the point xbar and ybar -- property 6

##### Estimated parameters for the line¶
In [3]:
print 'b0=%4.5f' % b0, 'and', 'b1=%4.5f' % b1

b0=156.34656 and b1=-1.19000

##### Does a linear regression function appear to give a good fit here? Does your plot support the anticipation that muscle mass decreases with age?¶

Linear regression function does appear to be good fit. We observe a linear decrease in muscle mass with age. We obtain a negative value for $b_1$, which suports that with age muscle mass decreases.

### Part b¶

##### (1) A point estimate of the difference in the mean muscle mass for women differing in age by one year.¶

We know $\hat{Y}_i= b_0+b_1{X}_i$ and $\hat{Y}_{i+1}-\hat{Y}_i = b_1({X}_{i+1}-X_i)$. We are given the age difference to be 1 year i.e., here ${X}_{i+1}-X_i=1$. There in the given case we will have $\hat{Y}_{i+1}-\hat{Y}_i = b_1$. Therefore, the difference in mean muscle mass for women differing in age by one year is $b_1=âˆ’1.19$

##### (2) Point estimate of the mean muscle mass for women aged X=60 years.¶

This will be given by $b_0+b_1 60$

In [4]:
Yhat_60 = b0 + 60*b1
print 'Point estimate of the mean muscle mass for women aged X=60 years is %4.2f' %Yhat_60

Point estimate of the mean muscle mass for women aged X=60 years is 84.95

##### (3) The value of the residual for the eighth case¶
In [5]:
print 'value of the residual for the eighth case=%4.3f' %e_i[7]

value of the residual for the eighth case=4.443

##### (4) A point estimate of $\sigma^2$¶

Point estimate for $\sigma^2$ is given by

$s^2=MSE=\frac{SSE}{n-2}=\frac{\sum{(Y_i-\hat{Y_i})^2}}{n-2}$

In [6]:
n=len(X); SSE=sum_of_squares_of_residuals; MSE=sum_of_squares_of_residuals/(n-2);
print 'point estimate of \sigma^2 =%4.1f' %MSE

point estimate of \sigma^2 =66.8


## Problem 2¶

In [7]:
#read the data file
with open('temp1.txt','r') as f:
f.close()

# Removing '\n' and splitting a line in columns
linesX = lines[1].strip('\n')
line = linesX.split('\t')

# Copy data into a numpy arary

line=[];xdata = np.zeros(len(lines)-1,dtype={'names':['temp', 'lat', 'long'], 'formats':['i4', 'f8', 'f8']})
for j in range(1,len(lines)):
line=lines[j].strip('\n');line=line.split('\t');
xdata[j-1]=(line[1],line[2],line[3])

# latitude is X (predictor variable), temprature is Y(response variable) and Z is longitude (color)
X = xdata['lat']; Y = xdata['temp']; Z= xdata['long']

In [8]:
# Estimate the regression function

barX = np.mean(X); barY = np.mean(Y)
XminusbarX = X - barX; YminusbarY = Y - barY
b1 = sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0 = barY - b1*barX
Yhat=b0+b1*X

print 'b0=%4.5f' % b0, 'and', 'b1=%4.5f' % b1

b0=108.72774 and b1=-2.10959

###### c. Show that your fitted line goes through the point $\overline{X}$,$\overline{Y}$¶
In [9]:
plt.figure(figsize=(20,10))
plt.scatter(X,Y,c =Z,s = 150,cmap='winter')

plt.xlabel('Latitude', fontsize=25)
plt.ylabel('Temperature',fontsize=25)
plt.title('Regression of latitude vs temperature',fontsize=25)
cb=plt.colorbar()
cb.set_label('Longitude',fontsize=25)
# point barX and barY
#regression line
plt.plot(X,Yhat,'k-',linewidth=2)
plt.plot(barX,barY,marker='*',markersize=40,color='r')

Out[9]:
[<matplotlib.lines.Line2D at 0x10f5de590>]
##### b. Plot distribution of residuals¶
In [10]:
# residulas
e_i= Y - Yhat

# Distribution of residulas

n, bins, patches=plt.hist(e_i,100,normed='true',color='cyan')
plt.xlabel('Residuals',fontsize=25)
plt.ylabel('Frequency',fontsize=25)

Out[10]:
<matplotlib.text.Text at 0x10f98aa50>

## Problem 3¶

### Part (a)¶

Derivation of $b_0$ and $b_1$:

To find the estimate of the regression parameter $\beta_0$ and $\beta_1$ we use the method of least squares. Where we minimize the quantity:

$$Q=\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2$$

Let say the minimum of $Q$ for $\beta_0$ and $\beta_1$ occurs at $b_0$ and $b_1$ respectively then

$\frac{\partial{Q}}{\partial{\beta_0}} \bigg|_{\beta_0=b_0} =0$ and $\frac{\partial{Q}}{\partial{\beta_1}} \bigg|_{\beta_1=b_1} =0$ and we know that

$$\frac{\partial{Q}}{\partial{\beta_0}}=-2\sum(Y_i-\beta_0-\beta_1X_i)$$

$$\frac{\partial{Q}}{\partial{\beta_1}}=-2\sum X_i(Y_i-\beta_0-\beta_1X_i).$$

Therefore,

-2\sum(Y_i-b_0-b_1X_i)=0
\tag{1}

-2\sum X_i(Y_i-b_0-b_1X_i)=0
\tag{2}

Above two equations can be simplified into

###### normal equations¶

\sum Y_i= nb_0+b_1\sum X_i
\tag{3}

\sum X_iY_i=b_0\sum Xi+b_1\sum {X_i}^2
\tag{4}

From equation (3) we obtain

b_0=\frac{1}{n}\bigg(\sum Y_i-b_1\sum X_i\bigg)

That is

$$\boxed{b_0=\overline{Y}-b_1\overline{X}}$$

From equation (4) on subsituting $b_0$ we obtain

$b_1=\frac{\sum{X_i}{Y_i}- \overline{Y}\sum{X_i}}{\sum {X_i}^2-\overline{X} \sum{X}_i} = \frac{\sum{X_i}{Y_i}-n\overline{X}\overline{Y}}{\sum {X_i}^2-n\overline{X}^2} = \frac{\sum({X_i}{Y_i}-X_i\overline{Y})+ \sum\overline{X}\overline{Y} -\sum Y_i\overline{X} }{\sum ({X_i}^2-X_i\overline{X})+ \sum\overline{X}^2-\sum X_i\overline{X} }$

This expression can be further simplified into

$$\boxed{b_1=\frac{\sum(X_i-\overline{X})(Y_i-\overline{Y})}{\sum(X_i-\overline{X})^2} }$$

#### Residuals¶

Residuals are defined as $e_i=Y_i-\hat{Y_i}$, where $\hat{Y_i}=b_0+b_1X_i$.

##### Property 1¶

$$\sum{e_i}=0$$

Proof:

$\sum{e_i}=\sum(Y_i-\hat{Y_i})=\sum(Y_i-b_0-b_1X_i)=n\overline{Y}-nb_0-nb_1\overline{X}=n\overline{Y}-n(\overline{Y}-b_1\overline{X})-nb_1\overline{X}=0$

##### Property 2¶

$\sum{e_i}^2$ is minimum

Proof:

$b_0$ and $b_1$ are those values of $\beta_0$ and $\beta_1$ respectively that minimizes $\sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2$. Therefore, $\sum_{i=1}^{n} (Y_i-b_0-b_1X_i)^2$ is minimum i.e., $\sum{e_i}^2$ is minimum.

##### Property 3¶

$$\sum Y_i=\sum \hat{Y}_i$$

Proof:

From property 2 we know $\sum{e_i}=0$ or $\sum{(Y_i-\hat{Y_i})}=0$ implying that

$\sum Y_i=\sum \hat{Y}_i$

##### Property 4¶

$$\sum{X_ie_i}=0$$

Proof:

$\sum{X_ie_i}=\sum{X_i(Y_i-\hat{Y_i})}= \sum{X_i (Y_i-b_0-b_1X_i)}$

We see from Eq. (2) above that $\sum{X_i (Y_i-b_0-b_1X_i)}=0$, therefore we can conclude that $\sum{X_ie_i}=0$

##### Property 5¶

$$\sum{\hat{Y_i}e_i}=0$$

Proof:

$\sum{\hat{Y_i}e_i}=\sum{(b_0+b_1X_i)e_i } =b_0\sum {e_i} + b_1 \sum {X_ie_i}$

From property 1 and property 4 we know that $\sum {e_i}=0$ and $\sum {X_ie_i}=0$, hence

$\sum{\hat{Y_i}e_i}=0$

##### Property 6¶

The regression line always goes through the point $(\overline{X},\overline{Y})$

Proof:

$\hat{Y_i}=b_0+b_1X_i$, after substituting for $b_0$ then we have $\hat{Y_i}=\overline{Y}-b_1 \overline{X} +b_1X_i$ i.e.,

$\hat{Y_i}=\overline{Y}+b_1(X_i-\overline{X})$

This equation is satisfied for $(\overline{X},\overline{Y})$, therefore we can conclude that regression $\hat{Y_i}=b_0+b_1X_i$ always goes through $(\overline{X},\overline{Y})$.

### Part (b)¶

The normal error regression model is as follows:

$$Y_i=\beta_0+\beta_1X_i+\epsilon_i$$

Where $Y_i$ is the observed response in the $i$th trial

$X_i$ is the level of predictor variable in the $i$th trial

$\beta_0$ and $\beta_1$ are parameters

$\epsilon_i$ are independent $N(0,\sigma^2)$

$i=1,\dots,n$

Maximum likelihood estimator for $\sigma^2$ is

$\hat{\sigma}^2 =\frac{\sum{(Y_i-\hat{Y_i})^2}}{n}$ and $\hat{\sigma}^2$ is a biased estimator with n degrees of freedom.

Least squares estimator for $\sigma^2$ is

$s^2=\frac{\sum{(Y_i-\hat{Y_i})^2}}{n-2}$ and $s^2$ is an unbiased estimator with n-2 degrees of freedom.