Assumptions: Normal Error regression model
$Y_i=\beta_0+\beta_1 X_i+\epsilon_i$
where the $\epsilon_i \sim N(0,\sigma^2)$ are independent. We will use the previous example on cars and the distance covered to stop. Please see https://math.dartmouth.edu/~m50f15/leastsquares0.html. Upload the complete code (as shown below) for the least squares regression before proceeding.
# import the packages required
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt
# import the data set using statsmodel.api
cars_speed=sm.datasets.get_rdataset("cars", "datasets")
X=np.array(cars_speed.data['speed'])
Y=np.array(cars_speed.data['dist'])
barX=np.mean(X); barY=np.mean(Y)
XminusbarX=X-barX; YminusbarY=Y-barY
b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0=barY-b1*barX
Yhat=b0+b1*X
e_i=Y-Yhat # Residuals
sum_of_residuals = np.sum(e_i)
sum_of_squares_of_residuals = np.sum(e_i**2)
print('sum of residuals = %4.4f' % sum_of_residuals, ' ---property 1')
print('sum of squares of residuals = %4.4f' % sum_of_squares_of_residuals, ' ---property 2')
print('sum of Y_i = %4.4f' % np.sum(Y), ' ---property 3')
print('sum of Yhat = %4.4f' % np.sum(Yhat), ' ---property 3')
print('sum of X_i e_i = %4.4f' % np.sum(X*e_i), ' ---property 4')
print('sum of Yhat e_i = %4.4f' % np.sum(Yhat*e_i), ' ---property 5')
#make matplotlib inline
%matplotlib inline
#scatter plot
plt.scatter(cars_speed.data['speed'],cars_speed.data['dist'],c='b',s=60)
# xlable of the scatter plot
plt.xlabel('Speed')
# ylabel of the scatter plot
plt.ylabel('Distance to stop')
# title of the scatter plot
plt.title('Distance cars took to stop in 1920s')
# regression line
# point barX and barY
plt.plot(barX,barY,marker='*',markersize=20,color='orange')
#regression line
plt.plot(X,Yhat,'r-',linewidth=2)
print(' --- property 6 (orange point represents the point (xbar, ybar))')
n=len(X); SSE=sum_of_squares_of_residuals; MSE=sum_of_squares_of_residuals/(n-2); s=np.sqrt(MSE)
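As a sanity check, the closed-form estimates above can be compared against `np.polyfit`, which fits the same least-squares line. This is a minimal sketch on a small synthetic data set (not the cars data), so the numbers are illustrative only:

```python
import numpy as np

# synthetic data with a known linear trend (not the cars dataset)
rng = np.random.default_rng(0)
X = np.arange(10.0)
Y = 2.0 + 3.0 * X + rng.normal(0, 0.5, size=X.size)

# closed-form least-squares estimates, as in the main text
barX, barY = X.mean(), Y.mean()
b1 = np.sum((X - barX) * (Y - barY)) / np.sum((X - barX) ** 2)
b0 = barY - b1 * barX

# np.polyfit with degree 1 solves the same least-squares problem
slope, intercept = np.polyfit(X, Y, 1)
assert np.isclose(b1, slope) and np.isclose(b0, intercept)

# MSE divides by n-2 because two parameters (b0, b1) were estimated
e = Y - (b0 + b1 * X)
MSE = np.sum(e ** 2) / (X.size - 2)
```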
We first need the estimated standard deviation of $b_1$. It is given by:
$s\{b_1\}=\sqrt{ \frac{MSE}{\sum{ {(X_i-\overline{X})}^2 } }}$
ssb1=np.sqrt(MSE/sum(XminusbarX**2))
The $1-\alpha$ confidence interval for $\beta_1$ is given by:
$b_1 \pm t(1-\alpha/2;n-2)s\{b_1\}$,
where $t(1-\alpha/2;n-2)$ denotes the $100(1-\alpha/2)$th percentile of the $t$ distribution with $n-2$ degrees of freedom. We will set $\alpha=0.05$, i.e., we will find a $95\%$ confidence interval.
# Packages for importing t-statistics
from scipy import stats
# calculating t value
t1=stats.t.ppf(0.975,n-2) # ppf stands for Percent point function
tssb=t1*ssb1
print('confidence interval for beta_1 is %4.4f' % b1, '+/-', '%4.4f' % tssb)
llb1=b1-tssb; ulb1=b1+tssb; lra=tssb/b1
print('%4.4f <= beta_1' % llb1, '<= %4.4f' % ulb1, '; Ratio of the interval = %4.4f' % lra)
Therefore, the $95\%$ confidence interval for $\beta_1$ is:
$3.9324 - 0.8354 \leq \beta_1 \leq 3.9324 + 0.8354$
$3.0970 \leq \beta_1 \leq 4.7679$
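The $b_1 \pm t(1-\alpha/2;n-2)s\{b_1\}$ construction is exactly what `scipy.stats.t.interval` computes when given `loc=b1` and `scale=s{b1}`. A sketch on synthetic data (the variable names mirror the ones above, but the data is made up, so these bounds are not the cars numbers):

```python
import numpy as np
from scipy import stats

# deterministic synthetic data so the example is reproducible
X = np.arange(10.0)
Y = 2.0 + 3.0 * X + np.sin(X)
n = X.size

barX, barY = X.mean(), Y.mean()
Sxx = np.sum((X - barX) ** 2)
b1 = np.sum((X - barX) * (Y - barY)) / Sxx
b0 = barY - b1 * barX
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
sb1 = np.sqrt(MSE / Sxx)

# t.interval(0.95, df, loc, scale) returns (loc - t*scale, loc + t*scale)
lo, hi = stats.t.interval(0.95, n - 2, loc=b1, scale=sb1)
t1 = stats.t.ppf(0.975, n - 2)
assert np.isclose(lo, b1 - t1 * sb1) and np.isclose(hi, b1 + t1 * sb1)
```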
Next we test for
$H_0: \beta_1=0$
$H_a: \beta_1 \neq 0$
If $|t^*| \leq t(1-\alpha/2;n-2)$ then conclude $H_0$
If $|t^*| \gt t(1-\alpha/2;n-2)$ then conclude $H_a$
where $t^*=\frac{b_1}{s\{b_1\}}$
tstar=b1/ssb1; print('|tstar| = %4.4f' % abs(tstar), 'and', 't = %4.4f' % t1)
We observe that $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$
The two-sided p-value is given by $2P\{t(n-2) \gt |t^*|\}$
pval=2.0*stats.t.sf(abs(tstar),n-2); print('Two sided p-value = %4.4f' % pval)
As the two-sided p-value is lower than the significance level $\alpha=0.05$, we can conclude $H_a$ directly from this observation.
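For a cross-check, `scipy.stats.linregress` reports the same two-sided p-value for the default test of zero slope. Again a sketch on synthetic data, not the cars numbers:

```python
import numpy as np
from scipy import stats

X = np.arange(10.0)
Y = 2.0 + 3.0 * X + np.sin(X)
n = X.size

# closed-form t statistic for H0: beta_1 = 0
barX, barY = X.mean(), Y.mean()
Sxx = np.sum((X - barX) ** 2)
b1 = np.sum((X - barX) * (Y - barY)) / Sxx
b0 = barY - b1 * barX
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
sb1 = np.sqrt(MSE / Sxx)
tstar = b1 / sb1
pval = 2.0 * stats.t.sf(abs(tstar), n - 2)

# linregress tests slope = 0 (two-sided) with the same t distribution
res = stats.linregress(X, Y)
assert np.isclose(pval, res.pvalue)
```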
Next we test whether $\beta_1$ is positive:
$H_0: \beta_1 \leq 0$
$H_a: \beta_1 \gt 0$
If $t^* \leq t(1-\alpha;n-2)$ then conclude $H_0$
If $t^* \gt t(1-\alpha;n-2)$ then conclude $H_a$
where $t^*=\frac{b_1}{s\{b_1\}}$
t1=stats.t.ppf(0.95,n-2)
tstar=b1/ssb1; print('tstar = %4.4f' % tstar, 'and', 't = %4.4f' % t1)
We observe that $t^* \gt t(1-\alpha;n-2)$, therefore we conclude $H_a$
The one-sided p-value is given by $P\{t(n-2) \gt t^*\}$
pval=stats.t.sf(tstar,n-2); print('One sided p-value = %4.4f' % pval)
As the one-sided p-value is lower than the significance level $\alpha=0.05$, we can conclude $H_a$ directly from this observation.
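The critical-value rule and the p-value rule are equivalent: $t^* \gt t(1-\alpha;n-2)$ holds exactly when the one-sided p-value falls below $\alpha$. A small numeric illustration (the `df` value is arbitrary, not tied to the cars data):

```python
import numpy as np
from scipy import stats

df, alpha = 48, 0.05
tcrit = stats.t.ppf(1 - alpha, df)   # critical value t(1-alpha; df)

# the two decision rules agree for statistics on either side of tcrit
for tstar in (0.5, tcrit - 1e-6, tcrit + 1e-6, 5.0):
    p = stats.t.sf(tstar, df)        # one-sided p-value P{t(df) > tstar}
    assert (tstar > tcrit) == (p < alpha)
```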
First we need the estimated standard deviation of $b_0$:
$s\{b_0\}=\sqrt{MSE \left[\frac{1}{n}+\frac{\overline{X}^2}{\sum{ {(X_i-\overline{X})}^2 } } \right]}$
The $1-\alpha$ confidence interval for $\beta_0$ is given by:
$b_0 \pm t(1-\alpha/2;n-2)s\{b_0\}$.
sb0=np.sqrt(MSE*(1.0/n+barX**2/sum(XminusbarX**2)))
t1=stats.t.ppf(0.975,n-2) # ppf stands for Percent point function
tssb=t1*sb0
print('confidence interval for beta_0 is %4.4f' % b0, '+/-', '%4.4f' % tssb)
Therefore, the $95\%$ confidence interval for $\beta_0$ is:
$-17.5791 - 13.5888 \leq \beta_0 \leq -17.5791 + 13.5888$
$-31.1679 \leq \beta_0 \leq -3.9903$
Next we test for
$H_0: \beta_0=0$
$H_a: \beta_0 \neq 0$
If $|t^*| \leq t(1-\alpha/2;n-2)$ then conclude $H_0$
If $|t^*| \gt t(1-\alpha/2;n-2)$ then conclude $H_a$
where $t^*=\frac{b_0}{s\{b_0\}}$
tstar=b0/sb0
print('|tstar| =', abs(tstar))
print('t1 =', stats.t.ppf(0.975,n-2))
We observe that $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$
pval=2.0*stats.t.sf(abs(tstar),n-2); print('Two sided p-value = %4.4f' % pval)
As the two-sided p-value is lower than the significance level $\alpha=0.05$, we can conclude $H_a$ directly from this observation.