Assumptions: Normal Error regression model
$Y_i=\beta_0+\beta_1 X_i+\epsilon_i$
where the $\epsilon_i \sim N(0,\sigma^2)$ are independent. We will use the previous example on cars and the distance covered to stop. Please see https://math.dartmouth.edu/~m50f15/leastsquares0.html. Upload the complete code (as shown below) for the least squares regression before proceeding.
# import the packages required
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt
# import the data set using statsmodel.api
cars_speed=sm.datasets.get_rdataset("cars", "datasets")
X=np.array(cars_speed.data['speed'])
Y=np.array(cars_speed.data['dist'])
barX=np.mean(X); barY=np.mean(Y)
XminusbarX=X-barX; YminusbarY=Y-barY
b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0=barY-b1*barX
Yhat=b0+b1*X
e_i=Y-Yhat # Residuals
sum_of_residuals = np.sum(e_i)
sum_of_squares_of_residuals = np.sum(e_i**2)
print('sum of residuals = %4.4f' % sum_of_residuals, ' ---property 1')
print('sum of squares of residuals = %4.4f' % sum_of_squares_of_residuals, ' ---property 2')
print('sum of Y_i = %4.4f' % np.sum(Y), ' ---property 3')
print('sum of Yhat = %4.4f' % np.sum(Yhat), ' ---property 3')
print('sum of X_i e_i = %4.4f' % np.sum(X*e_i), ' ---property 4')
print('sum of Yhat e_i = %4.4f' % np.sum(Yhat*e_i), ' ---property 5')
#make matplotlib inline
%matplotlib inline
#scatter plot
plt.scatter(cars_speed.data['speed'],cars_speed.data['dist'],c='b',s=60)
# xlable of the scatter plot
plt.xlabel('Speed')
# ylabel of the scatter plot
plt.ylabel('Distance to stop')
# title of the scatter plot
plt.title('Distance cars took to stop in 1920s')
# regression line
# point barX and barY
plt.plot(barX,barY,marker='*',markersize=20,color='orange')
#regression line
plt.plot(X,Yhat,'r-',linewidth=2)
print(' --- property 6 (orange point represents the point (xbar, ybar))')
n=len(X); SSE=sum_of_squares_of_residuals; MSE=sum_of_squares_of_residuals/(n-2); s=np.sqrt(MSE)
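As a sanity check, the closed-form estimates above can be compared against `np.polyfit`, which fits the same least-squares line. This is a minimal sketch on a small synthetic data set (not the cars data), so the numbers are illustrative only:

```python
import numpy as np

# synthetic data with a known linear trend (not the cars dataset)
rng = np.random.default_rng(0)
X = np.arange(10.0)
Y = 2.0 + 3.0 * X + rng.normal(0, 0.5, size=X.size)

# closed-form least-squares estimates, as in the main text
barX, barY = X.mean(), Y.mean()
b1 = np.sum((X - barX) * (Y - barY)) / np.sum((X - barX) ** 2)
b0 = barY - b1 * barX

# np.polyfit with degree 1 solves the same least-squares problem
slope, intercept = np.polyfit(X, Y, 1)
assert np.isclose(b1, slope) and np.isclose(b0, intercept)

# MSE divides by n-2 because two parameters (b0, b1) were estimated
e = Y - (b0 + b1 * X)
MSE = np.sum(e ** 2) / (X.size - 2)
```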
We first need the estimated standard deviation of $b_1$. It is given by:
$s\{b_1\}=\sqrt{ \frac{MSE}{\sum{ {(X_i-\overline{X})}^2 } }}$
ssb1=np.sqrt(MSE/sum(XminusbarX**2))
The $1-\alpha$ confidence interval for $\beta_1$ is given by:
$b_1 \pm t(1-\alpha/2;n-2)s\{b_1\}$,
where $t(1-\alpha/2;n-2)$ denotes the $100(1-\alpha/2)$th percentile of the $t$ distribution with $n-2$ degrees of freedom. We will set $\alpha=0.05$, i.e., we will find a $95\%$ confidence interval.
# Packages for importing t-statistics
from scipy import stats
# calculating t value
t1=stats.t.ppf(0.975,n-2) # ppf stands for Percent point function
tssb=t1*ssb1
print('confidence interval for beta_1 is %4.4f' % b1, '+/-', '%4.4f' % tssb)
llb1=b1-tssb; ulb1=b1+tssb; lra=tssb/b1
print('%4.4f <= beta_1' % llb1, '<= %4.4f' % ulb1, '; Ratio of the interval = %4.4f' % lra)
Therefore, the $95\%$ confidence interval for $\beta_1$ is:
$3.9324 - 0.8354 \leq \beta_1 \leq 3.9324 + 0.8354$
$3.0970 \leq \beta_1 \leq 4.7679$
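The $b_1 \pm t(1-\alpha/2;n-2)s\{b_1\}$ construction is exactly what `scipy.stats.t.interval` computes when given `loc=b1` and `scale=s{b1}`. A sketch on synthetic data (the variable names mirror the ones above, but the data is made up, so these bounds are not the cars numbers):

```python
import numpy as np
from scipy import stats

# deterministic synthetic data so the example is reproducible
X = np.arange(10.0)
Y = 2.0 + 3.0 * X + np.sin(X)
n = X.size

barX, barY = X.mean(), Y.mean()
Sxx = np.sum((X - barX) ** 2)
b1 = np.sum((X - barX) * (Y - barY)) / Sxx
b0 = barY - b1 * barX
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
sb1 = np.sqrt(MSE / Sxx)

# t.interval(0.95, df, loc, scale) returns (loc - t*scale, loc + t*scale)
lo, hi = stats.t.interval(0.95, n - 2, loc=b1, scale=sb1)
t1 = stats.t.ppf(0.975, n - 2)
assert np.isclose(lo, b1 - t1 * sb1) and np.isclose(hi, b1 + t1 * sb1)
```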
Next we test for
$H_0: \beta_1=0$
$H_a: \beta_1 \neq 0$
If $|t^*| \leq t(1-\alpha/2;n-2)$ then conclude $H_0$
If $|t^*| \gt t(1-\alpha/2;n-2)$ then conclude $H_a$
where $t^*=\frac{b_1}{s\{b_1\}}$
tstar=b1/ssb1; print('|tstar| = %4.4f' % abs(tstar), 'and', 't = %4.4f' % t1)
We observe that $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$
The two-sided p-value is given by $2P\{t(n-2) \gt |t^*|\}$
pval=2.0*stats.t.sf(abs(tstar),n-2); print('Two sided p-value = %4.4f' % pval)
As the two-sided p-value is lower than the significance level $\alpha=0.05$, we can conclude $H_a$ directly from this observation.
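For a cross-check, `scipy.stats.linregress` reports the same two-sided p-value for the default test of zero slope. Again a sketch on synthetic data, not the cars numbers:

```python
import numpy as np
from scipy import stats

X = np.arange(10.0)
Y = 2.0 + 3.0 * X + np.sin(X)
n = X.size

# closed-form t statistic for H0: beta_1 = 0
barX, barY = X.mean(), Y.mean()
Sxx = np.sum((X - barX) ** 2)
b1 = np.sum((X - barX) * (Y - barY)) / Sxx
b0 = barY - b1 * barX
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
sb1 = np.sqrt(MSE / Sxx)
tstar = b1 / sb1
pval = 2.0 * stats.t.sf(abs(tstar), n - 2)

# linregress tests slope = 0 (two-sided) with the same t distribution
res = stats.linregress(X, Y)
assert np.isclose(pval, res.pvalue)
```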
Next we test whether $\beta_1$ is positive:
$H_0: \beta_1 \leq 0$
$H_a: \beta_1 \gt 0$
If $t^* \leq t(1-\alpha;n-2)$ then conclude $H_0$
If $t^* \gt t(1-\alpha;n-2)$ then conclude $H_a$
where $t^*=\frac{b_1}{s\{b_1\}}$
t1=stats.t.ppf(0.95,n-2)
tstar=b1/ssb1; print('tstar = %4.4f' % tstar, 'and', 't = %4.4f' % t1)
We observe that $t^* \gt t(1-\alpha;n-2)$, therefore we conclude $H_a$
The one-sided p-value is given by $P\{t(n-2) \gt t^*\}$
pval=stats.t.sf(tstar,n-2); print('One sided p-value = %4.4f' % pval)
As the one-sided p-value is lower than the significance level $\alpha=0.05$, we can conclude $H_a$ directly from this observation.
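The critical-value rule and the p-value rule are equivalent: $t^* \gt t(1-\alpha;n-2)$ holds exactly when the one-sided p-value falls below $\alpha$. A small numeric illustration (the `df` value is arbitrary, not tied to the cars data):

```python
import numpy as np
from scipy import stats

df, alpha = 48, 0.05
tcrit = stats.t.ppf(1 - alpha, df)   # critical value t(1-alpha; df)

# the two decision rules agree for statistics on either side of tcrit
for tstar in (0.5, tcrit - 1e-6, tcrit + 1e-6, 5.0):
    p = stats.t.sf(tstar, df)        # one-sided p-value P{t(df) > tstar}
    assert (tstar > tcrit) == (p < alpha)
```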
First we need the estimated standard deviation of $b_0$:
$s\{b_0\}=\sqrt{MSE \left[\frac{1}{n}+\frac{\overline{X}^2}{\sum{ {(X_i-\overline{X})}^2 } } \right]}$
The $1-\alpha$ confidence interval for $\beta_0$ is given by:
$b_0 \pm t(1-\alpha/2;n-2)s\{b_0\}$.
sb0=np.sqrt(MSE*(1.0/n+barX**2/sum(XminusbarX**2)))
t1=stats.t.ppf(0.975,n-2) # ppf stands for Percent point function
tssb=t1*sb0
print('confidence interval for beta_0 is %4.4f' % b0, '+/-', '%4.4f' % tssb)
Therefore, the $95\%$ confidence interval for $\beta_0$ is:
$-17.5791 - 13.5888 \leq \beta_0 \leq -17.5791 + 13.5888$
$-31.1679 \leq \beta_0 \leq -3.9903$
Next we test for
$H_0: \beta_0=0$
$H_a: \beta_0 \neq 0$
If $|t^*| \leq t(1-\alpha/2;n-2)$ then conclude $H_0$
If $|t^*| \gt t(1-\alpha/2;n-2)$ then conclude $H_a$
where $t^*=\frac{b_0}{s\{b_0\}}$
tstar=b0/sb0
print('|tstar| =', abs(tstar))
print('t1 =', stats.t.ppf(0.975,n-2))
We observe that $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$
pval=2.0*stats.t.sf(abs(tstar),n-2); print('Two sided p-value = %4.4f' % pval)
As the two-sided p-value is lower than the significance level $\alpha=0.05$, we can conclude $H_a$ directly from this observation.