# Solutions exam 2¶

In [1]:
from read_data_fx import read_tb_data
data=read_tb_data(1,20)  # data from problem 1.20
Y=data[:,0] # total time spent in servicing (min)
X=data[:,1] # No. of copier serviced

In [2]:
import numpy as np
import matplotlib.pylab as plt
from scipy import stats as st

def lin_reg(X,Y):
barX=np.mean(X); barY=np.mean(Y)
XminusbarX=X-barX; YminusbarY=Y-barY
b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
b0=barY-b1*barX
Yhat=b0+b1*X
e_i=Y-Yhat
sse=np.sum(e_i**2)
ssr=np.sum((Yhat-barY )**2)

return {'Yhat':Yhat,'b0':b0,'b1':b1,'e_i': e_i,'sse':sse,'ssr':ssr}

In [3]:
# plott the fitted line and data
import matplotlib.pyplot as plt
%matplotlib inline
linreg=lin_reg(X,Y)
b0=linreg['b0'];b1=linreg['b1']; Yhat=linreg['Yhat'];

plt.figure(figsize=(10,8))
plt.plot(X,Y,'bo',ms=20)
plt.plot(X,Yhat ,'r-',lw=5);
plt.xlim(0.1,11); plt.ylim(0,170);
plt.xlabel('No. of copier serviced',fontsize=15);
plt.ylabel('Total time spent in servicing (min)   ',fontsize=15);


## (a)¶

We are given that mean required time should not increase by more than 14 minutes for each additional copier that is serviced on a service call. This implies we need to test the following alternatives:

$H_0: \beta_1 \leq 14$

$H_a: \beta_1 \gt 14$

The test statistics for this case will be

$t^{*}=\frac{b_1-14}{s\{b_1\}}$

Where $s\{b_1\}=\sqrt{ \frac{MSE}{\sum{ {(X_i-\overline{X})}^2 } }}$

Decision rules for t:

If $t^* \leq t(1-\alpha;n-2)$ then conclude $H_0$

If $t^* \gt t(1-\alpha;n-2)$ then conclude $H_a$

In [4]:
n=len(X); sse=linreg['sse']; XminusbarX=X-np.mean(X)
MSE=sse/np.float32(n-2)

ssb1=np.sqrt(MSE/sum(XminusbarX**2))
tstar=(b1-14.0)/ssb1;
from scipy import stats
# calculating t value
t1=stats.t.ppf(1-0.05,n-2)

# fancy display of realtionships
tstar0=format(tstar,'.3f');t0=format(t1,'.3f');

from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))

$$t^*=2.143$$
$$t=1.681$$

We observe that $t^* \gt t$ therefore we conclude $H_a$

The P-value = $P\{t(n-2) \gt t^*\}$

In [5]:
pval=1.0-stats.t.cdf(tstar,n-2);
pval0=format(pval,'.3f')
display(Math(r'p-value='+str(pval0)))

$$p-value=0.019$$

We observe p-value is less than the significance level of 0.05 hence we conclude $H_a$ directly

## (b)¶

In [6]:
barX=np.mean(X)
s_of_yh_hat=np.sqrt(MSE*(1.0/n+(X-barX)**2/sum(XminusbarX**2)))
W=np.sqrt(2.0*stats.f.ppf(0.90,2,n-2))
cb_upper=Yhat+W*s_of_yh_hat
cb_lower=Yhat-W*s_of_yh_hat

plt.figure(figsize=(10,8))
plt.plot(X,Y,'ko',ms=20)
plt.plot(X,Yhat ,'r-',lw=5);
plt.xlim(0.1,11); plt.ylim(0,170);
plt.xlabel('No. of copier serviced',fontsize=15);
plt.ylabel('Total time spent in servicing (min)   ',fontsize=15);

# X values are not sorted in this example, sort it to plot the confidence band properly
idx=np.argsort(X)

plt.plot(X[idx],cb_lower[idx],'b--',linewidth=5)
plt.plot(X[idx],cb_upper[idx],'b--',linewidth=5);

In [7]:
Xh=6.0
s_of_pred=np.sqrt(MSE*(1.0+1.0/n+(Xh-barX)**2/sum(XminusbarX**2)))
t1=stats.t.ppf(1-0.1/2.0,n-2)

Yhat_h=b0+b1*Xh

print Yhat_h,s_of_pred,t1

ub0=Yhat_h+t1*s_of_pred ;lb0=Yhat_h-t1*s_of_pred; ub0 = format(ub0,'.2f'); lb0=format(lb0, '.2f')

from IPython.display import display, Math, Latex

display(Math(str(lb0)+r' \leq \hat{Y}_h \leq'+str(ub0)))

89.6313315927 9.0222276439 1.68107070185

$$74.46 \leq \hat{Y}_h \leq104.80$$

## (c)¶

Breusch-Pagan test assumes that the variance of the term $\epsilon_i$ denoted by $\sigma_i^2$ is related to the level of $X$ in the following way:

$$\ln \sigma_i^2 = \gamma_0+\gamma_1X_i$$

Alternatives in this test are:

$H_0 : \gamma_1=0$ (Constant error variance)

$H_a: \gamma_1 \neq 0$ (Not Constant error variance)

Test statistics is

$\chi_{BP}^2=\frac{SSR^*}{2} \div \bigg(\frac{SSE}{n}\bigg)^2$; $SSR^*$ is regression sum of squares when regressing $e^2$ on $X$.

We know $\chi_{BP}^2 \sim \chi(1)$

Decision rules are:

$\chi_{BP}^2 \leq \chi^2(1-\alpha,1)$ conclude $H_0$

$\chi_{BP}^2 \gt \chi^2(1-\alpha,1)$ conclude $H_a$

In [8]:
# Breusch-Pagan Test
sse=lin_reg(X,Y)['sse'];e_i=lin_reg(X,Y)['e_i']

ssrstar=lin_reg(X,e_i**2)['ssr']
Xchi2=(ssrstar/2)/(sse/n)**2

Xchi02 = format(Xchi2,'.3f')
display(Math(r'\chi_{BP}^2='+str(Xchi02)))

chisq=stats.chi2.ppf(0.95,1)
chisq = format(chisq,'.3f')
display(Math(r'\chi^2='+str(chisq)))

$$\chi_{BP}^2=1.315$$
$$\chi^2=3.841$$

We observe that $\chi_{BP}^2 \lt \chi^2(1-\alpha,1)$, therefore we conclude $H_0$ i.e., error variance is constant.

## (d)¶

In [9]:
def normal_prob_plot(e_i):
n=len(e_i);sse=np.sum(e_i**2)
mse_sqrt=np.sqrt(sse/(n-2.0))
from scipy import stats
kr=stats.rankdata(e_i)
e_ik=np.zeros(n)
e_ik=mse_sqrt*stats.norm.ppf((kr-0.375)/(n+0.25))
return e_ik
def corrX(Y1,Y2):
barY1=np.mean(Y1); barY2=np.mean(Y2)
Y1minusbarY1=Y1-barY1; Y2minusbarY2=Y2-barY2
r12=np.sum(Y1minusbarY1*Y2minusbarY2)/np.sqrt(np.sum(Y1minusbarY1**2)*np.sum(Y2minusbarY2**2))
return r12

In [10]:
e_i=linreg['e_i']
e_ik=normal_prob_plot(e_i)
plt.plot(e_ik,e_i,'ko',ms=10)
plt.xlim(-22,22);
plt.ylim(-30,30);

linreg1=lin_reg(e_ik,e_i)
Yhat_e=linreg1['Yhat'];
plt.plot(e_ik,Yhat_e,'r-',lw=1)

plt.title('Normal probability plot',fontsize=15)
plt.xlabel('Expected',fontsize=15);
plt.ylabel('Residual',fontsize=15);

In [11]:
r12=corrX(e_i,e_ik);
display(Math(r'\rho='+str(r12)))

$$\rho=0.989207915242$$

$H_0:$ Normal

$H_a:$ Not Normal

$\rho \geq \rho_c$ conclude H_0 else conclude $H_a$

$\rho_c$ is the critical values for coefficient of correlation between ordered residuals and expected values under normality when distribution of error terms is normal. TABLE B.6 IN TEXTBOOK OR “Use of the Correlation Coefficient with Normal Probability Plots,” The American Statistician, 75-79

$\rho_c=0.974$ for $n=45$ at $\alpha=0.05$

We have $\rho \gt \rho_c$ therefore we conclude $H_0$

### (e)¶

Alternatives

$H_0: E\{Y\}=\beta_0+\beta_1X$

$H_a: E\{Y\} \neq \beta_0+\beta_1X$

Test Statistics

$F^*=\frac{MSLF}{MSPE}$

If $F^* \leq F(1-\alpha; c-2,n-c)$ , conclude $H_0$

If $F^* \gt F(1-\alpha; c-2,n-c)$ , conclude $H_a$

$p-value = P\{F^* \lt F \}$

In [12]:
import pandas as pd
X1=np.unique(X)
Yij=[]
for i in range(0,len(X1)):
yij=np.int32(Y[X==X1[i]])
Yij.append(yij)
barYj=[];
for i in range (0,len(X1)):
y0=np.mean(Yij[i])
barYj.append(y0)

df = pd.DataFrame({'$j$':np.arange(1,len(X1)+1), '$X_j$' : X1[:],'$Y_{ij}$': Yij[:],'$\overline{Y_j}$': barYj[:]})
df.set_index('$j$')

Out[12]:
$X_j$ $Y_{ij}$ $\overline{Y_j}$
$j$
1 1 [12, 4, 3, 27] 11.500000
2 2 [20, 41, 32, 18, 20, 28, 34, 27] 27.500000
3 3 [46, 36] 41.000000
4 4 [60, 72, 57, 57, 61] 61.400000
5 5 [68, 89, 66, 74, 73, 90, 86, 77] 77.875000
6 6 [93, 96] 94.500000
7 7 [105, 101, 109, 112, 111, 112] 108.333333
8 8 [100, 131, 123] 118.000000
9 9 [144, 134, 132, 131] 135.250000
10 10 [137, 156, 127] 140.000000
In [13]:
# calculate msspe
yijminusbarYj=[];i=0
for yij in Yij:
yijminusbarYj_temp=(yij-barYj[i])**2
yijminusbarYj.append(yijminusbarYj_temp)
i=i+1
yijminusbarYj=np.array(yijminusbarYj)

sspe=0
for i in range(0,len(X1)):
sspe1=np.sum(yijminusbarYj[i])
sspe=sspe+sspe1

n=len(X); c=len(X1)
msspe=sspe/(n-c)

# calculate msslf
sse=lin_reg(X,Y)['sse']
sslf=sse-sspe
mslf=sslf/(c-2.0)

alpha=0.05
Fstar=mslf/msspe
F=stats.f.ppf(1-alpha,c-2,n-c)

# fancy display of realtionships
Fstar0=format(Fstar,'.3f');F0=format(F,'.3f');

display(Math(r'F^*='+str(Fstar0)))
display(Math(r'F='+str(F0)))

$$F^*=0.968$$
$$F=2.217$$

We observe $F^* \lt F$ therefore we conclude $H_0$. That is regression function is not linear