# Solutions for Homework sheet 6

## Problem 1

### (a)

The alternatives for the test are

$H_0: \rho_{12}=0$, i.e., $Y_1$ and $Y_2$ are independent

$H_a: \rho_{12} \neq 0$, i.e., $Y_1$ and $Y_2$ are not independent

The test statistic is

$t^*=\frac{r_{12}\sqrt{n-2}}{\sqrt{1-r^2_{12}}}$

Decision rules are

If $|t^*| \leq t(1-\alpha/2;n-2)$, conclude $H_0$

If $|t^*| \gt t(1-\alpha/2;n-2)$, conclude $H_a$

In [1]:
import numpy as np
from scipy import stats

r12=0.61
n=84

tstar=(r12*np.sqrt(n-2))/np.sqrt(1-r12**2)
alpha=0.05
t=stats.t.ppf(1-alpha/2,n-2)

# fancy display of results
tstar0=format(tstar,'.3f');t0=format(t,'.3f');

from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))

$$t^*=6.971$$
$$t=1.989$$

We observe $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$, i.e., $Y_1$ and $Y_2$ are not independent.
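Equivalently, the decision can be based on a two-sided p-value; a minimal sketch using `scipy.stats` with the same $r_{12}=0.61$, $n=84$:

```python
import numpy as np
from scipy import stats

r12, n = 0.61, 84
tstar = r12 * np.sqrt(n - 2) / np.sqrt(1 - r12**2)

# two-sided p-value: P(|T| > |t*|) for T ~ t(n-2)
p_value = 2 * stats.t.sf(abs(tstar), n - 2)
# p_value is far below alpha = 0.05, matching the conclusion above
```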

### (b)

Interval estimation of $\rho_{12}$ using the $z'$ transformation

Fisher transformation:

$z' = \frac{1}{2}\ln\frac{1+r_{12}}{1-r_{12}} = \tanh^{-1}(r_{12})$

$E(z') = \zeta = \frac{1}{2}\ln\frac{1+\rho_{12}}{1-\rho_{12}} = \tanh^{-1}(\rho_{12})$

$\sigma(z') = \frac{1}{\sqrt{n-3}}$

And the $1-\alpha$ interval for $\zeta$ is:

$z' \pm z(1-\alpha/2)\sigma(z')$

To get the intervals for $\rho_{12}$ we use the transformation

$\rho_{12}=\tanh{\zeta}$ and $r_{12}=\tanh{z'}$

In [2]:
z_prime = np.arctanh(r12)
z = stats.norm.ppf(1 - (alpha/2))
sigma_of_z = 1/np.sqrt(n - 3)

ub0=z_prime + z*sigma_of_z;lb0=z_prime - z*sigma_of_z;
r12_uplim = np.tanh(ub0);r12_lowlim = np.tanh(lb0)

r12_uplim0 = format(r12_uplim,'.3f'); r12_lowlim0=format(r12_lowlim, '.3f')
from IPython.display import display, Math, Latex
display(Latex(r'The $95\%$ confidence interval for $\rho_{12}$ is  ' ))
display(Math( '['+str(r12_lowlim0)+r' ~,~ '+str(r12_uplim0)+']'))

The $95\%$ confidence interval for $\rho_{12}$ is
$$[0.455 ~,~ 0.729]$$

### (c)

In [3]:
r12_uplim0 = format(r12_uplim**2,'.3f'); r12_lowlim0=format(r12_lowlim**2, '.3f')

from IPython.display import display, Math, Latex
display(Latex(r'The $95\%$ confidence interval for $\rho^2_{12}$ is  ' ))
display(Math( '['+str(r12_lowlim0)+r' ~,~ '+str(r12_uplim0)+']'))

The $95\%$ confidence interval for $\rho^2_{12}$ is
$$[0.207 ~,~ 0.532]$$

$\rho^2_{12}$ may be interpreted as the proportionate reduction in the total variance of one variable associated with the use of the other variable. Here, if we use $Y_1$ (value of contract), then with 95% confidence we can explain between 20.7% and 53.2% of the variability in $Y_2$ (profit contribution generated by the contract). Since the correlation model treats both variables symmetrically, the statement remains true if we interchange the variables.

## Problem 2

In [4]:
from read_data_fx import read_tb_data
# `data` is loaded via read_tb_data (the call is not shown)

# percentage of individuals with high school diploma (X in the problem) is Y1
Y1=data[:,1]
# crime rate (Y in the problem) is Y2
Y2=data[:,0];


### (a)

In [5]:
# Pearson product-moment correlation coefficient function

def pearson_corrX(Y1,Y2):
    barY1=np.mean(Y1); barY2=np.mean(Y2)
    Y1minusbarY1=Y1-barY1; Y2minusbarY2=Y2-barY2
    r12=np.sum(Y1minusbarY1*Y2minusbarY2)/np.sqrt(np.sum(Y1minusbarY1**2)*np.sum(Y2minusbarY2**2))
    return r12

In [6]:
r12=pearson_corrX(Y1,Y2)
r12_0 = format(r12,'.4f');

display(Latex(r'Pearson product-moment correlation coefficient'))
display(Math('r_{12}= '+str(r12_0)))

Pearson product-moment correlation coefficient
$$r_{12}= -0.4127$$
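As a sanity check, `pearson_corrX` can be compared against NumPy's built-in `np.corrcoef`; a minimal sketch on synthetic data (the crime-rate data itself is not reproduced here):

```python
import numpy as np

def pearson_corrX(Y1, Y2):
    # same estimator as above: centered cross-product over the product of norms
    d1 = Y1 - np.mean(Y1); d2 = Y2 - np.mean(Y2)
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

rng = np.random.default_rng(0)
Y1 = rng.normal(size=50)
Y2 = -0.5 * Y1 + rng.normal(size=50)   # negatively correlated by construction

r_hand = pearson_corrX(Y1, Y2)
r_numpy = np.corrcoef(Y1, Y2)[0, 1]
assert np.isclose(r_hand, r_numpy)
```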

### (b)

The alternatives for the test are

$H_0: \rho_{12} = 0$ (crime rate and percentage of individuals with high school diploma are independent)

$H_a: \rho_{12} \neq 0$ (crime rate and percentage of individuals with high school diploma are not independent)

The test statistic

$t^* = \frac{r_{12}\sqrt{n-2}}{\sqrt{1-r^2_{12}}}$

Decision rules are

If $|t^*| \leq t(1-\alpha/2;n-2)$, conclude $H_0$

If $|t^*| > t(1-\alpha/2;n-2)$, conclude $H_a$

In [7]:
n=len(Y1)
tstar=(r12*np.sqrt(n-2))/np.sqrt(1-r12**2)
alpha=0.01
t=stats.t.ppf(1-alpha/2,n-2)

# fancy display of results
tstar0=format(tstar,'.3f');t0=format(t,'.3f');

from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))

$$t^*=-4.103$$
$$t=2.637$$

We observe $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$, i.e., crime rate and percentage of individuals with high school diploma are not independent.

## Problem 3

In [8]:
# read the data; `data` is loaded via read_tb_data (the call is not shown)

# ACT test score (X)
X=data[:,1]
# GPA at the end of the freshman year (Y)
Y=data[:,0];


### (a)

In [9]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(6,6))
plt.boxplot(X, sym = 'ko',whis = 1.5)
plt.xticks([1],['ACT scores'],fontsize=18)
plt.yticks(range(10,40,5),fontsize=18)
plt.ylim([10,37]);


There are no noteworthy features in the box plot of ACT scores (predictor variable). It appears symmetric with no outliers.
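The `whis=1.5` argument flags as outliers any points beyond $1.5\,\mathrm{IQR}$ of the quartiles; a minimal sketch of that rule on made-up scores (the actual ACT data is not reproduced here):

```python
import numpy as np

def iqr_outliers(x, whis=1.5):
    # points outside [Q1 - whis*IQR, Q3 + whis*IQR] are drawn as fliers
    q1, q3 = np.percentile(x, [25, 75])
    fence = whis * (q3 - q1)
    return x[(x < q1 - fence) | (x > q3 + fence)]

scores = np.array([14, 18, 20, 21, 22, 23, 24, 25, 26, 28, 30, 35])
print(iqr_outliers(scores))  # empty: no points beyond the whiskers
```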

### (b)

In [10]:
# linear regression function
def lin_reg(X,Y):
    barX=np.mean(X); barY=np.mean(Y)
    XminusbarX=X-barX; YminusbarY=Y-barY
    b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
    b0=barY-b1*barX
    Yhat=b0+b1*X
    e_i=Y-Yhat
    sse=np.sum(e_i**2)
    ssr=np.sum((Yhat-barY)**2)
    return {'Yhat':Yhat,'b0':b0,'b1':b1,'e_i':e_i,'sse':sse,'ssr':ssr}

In [11]:
#residuals
e_i = lin_reg(X,Y)['e_i']

# binning the residuals
n, bins = np.histogram(e_i, 25)

# bin centers
bincenters = 0.5*(bins[1:]+bins[:-1])

# figure size
plt.figure(figsize=(8,8))

j=0
for i in n:
    y=range(0,i); x=[bincenters[j]]*len(y); j=j+1
    plt.plot(x,y,'ro',ms=12)

plt.ylim([-2,16])

plt.xticks(np.arange(-3.0,1.5,1.0),fontsize=18) ;
plt.yticks(np.arange(-2.0,16,2.0),fontsize=18) ;

plt.xlabel('$e_i$',fontsize=28);


Most residuals are small and their distribution resembles a normal distribution. There are only a few large residuals, i.e., only a few points in the data show larger departures from the linear regression model.

### (c)

In [12]:
Yhat = lin_reg(X,Y)['Yhat']

plt.figure(figsize=(10,10))

plt.axhline(y = 0, color = 'r', linewidth=5)
plt.plot(Yhat,e_i,'co',ms=12);

plt.xlabel(r'$\hat{Y_i}$',fontsize=28);
plt.ylabel(r'$e_i$',fontsize=28);
plt.yticks(np.arange(-3.0,3.0,1.0),fontsize=18) ;
plt.xticks(np.arange(2.7,3.5,0.3),fontsize=18) ;
plt.xlim([2.6,3.6])
plt.ylim([-3,3]);

The majority of the residuals are evenly scattered around zero, indicating that the linear model is indeed appropriate for the given data. We also observe a few outliers.

### (d)

In [13]:
def normal_prob_plot(e_i):
    # expected values of the ordered residuals under normality,
    # scaled by sqrt(MSE), using (k - 0.375)/(n + 0.25) plotting positions
    n=len(e_i); sse=np.sum(e_i**2)
    mse_sqrt=np.sqrt(sse/(n-2.0))
    kr=stats.rankdata(e_i)
    e_ik=mse_sqrt*stats.norm.ppf((kr-0.375)/(n+0.25))
    return e_ik

In [14]:
e_ik=normal_prob_plot(e_i)
plt.plot(e_ik,e_i,'ko',ms=10)

Yhat_e=lin_reg(e_ik,e_i)['Yhat']
plt.plot(e_ik,Yhat_e,'r-',lw=1);

plt.title('Normal probability plot',fontsize=15)
plt.xlabel('Expected',fontsize=15);
plt.ylabel('Residual',fontsize=15);

In [15]:
def pearson_corrX(Y1,Y2):
    barY1=np.mean(Y1); barY2=np.mean(Y2)
    Y1minusbarY1=Y1-barY1; Y2minusbarY2=Y2-barY2
    r12=np.sum(Y1minusbarY1*Y2minusbarY2)/np.sqrt(np.sum(Y1minusbarY1**2)*np.sum(Y2minusbarY2**2))
    return r12

r12=pearson_corrX(e_i,e_ik);
display(Math(r'\rho='+str(r12)))
print('n =', len(X))

$$\rho=0.973727482928$$
n= 120


$H_0:$ Normal

$H_a:$ Not normal

If $\rho \geq \rho_c$, conclude $H_0$; otherwise conclude $H_a$.

$\rho_c$ is the critical value of the correlation coefficient between the ordered residuals and their expected values under normality, when the distribution of the error terms is normal. See Table B.6 in the textbook, or Looney and Gulledge, “Use of the Correlation Coefficient with Normal Probability Plots,” The American Statistician 39 (1985), 75–79.

$\rho_c>0.987$ for $n=120$ at $\alpha=0.05$

Here $\rho = 0.974$, i.e., $\rho \lt \rho_c$, therefore we conclude $H_a$: the residuals are not normally distributed at the $5\%$ significance level.
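For comparison, `scipy.stats` ships standard normality tests that can be applied directly to the residuals; a minimal sketch using the Shapiro-Wilk test on synthetic skewed data (not the actual residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
e = rng.exponential(size=120) - 1.0   # right-skewed, so normality should be rejected

stat, p = stats.shapiro(e)
# p is far below 0.05, so Shapiro-Wilk also rejects normality at the 5% level
```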

### (e)

Decision rule for Brown-Forsythe Test:

If $|t^*_{BF}| \leq t(1-\alpha/2;n-2)$, conclude the error variance is constant

If $|t^*_{BF}| > t(1-\alpha/2;n-2)$, conclude the error variance is not constant

where the test statistic is given by

$$t^*_{BF} = \frac{\bar{d_1}-\bar{d_2}}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
In [16]:
idxn1=np.nonzero(X < 26)[0]; idxn2=np.nonzero(X >= 26)[0]

n1=len(idxn1); n2=len(idxn2); n=len(X)

X1=X[idxn1]; X2=X[idxn2]
Y1=Y[idxn1]; Y2=Y[idxn2]
e_i1=e_i[idxn1]; e_i2=e_i[idxn2]

di1=abs(e_i1-np.median(e_i1))
di2=abs(e_i2-np.median(e_i2))
di1minusd1bar2=(di1-np.mean(di1))**2.0; di2minusd2bar2=(di2-np.mean(di2))**2;

s1=np.sum(di1minusd1bar2)+np.sum(di2minusd2bar2)

s=np.sqrt(s1/(n-2.0))

d12=np.mean(di1)-np.mean(di2);
tbfstar=d12/(s*np.sqrt(1.0/n1+1.0/n2))

alpha=0.05; t=stats.t.ppf(1-alpha/2.0,n-2);

In [17]:
alpha=0.01; t=stats.t.ppf(1-alpha/2.0,n-2);

t0=format(t,'.3f');

tbfstar0 = format(abs(tbfstar),'.4f')
display(Math(r'|t^*_{BF}|='+str(tbfstar0)))
display(Math(r't='+str(t0)))

$$|t^*_{BF}|=0.8967$$
$$t=2.618$$

We observe that $|t^*_{BF}| \leq t(1-\alpha/2;n-2)$, therefore we conclude that the error variance is constant.
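The hand-computed statistic can be cross-checked against `scipy.stats.levene` with `center='median'`, which is the Brown-Forsythe form of the test; for two groups its F statistic equals $t^{*2}_{BF}$. A minimal sketch on synthetic groups (not the actual residuals):

```python
import numpy as np
from scipy import stats

def bf_tstar(e1, e2):
    # two-group Brown-Forsythe t statistic, as computed in the cell above
    d1 = np.abs(e1 - np.median(e1)); d2 = np.abs(e2 - np.median(e2))
    n1, n2 = len(d1), len(d2)
    ss = np.sum((d1 - d1.mean())**2) + np.sum((d2 - d2.mean())**2)
    s = np.sqrt(ss / (n1 + n2 - 2))
    return (d1.mean() - d2.mean()) / (s * np.sqrt(1.0/n1 + 1.0/n2))

rng = np.random.default_rng(2)
e1 = rng.normal(size=60); e2 = rng.normal(size=60)

t_bf = bf_tstar(e1, e2)
W, p = stats.levene(e1, e2, center='median')   # Brown-Forsythe variant
assert np.isclose(t_bf**2, W)
```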