F Test for Lack of Fit

In [1]:
# read the bank-branch data: column 0 is the size of minimum deposit (X),
# column 1 is the number of new accounts (Y)
from read_data_fx import read_tb_dataTA
data=read_tb_dataTA(3,4)
X=data[:,0]
Y=data[:,1]
In [2]:
import numpy as np
import pandas as pd
n=len(X)

# tabulate the raw data, one row per branch
df = pd.DataFrame({'Branch': np.arange(1,n+1),'Size of minimum deposit':X,'Number of new Accounts':Y})
# reindex_axis is deprecated; reindex(columns=...) sets the column order
df=df.reindex(columns=['Branch','Size of minimum deposit','Number of new Accounts'])
df.set_index('Branch')
Out[2]:
        Size of minimum deposit  Number of new Accounts
Branch
1                           125                     160
2                           100                     112
3                           200                     124
4                            75                      28
5                           150                     152
6                           175                     156
7                            75                      42
8                           175                     124
9                           125                     150
10                          200                     104
11                          100                     136
In [3]:
def lin_reg(X,Y):
    # ordinary least squares fit of Y on X
    barX=np.mean(X); barY=np.mean(Y)
    XminusbarX=X-barX; YminusbarY=Y-barY
    b1=np.sum(XminusbarX*YminusbarY)/np.sum(XminusbarX**2)  # slope
    b0=barY-b1*barX                                         # intercept
    Yhat=b0+b1*X        # fitted values
    e_i=Y-Yhat          # residuals
    sse=np.sum(e_i**2)            # error sum of squares
    ssr=np.sum((Yhat-barY)**2)    # regression sum of squares

    return {'Yhat':Yhat,'b0':b0,'b1':b1,'e_i':e_i,'sse':sse,'ssr':ssr}

def disp_lin_eq(X,Y):
    # render the fitted regression equation as LaTeX
    linreg=lin_reg(X,Y)
    b0=linreg['b0'];b1=linreg['b1']
    b0d=format(b0,'.2f'); b1d=format(b1,'.2f')
    from IPython.display import display, Math, Latex
    display(Math(r'\hat{Y}='+str(b0d)+'+'+str(b1d)+r'X'))
In [4]:
disp_lin_eq(X,Y)
$$\hat{Y}=50.72+0.49X$$
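As a quick sanity check (a minimal sketch, assuming only NumPy, which is already imported), np.polyfit with degree 1 should reproduce these coefficients:

# np.polyfit(X, Y, 1) returns [slope, intercept]
b1_chk, b0_chk = np.polyfit(X, Y, 1)
print(format(b0_chk,'.2f'), format(b1_chk,'.2f'))   # expect 50.72 0.49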
In [5]:
import matplotlib.pyplot as plt 
%matplotlib inline 
linreg=lin_reg(X,Y)
b0=linreg['b0'];b1=linreg['b1'];

# scatter plot of the data with the fitted regression line
plt.figure(figsize=(10,8))
plt.plot(X,Y,'ro',ms=20)          # data points
plt.plot(X,b0+b1*X,'b-',lw=5);    # fitted line
plt.xlim(55,205);
plt.ylim(10,175);
plt.xlabel('Size of minimum deposit',fontsize=15);
plt.ylabel('Number of new Accounts',fontsize=15);

Step 1

Identify the unique levels of $X$ and the replicates at each level.

Notation:

$j$ indexes the $j$th unique level of $X$

$c$ is the total number of unique levels of $X$

$i$ indexes the replicates at a given level of $X$

$n$ is the total number of data points

$n_j$ is the number of replicates at level $j$ of $X$

In [6]:
X1=np.unique(X)   # the c unique levels of X
Yij=[]            # Yij[j] holds the replicate responses at level X1[j]
for j in range(len(X1)):
    Yij.append(Y[X==X1[j]].astype(int))
barYj=[]          # level means: barYj[j] = mean of Yij[j]
for j in range(len(X1)):
    barYj.append(np.mean(Yij[j]))


df = pd.DataFrame({'$j$':np.arange(1,len(X1)+1), '$X_j$':X1, '$Y_{ij}$':Yij, r'$\overline{Y_j}$':barYj})
df.set_index('$j$')
Out[6]:
     $X_j$    $Y_{ij}$  $\overline{Y_j}$
$j$
1       75    [28, 42]               35
2      100  [112, 136]              124
3      125  [160, 150]              155
4      150       [152]              152
5      175  [156, 124]              140
6      200  [124, 104]              114
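
The same level means can be cross-checked with a pandas groupby (a minimal sketch using the X and Y arrays loaded in In [1]):

# group the responses by the levels of X; the means should match barYj above
raw = pd.DataFrame({'X': X, 'Y': Y})
print(raw.groupby('X')['Y'].mean())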

Step 2

Full Model:

$Y_{ij}=\mu_j+\epsilon_{ij}$

  • $\mu_j$ are parameters, $j=1,2,\ldots,c$, and the $\epsilon_{ij}$ are independent $N(0,\sigma^2)$
  • $E\{Y_{ij}\}=\mu_j$
  • The maximum likelihood estimate of $\mu_j$ is $\hat\mu_j=\overline{Y_j}$

The pure error sum of squares $SSPE$ is equal to $SSE_F$,

where $SSE_F=\sum_j\sum_i(Y_{ij}-\overline{Y_j})^2$, i.e.,

$$SSPE=\sum_j\sum_i(Y_{ij}-\overline{Y_j})^2 \tag{1} $$
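
For example, at level $j=1$ ($X_j=75$) the replicates are $28$ and $42$ with $\overline{Y}_1=35$, so this level contributes $(28-35)^2+(42-35)^2=98$ to $SSPE$.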

Degrees of freedom for the full model $df_F=n-c$

Pure error mean square is given by

$$MSPE=\frac{SSPE}{n-c} \tag{2}$$
In [7]:
# calculate the pure error sum of squares, eq. (1),
# and the pure error mean square, eq. (2)
sspe=0.0
for yij,ybar in zip(Yij,barYj):
    sspe+=np.sum((yij-ybar)**2)   # within-level squared deviations


n=len(X); c=len(X1)   # n data points, c unique levels of X
mspe=sspe/(n-c)       # pure error mean square
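
From the table in Out [6], $SSPE=98+288+50+0+512+200=1148$ with $n-c=11-6=5$ degrees of freedom, so $MSPE=1148/5=229.6$.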

Step 3

Reduced model:

$Y_{ij}=\mu_j+\epsilon_{ij}$, but with the constraint $\mu_j=\beta_0+\beta_1X_j$

General linear test approach

$H_0: E\{Y\}=\beta_0+\beta_1X$

$H_a: E\{Y\} \neq \beta_0+\beta_1X$

$SSE_R$ is equal to $SSE$ i.e., $SSE_R=\sum_j\sum_i(Y_{ij}-\hat{Y}_{ij})^2$

$$SSE=\sum_j\sum_i(Y_{ij}-\hat{Y}_{ij})^2 \tag{3} $$

Degrees of freedom for the reduced model $df_R=n-2$

Lack of fit sum of squares is

$$ SSLF=SSE-SSPE \tag{4}$$

Equivalently, a direct formula for $SSLF$ is

$$SSLF=\sum_j\sum_i(\overline{Y}_{j}-\hat{Y}_{ij})^2 \tag{5} $$

The degrees of freedom associated with $SSLF$ are $df_R-df_F=(n-2)-(n-c)=c-2$

Lack of fit mean square

$$MSLF=\frac{SSLF}{c-2} \tag{6}$$
In [8]:
# lack of fit sum of squares, eq. (4), and its mean square, eq. (6)
sse=lin_reg(X,Y)['sse']   # SSE of the reduced (linear) model
sslf=sse-sspe             # SSLF = SSE - SSPE
mslf=sslf/(c-2.0)         # lack of fit mean square
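
As a consistency check (a sketch using only quantities already defined above), computing $SSLF$ directly from eq. (5) agrees with the subtraction in eq. (4):

# direct computation of SSLF from eq. (5)
Yhat=lin_reg(X,Y)['Yhat']
sslf_direct=sum(np.sum((barYj[j]-Yhat[X==X1[j]])**2) for j in range(c))
print(np.isclose(sslf,sslf_direct))   # should print True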

Step 4

Test Statistic

$$F^*=\frac{MSLF}{MSPE} \tag{7}$$

If $F^* \leq F(1-\alpha; c-2,n-c)$ , conclude $H_0$

If $F^* \gt F(1-\alpha; c-2,n-c)$ , conclude $H_a$
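
Here $c=6$ unique levels and $n=11$ observations, so the reference distribution is $F(c-2,\,n-c)=F(4,\,5)$.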

In [9]:
from scipy import stats
alpha=0.01
Fstar=mslf/mspe                   # test statistic, eq. (7)
F=stats.f.ppf(1-alpha,c-2,n-c)    # critical value F(1-alpha; c-2, n-c)


# display the test statistic and the critical value
Fstar0=format(Fstar,'.3f');F0=format(F,'.3f');

from IPython.display import display, Math, Latex
display(Math(r'F^*='+str(Fstar0)))
display(Math(r'F='+str(F0)))
$$F^*=14.801$$
$$F=11.392$$

We observe $F^* \gt F$, therefore we conclude $H_a$; that is, the regression function is not linear.

$p\text{-value} = P\{F(c-2,\,n-c) \gt F^*\}$

In [10]:
# p-value: upper-tail probability of F(c-2, n-c) beyond the observed Fstar
pval=1.0-stats.f.cdf(Fstar,c-2,n-c)
pval0=format(pval,'.3f')
display(Math(r'p-value='+str(pval0)))
$$p-value=0.006$$
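
As an optional cross-check (a sketch that assumes the statsmodels package is installed; it is not otherwise used in this notebook), comparing the reduced linear model against the full cell-means model with anova_lm reproduces the same $F^*$ and p-value:

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

d=pd.DataFrame({'X':X,'Y':Y})
reduced=smf.ols('Y ~ X',data=d).fit()    # reduced model: E{Y} = beta0 + beta1*X
full=smf.ols('Y ~ C(X)',data=d).fit()    # full model: a separate mean per level of X
print(anova_lm(reduced,full))            # F should be about 14.801, p about 0.006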