In [1]:

```
# read the data
from read_data_fx import read_tb_data

data = read_tb_data(1, 27)
X = data[:, 1]   # Age is X
Y = data[:, 0]   # Muscle mass is Y

# import the packages required
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Carry out the linear regression
barX = np.mean(X)
barY = np.mean(Y)
XminusbarX = X - barX
YminusbarY = Y - barY
b1 = np.sum(XminusbarX * YminusbarY) / np.sum(XminusbarX ** 2)
b0 = barY - b1 * barX
Yhat = b0 + b1 * X
e_i = Y - Yhat               # residuals
n = len(X)
SSE = np.sum(e_i ** 2)       # error sum of squares
MSE = SSE / (n - 2)          # mean squared error
s = np.sqrt(MSE)
```

The null and alternative hypotheses are:

$H_0 : \beta_1 \geq 0$ (there is no negative relationship between amount of muscle mass and age)

$H_a : \beta_1 \lt 0$ (there is a negative relationship between amount of muscle mass and age)

The test statistic is given by

$t^{*}=\frac{b_1}{s\{b_1\}}$

where $s\{b_1\}=\sqrt{ \frac{MSE}{\sum{ {(X_i-\overline{X})}^2 } }}$

The decision rules, where $t(\alpha;n-2)$ denotes the lower $\alpha$-percentile of the $t$ distribution with $n-2$ degrees of freedom (negative for small $\alpha$), are:

If $t^* \geq t(\alpha;n-2)$, conclude $H_0$

If $t^* \lt t(\alpha;n-2)$, conclude $H_a$

In [2]:

```
from scipy import stats

ssb1 = np.sqrt(MSE / np.sum(XminusbarX ** 2))   # s{b1}
tstar = b1 / ssb1                               # test statistic t*
t1 = stats.t.ppf(0.05, n - 2)                   # critical value t(alpha; n-2), alpha = 0.05
```

In [3]:

```
print('tstar =', tstar, 'and t =', t1)
```

We observe above that $t^* \lt t(\alpha;n-2)$, hence we conclude $H_a$.

The P-value = $P\{t(n-2) \lt t^*\}$

In [4]:

```
pval = stats.t.cdf(tstar, n - 2)
print('P-value = %4.4f' % pval)
```

We observe that the P-value $= 0.000$, so we conclude $H_a$ from this as well, because the P-value $< \alpha$.

This conclusion is not appropriate for the following reasons:

$\beta_0$ corresponds to age 0, which is far outside the range of the predictor variable (40-79). Hence any conclusion for age 0 based on $\beta_0$ would not be reliable.

The slope of the fitted line is negative, which means $\beta_0$ would be a large positive number (its estimate is $b_0 = 156.3$). In other words, female infants would have substantially more muscle mass than adult women, which is physically unrealistic.

In [5]:

```
t1 = stats.t.ppf(1 - 0.05 / 2.0, n - 2)   # t(1 - alpha/2; n-2)
ttsb = t1 * ssb1                          # half-width of the interval
ll = b1 - ttsb
ul = b1 + ttsb
print('95 percent confidence interval for beta_1')
print('%4.4f <= beta_1 <= %4.4f' % (ll, ul))
```

For the normal error regression model, the sampling distribution of $b_1$ is **normal** with mean and variance:

$E\{b_1\}=\beta_1$

$\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$

We know that $b_1$ can be written as a linear combination of the $Y_i$, i.e.,

$b_1=\sum{k_iY_i}$

where $k_i=\frac{X_i-\overline{X}}{\sum{(X_i-\overline{X})^2}}$.

Also,

$\sum{k_i}=0 \tag{1}$

$\sum{k_iX_i}=1 \tag{2}$

$\sum{k_i^2}=\frac{1}{\sum{(X_i-\overline{X})^2} } \tag{3}$
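These identities, and the fact that $\sum{k_iY_i}$ reproduces the least-squares slope, can be checked numerically. The sketch below uses synthetic data rather than the muscle-mass data set; the parameters in the simulated response are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(40, 80, size=30)                    # synthetic predictor values
Y = 156.3 - 1.19 * X + rng.normal(0, 8, size=30)    # hypothetical response

# the coefficients k_i from the definition above
k = (X - X.mean()) / np.sum((X - X.mean()) ** 2)

print(np.isclose(k.sum(), 0.0))                                        # identity (1)
print(np.isclose(np.sum(k * X), 1.0))                                  # identity (2)
print(np.isclose(np.sum(k ** 2), 1.0 / np.sum((X - X.mean()) ** 2)))   # identity (3)

# b1 as a linear combination of Y_i matches the least-squares slope
b1_ls = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
print(np.isclose(np.sum(k * Y), b1_ls))
```

All four checks print `True`; identity (1) is why $\sum{k_iY_i}$ is unaffected by shifting the $Y_i$ by a constant.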

- Derivation for $E\{b_1\}=\beta_1$:

$E\{b_1\}=E\{\sum{k_iY_i}\}=\sum{k_i E\{Y_i\}}=\sum{k_i(\beta_0+\beta_1X_i)}=\beta_0\sum{k_i}+\beta_1\sum{k_iX_i} \tag{4}$

Using (1) and (2) in the R.H.S. of (4) we get $E\{b_1\}=\beta_1$

- Derivation for $\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$:

$\sigma^2\{b_1\}=\sigma^2\{\sum{k_iY_i}\}= \sum{k_i^2 \sigma^2 \{Y_i\} }= \sigma^2\sum{k_i^2} \tag{5} $

Substituting (3) in R.H.S of (5) we will get $\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$

- Derivation for $s^2\{b_1\}$ (estimated variance of the sampling distribution of $b_1$):

We know $\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$. To obtain $s^2\{b_1\}$ we replace $\sigma^2$ with MSE, the unbiased estimator of $\sigma^2$. Therefore, we get

$s^2\{b_1\} =\frac{MSE}{\sum{(X_i-\overline{X})^2}}$
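A small Monte Carlo simulation can illustrate these sampling-distribution results. Everything below is a sketch with assumed parameter values ($\beta_0 = 156.3$, $\beta_1 = -1.19$, $\sigma = 8$, loosely inspired by the fit above), not output from the muscle-mass data:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 156.3, -1.19, 8.0   # assumed true parameters
X = rng.uniform(40, 80, size=60)          # fixed design, held constant across replications
Sxx = np.sum((X - X.mean()) ** 2)

# repeatedly draw Y from the normal error model and recompute b1
b1_samples = np.empty(20000)
for i in range(b1_samples.size):
    Y = beta0 + beta1 * X + rng.normal(0, sigma, size=X.size)
    b1_samples[i] = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx

print('mean of b1 :', b1_samples.mean())   # close to beta1 (unbiasedness)
print('var of b1  :', b1_samples.var())    # close to sigma^2 / Sxx
print('theory var :', sigma ** 2 / Sxx)
```

The simulated mean and variance agree with $E\{b_1\}=\beta_1$ and $\sigma^2\{b_1\}=\sigma^2/\sum{(X_i-\overline{X})^2}$ up to Monte Carlo error, and a histogram of `b1_samples` would show the normal shape.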