# read the data
from read_data_fx import read_tb_data
data = read_tb_data(1, 27)
# Age is X
X = data[:, 1]
# Muscle mass is Y
Y = data[:, 0]
# import the packages required
import numpy as np
import scipy as sc
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Carry out the linear regression
barX = np.mean(X)
barY = np.mean(Y)
XminusbarX = X - barX
YminusbarY = Y - barY
b1 = np.sum(XminusbarX * YminusbarY) / np.sum(XminusbarX**2)
b0 = barY - b1 * barX
Yhat = b0 + b1 * X
e_i = Y - Yhat  # Residuals
n = len(X)
SSE = np.sum(e_i**2)  # Error sum of squares
MSE = SSE / (n - 2)   # Mean squared error
s = np.sqrt(MSE)      # Residual standard error
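As a sanity check, the hand-rolled estimates above should agree with `scipy.stats.linregress`. A minimal sketch on synthetic stand-in data (the real X and Y come from read_tb_data and are not reproduced here; the parameters used to generate the fake data are hypothetical):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (hypothetical; the real values come from read_tb_data)
rng = np.random.default_rng(0)
X = rng.uniform(40, 79, size=60)                 # ages in the 40-79 range
Y = 156.0 - 1.2 * X + rng.normal(0, 8, size=60)  # muscle mass with noise

# Hand-rolled least squares, exactly as in the text
barX, barY = np.mean(X), np.mean(Y)
b1 = np.sum((X - barX) * (Y - barY)) / np.sum((X - barX) ** 2)
b0 = barY - b1 * barX

# The library fit agrees to machine precision
res = stats.linregress(X, Y)
assert np.isclose(b1, res.slope)
assert np.isclose(b0, res.intercept)
```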
The hypotheses are:
$H_0 : \beta_1 \geq 0$ (there is no negative linear relationship between amount of muscle mass and age).
$H_a : \beta_1 \lt 0$ (there is a negative linear relationship between amount of muscle mass and age).
The test statistic is
$t^{*}=\frac{b_1}{s\{b_1\}}$
where $s\{b_1\}=\sqrt{ \frac{MSE}{\sum{ {(X_i-\overline{X})}^2 } }}$.
The decision rule is:
If $t^* \geq t(\alpha;n-2)$, conclude $H_0$.
If $t^* \lt t(\alpha;n-2)$, conclude $H_a$.
ssb1 = np.sqrt(MSE / np.sum(XminusbarX**2))  # estimated standard error of b1
tstar = b1 / ssb1
from scipy import stats
# lower-tail critical value t(alpha; n-2) with alpha = 0.05
t1 = stats.t.ppf(0.05, n - 2)
print('tstar =', tstar, 'and t =', t1)
We observe above that $t^* \lt t(\alpha;n-2)$, hence we conclude $H_a$.
The P-value is $P\{t(n-2) \lt t^*\}$:
pval = stats.t.cdf(tstar, n - 2)
print('P-value = %4.4f' % pval)
We observe P-value $= 0.000$ (to four decimal places), so we conclude $H_a$ from this as well, because the P-value is less than $\alpha$.
This conclusion is not appropriate, for the following reasons:
$\beta_0$ corresponds to age 0, which is far outside the range of the predictor variable (40-79). Hence any conclusion about age 0 based on $\beta_0$ would not be reliable.
The slope of the fitted line is negative, which means that $\beta_0$ is a large positive number (its estimate is $b_0 = 156.3$). In other words, female infants would have substantially more muscle mass than adult women, which is physically unrealistic.
t1 = stats.t.ppf(1 - 0.05 / 2.0, n - 2)  # two-sided critical value t(1 - alpha/2; n-2)
ttsb = t1 * ssb1
ll = b1 - ttsb
ul = b1 + ttsb
print('95 percent confidence interval for beta_1')
print('%4.4f <= beta_1 <= %4.4f' % (ll, ul))
Here we are given that the P-value is $0.91$. The two-sided P-value $=2P\{t(n-2) \gt |t^*|\}$ is the probability of obtaining an estimate of $\beta_1$ as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. This implies that the estimated slope of $-0.18$ is not statistically significant, i.e., we fail to reject the null hypothesis $H_0:\beta_1=0$. Hence we cannot draw any worthwhile conclusion from the estimated parameters. Therefore the student's message, "The message I get here is that the more we spend on advertising this product, the fewer units we sell!", is wrong.
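A two-sided P-value of this form can be computed directly with scipy. A minimal sketch, using a hypothetical $t^*$ and sample size chosen for illustration (the exercise reports only the resulting P-value of about $0.91$):

```python
from scipy import stats

# Hypothetical values for illustration; the exercise gives only the P-value
tstar = -0.113
n = 100

# Two-sided P-value: 2 * P{ t(n-2) > |t*| }
pval = 2 * stats.t.sf(abs(tstar), n - 2)
print('P-value = %4.2f' % pval)
```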
For the normal error regression model, the sampling distribution of $b_1$ is normal, with mean and variance:
$E\{b_1\}=\beta_1$
$\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$
We know that $b_1$ can be written as a linear combination of the $Y_i$, i.e.,
$b_1=\sum{k_iY_i}$
where $k_i=\frac{X_i-\overline{X}}{\sum{(X_i-\overline{X})^2}}$.
Also,
$\sum{k_i}=0 \tag{1}$
$\sum{k_iX_i}=1 \tag{2}$
$\sum{k_i^2}=\frac{1}{\sum{(X_i-\overline{X})^2} } \tag{3}$
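Identities (1)-(3) can be verified numerically. A minimal sketch with arbitrary synthetic $X$ values:

```python
import numpy as np

X = np.array([43.0, 51.0, 62.0, 70.0, 78.0])  # arbitrary predictor values
Sxx = np.sum((X - X.mean()) ** 2)
k = (X - X.mean()) / Sxx

assert np.isclose(np.sum(k), 0)           # (1) sum k_i = 0
assert np.isclose(np.sum(k * X), 1)       # (2) sum k_i X_i = 1
assert np.isclose(np.sum(k**2), 1 / Sxx)  # (3) sum k_i^2 = 1 / Sxx
```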
$E\{b_1\}=E\{\sum{k_iY_i}\}=\sum{k_i E\{Y_i\}}=\sum{k_i(\beta_0+\beta_1X_i)}=\beta_0\sum{k_i}+\beta_1\sum{k_iX_i} \tag{4}$
Using (1) and (2) in the R.H.S. of (4), we get $E\{b_1\}=\beta_1$.
$\sigma^2\{b_1\}=\sigma^2\{\sum{k_iY_i}\}= \sum{k_i^2 \sigma^2 \{Y_i\} }= \sigma^2\sum{k_i^2} \tag{5} $
Substituting (3) into the R.H.S. of (5), we get $\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$.
We know $\sigma^2\{b_1\}=\frac{\sigma^2}{\sum{(X_i-\overline{X})^2}}$. To obtain $s^2\{b_1\}$, we replace $\sigma^2$ with MSE, the unbiased estimator of $\sigma^2$. Therefore, we get
$s^2\{b_1\} =\frac{MSE}{\sum{(X_i-\overline{X})^2}}$
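The two moments derived above can be checked by simulation. A minimal sketch, with arbitrary hypothetical true parameters: repeatedly sample from the normal error model at fixed $X$ values, refit $b_1$, and compare the empirical mean and variance with $\beta_1$ and $\sigma^2/\sum(X_i-\overline{X})^2$:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.linspace(40, 79, 16)               # fixed predictor values (arbitrary)
beta0, beta1, sigma = 156.3, -1.19, 8.0   # hypothetical true parameters
Sxx = np.sum((X - X.mean()) ** 2)

# Simulate many samples from the normal error model and refit b1 each time
b1s = []
for _ in range(20000):
    Y = beta0 + beta1 * X + rng.normal(0, sigma, size=X.size)
    b1s.append(np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx)
b1s = np.array(b1s)

print(b1s.mean(), beta1)          # empirical mean is close to beta1
print(b1s.var(), sigma**2 / Sxx)  # empirical variance is close to sigma^2 / Sxx
```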