NOTE: For your homework, download and use the template (https://math.dartmouth.edu/~m50f17/FinalProject.Rmd).
Read the green comments in the Rmd file to see where your answers should go.
The Boston data frame has 506 rows and 13 columns (https://math.dartmouth.edu/~m50f17/Boston.csv).
This data frame contains the following columns:
crim
per capita crime rate by town.
zn
proportion of residential land zoned for lots over 25,000 sq.ft.
indus
proportion of non-retail business acres per town.
chas
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox
nitrogen oxides concentration (parts per 10 million).
rm
average number of rooms per dwelling.
age
proportion of owner-occupied units built prior to 1940.
dis
weighted mean of distances to five Boston employment centres.
rad
index of accessibility to radial highways.
tax
full-value property-tax rate per $10,000.
ptratio
pupil-teacher ratio by town.
lstat
lower status of the population (percent).
medv
median value of the house in $1000s.
Your goal is to develop a model that predicts the median value of the house (medv). You start with the multiple linear regression model using all 12 regressors (this is your base model). Answer the questions below. In all parts, write your model clearly. In addition to writing your justification clearly, print the critical values and display the plots you use.
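As a starting point for the base model described above, here is a minimal sketch in base R. The helper names `fit_base_model` and `critical_values` are hypothetical (they are not part of the template), and the sketch assumes the CSV has a header row whose column names match the list above.

```r
# Regressor names as listed in the data description above.
regressors <- c("crim", "zn", "indus", "chas", "nox", "rm", "age",
                "dis", "rad", "tax", "ptratio", "lstat")

# Hypothetical helper: fit medv on all 12 regressors by ordinary least squares.
fit_base_model <- function(data) {
  missing_cols <- setdiff(c(regressors, "medv"), names(data))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  lm(reformulate(regressors, response = "medv"), data = data)
}

# Hypothetical helper: two-sided t and upper-tail F critical values at level alpha.
# Assumes the fit has full rank, so fit$rank counts the intercept plus all regressors.
critical_values <- function(fit, alpha = 0.05) {
  k <- fit$rank - 1                  # number of regressors (intercept excluded)
  df_res <- df.residual(fit)         # residual degrees of freedom
  c(t = qt(1 - alpha / 2, df_res),
    F = qf(1 - alpha, k, df_res))
}
```

With the course data this might be used as `boston <- read.csv("https://math.dartmouth.edu/~m50f17/Boston.csv")`, then `fit <- fit_base_model(boston)`, followed by `summary(fit)` and `critical_values(fit)`. Selecting the columns by name, rather than using `medv ~ .`, guards against an extra index column in the file, should there be one.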
(a) Fit a multiple linear regression model using all 12 regressors (the base model).
(b) Give a model that extends the base model with the interaction terms crim\(\times\)age, rm\(\times\)tax, rm\(\times\)ptratio, tax\(\times\)ptratio, nox\(\times\)crim, nox\(\times\)age and 3 additional interaction terms of your choice. Check whether any of these interaction terms contribute to the model, and eliminate those that do not.
(c) Try various transformations on the base model, then propose a transformation (on the response variable medv or on the regressors) that you think might help to linearize the model (or to improve it). Then fit a model using this transformation.
(d) Eliminate the 6 regressors from the base model that you think are the least significant. (You may make a subjective choice, considering the nature of the data, as long as you support it. For example, you can run a few joint significance tests to support your choice.) Now, using the remaining 6 regressors, propose a polynomial model that includes quadratic terms and interaction terms. Then fit this model.
(e) Compare the performance of the models in parts (a) to (d). Look at the various diagnostics we have seen. Also check for violations of normality and constant variance. Support your comparison and comments with plots and statistics.
(f) Now, considering all of the above, propose a new model different from the one in part (a) (try a mixture of the suggestions above). Fit your model. Comment on the overall adequacy of your model compared with the ones above.
(g) Using your model in part (f), detect the 3 points in the data that you think are most likely outliers but not influential points. Detect pure leverage points and influential points (if there are no such points, say that none were detected; if there are more than 3, report the 3 most significant). Calculate the R-Student residuals at the points you find in this part.
(h) Check for multicollinearity in the model from part (a), the model from part (d), and your model from part (f). Compare the differences in multicollinearity and discuss its possible causes.
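The outlier, leverage, influence, and multicollinearity checks asked for above can be sketched in base R as follows. This is only an illustration: the helper names `point_diagnostics` and `vif_values` are hypothetical, and the cutoffs (leverage above 2p/n, Cook's distance above 1, a t critical value for R-Student) are common rules of thumb assumed here. Use the cutoffs from the course notes if they differ.

```r
# Hypothetical helper: per-observation diagnostics for a fitted lm object.
point_diagnostics <- function(fit, alpha = 0.05) {
  n <- nobs(fit)
  p <- fit$rank                      # number of estimated coefficients
  out <- data.frame(
    rstudent = rstudent(fit),        # externally studentized (R-Student) residuals
    leverage = hatvalues(fit),       # diagonal elements of the hat matrix
    cooks_d  = cooks.distance(fit)   # Cook's distance
  )
  # Rule-of-thumb flags (assumed cutoffs, see the lead-in).
  out$large_resid   <- abs(out$rstudent) > qt(1 - alpha / 2, n - p - 1)
  out$high_leverage <- out$leverage > 2 * p / n
  out$influential   <- out$cooks_d > 1
  out
}

# Hypothetical helper: variance inflation factors, computed as the diagonal
# of the inverse correlation matrix of the model-matrix columns.
vif_values <- function(fit) {
  X <- model.matrix(fit)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  vifs <- diag(solve(cor(X)))
  names(vifs) <- colnames(X)
  vifs
}
```

In this reading, a point flagged by `large_resid` but not by `influential` is a candidate outlier, and a point flagged by `high_leverage` with a small residual and small Cook's distance is a candidate pure leverage point. Because `vif_values` works on the model matrix, quadratic and interaction columns each get their own VIF, which is useful when comparing a polynomial model with the base model.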