NOTE: For your homework download and use the template (

Read the green comments in the Rmd file to see where your answers should go.

Note: Read each question carefully and write your answers briefly, supporting your conclusions with plots and statistical quantities. You need to submit both the html file and the Rmd file. Before submitting, verify that your html file includes all necessary plots and that all necessary values are printed in it. Note that the questions below are open ended. There is no fixed “correct” answer. Try to use the various ideas and techniques we have seen in class.

In this project you will use the data below on house values in the city of Boston.

The Boston data frame has 506 rows and 13 columns.

This data frame contains the following columns:


crim
per capita crime rate by town.

zn
proportion of residential land zoned for lots over 25,000 sq.ft.

indus
proportion of non-retail business acres per town.

chas
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox
nitrogen oxides concentration (parts per 10 million).

rm
average number of rooms per dwelling.

age
proportion of owner-occupied units built prior to 1940.

dis
weighted mean of distances to five Boston employment centres.

rad
index of accessibility to radial highways.

tax
full-value property-tax rate per $10,000.

ptratio
pupil-teacher ratio by town.

lstat
lower status of the population (percent).

medv
median value of the house in $1000s.
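A quick way to confirm that the loaded data match this description is to inspect the data frame in R. The sketch below is only an illustration: it assumes the data correspond to `MASS::Boston` with the `black` column dropped, and the object name `boston` is a placeholder. The course template may load the data differently.

```r
# Assumption: MASS::Boston without its `black` column stands in for the
# 13-column data frame described above. `boston` is a placeholder name.
boston <- MASS::Boston[, setdiff(names(MASS::Boston), "black")]

dim(boston)           # number of rows and columns
names(boston)         # column names
str(boston)           # storage type of each column
summary(boston$medv)  # five-number summary of the response
```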

Your goal is to develop a model that predicts the median value of the house (medv). Start with the multiple linear regression model that uses all 12 regressors (this is your base model), then answer the questions below. In every part, write your model clearly. In addition to stating your justification clearly, print the critical values and display the plots you use.
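As a starting point, the base model and a printed critical value might look like the sketch below. It is not a prescribed solution: the data source (`MASS::Boston` with `black` dropped), the object names, and the significance level `alpha = 0.05` are all assumptions made for illustration.

```r
# Assumption: MASS::Boston without `black` stands in for the course data.
boston <- MASS::Boston[, setdiff(names(MASS::Boston), "black")]

# Base model: medv regressed on all 12 remaining columns.
base_fit <- lm(medv ~ ., data = boston)
summary(base_fit)

# Two-sided critical t value for tests on individual coefficients.
alpha  <- 0.05  # assumed significance level
t_crit <- qt(1 - alpha / 2, df = df.residual(base_fit))
print(t_crit)

# Coefficients whose |t| statistic exceeds the critical value.
t_stats <- summary(base_fit)$coefficients[, "t value"]
names(t_stats)[abs(t_stats) > t_crit]
```

Printing `t_crit` next to the coefficient table makes it easy to see in the html output which regressors pass the individual t-tests.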

  1. Fit a multiple linear regression model using all 12 regressors (base model).

  2. Give a model that extends the base model with the interaction terms crim\(\times\)age, rm\(\times\)tax, rm\(\times\)ptratio, tax\(\times\)ptratio, nox\(\times\)crim, nox\(\times\)age and 3 additional interaction terms of your choice. Check whether any of these interaction terms contribute to the model, and eliminate those that do not.

  3. Try various transformations on the base model, then propose a transformation (on the response variable medv or on the regressors) that you think might help to linearize the model (or to improve it). Then fit a model using this transformation.

  4. Eliminate the 6 regressors from the base model that you think are the least significant. (You can make a subjective choice, considering the nature of the data, as long as you support it. For example, you can run a few joint significance tests to support your choice.) Now, using the remaining 6 regressors, propose a polynomial model that includes quadratic terms and interaction terms. Then fit this model.

  5. Compare the performance of the models in parts 1 to 4. Look at the various diagnostics we have seen. Also check for violations of normality and constant variance. Make a comparison and support your comments with plots and statistics.

  6. Now, considering all of the above, propose a new model different from the one in part 1 (try a mixture of the suggestions above). Fit your model. Comment on the overall adequacy of your model compared with the ones above.

  7. Using your model from part 6, detect 3 points in the data that you think are most probably outliers but not influential points. Detect pure leverage points and influential points (if there are no such points, say that none were detected; if there are more than 3, report the 3 most significant). Calculate the R-Student residuals at the points you find in this part.

  8. Check for multicollinearity in the model from part 1, the model from part 4, and your model from part 6. Compare the differences in multicollinearity and discuss their possible causes.
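Several of the questions above rely on the same handful of tools: a partial F-test for added terms, R-Student residuals, leverage, Cook's distance, and variance inflation factors. The sketch below shows one possible way to compute them in base R, using the base model only. The data source (`MASS::Boston` with `black` dropped), the object names, the two interaction terms chosen, and the `2p/n` leverage cutoff (a common rule of thumb) are assumptions for illustration, not the required approach.

```r
# Assumption: MASS::Boston without `black` stands in for the course data.
boston   <- MASS::Boston[, setdiff(names(MASS::Boston), "black")]
base_fit <- lm(medv ~ ., data = boston)

# Partial F-test: do the added interaction terms jointly improve the fit?
# Only two of the listed interactions are used here, as an example.
inter_fit <- lm(medv ~ . + crim:age + rm:tax, data = boston)
anova_tab <- anova(base_fit, inter_fit)
print(anova_tab)

# Case diagnostics for outliers, leverage points, and influential points.
r_student <- rstudent(base_fit)        # R-Student (externally studentized) residuals
leverage  <- hatvalues(base_fit)       # diagonal of the hat matrix
cooks_d   <- cooks.distance(base_fit)  # Cook's distance

n <- nrow(boston)
p <- length(coef(base_fit))            # number of parameters, intercept included

head(sort(abs(r_student), decreasing = TRUE), 3)  # three largest |R-Student|
which(leverage > 2 * p / n)                       # rule-of-thumb leverage cutoff
head(sort(cooks_d, decreasing = TRUE), 3)         # three largest Cook's distances

# Variance inflation factors, computed as the diagonal of the inverse
# correlation matrix of the regressors (intercept column removed).
X   <- model.matrix(base_fit)[, -1]
vif <- setNames(diag(solve(cor(X))), colnames(X))
print(sort(vif, decreasing = TRUE))
```

The same calls can be applied to any other fitted `lm` object, which makes it straightforward to compare diagnostics across models. For models with polynomial or interaction terms, large VIF values partly reflect the built-in correlation between a regressor and its own powers or products.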