This document has comments, solutions and marking criteria for the practical of the module MATH3714/5714M. I would expect submitted solutions to be shorter, and to have more discussion/explanations but less R code than these notes.

Task 1 (6 marks)

We start our analysis by loading the training data set into R:

train <- read.csv("https://teaching.seehuhn.de/data/housing/train.csv")

Model Fitting

Fitting a model which describes medianHouseValue as a function of medianIncome is straightforward:

m <- lm(medianHouseValue ~ medianIncome, data = train)
summary(m)
##
## Call:
## lm(formula = medianHouseValue ~ medianIncome, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -515117  -51344  -13978   36193  370777
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   43720.2     1919.1   22.78   <2e-16 ***
## medianIncome  40179.5      482.1   83.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74180 on 9836 degrees of freedom
## Multiple R-squared:  0.4139, Adjusted R-squared:  0.4139
## F-statistic:  6947 on 1 and 9836 DF,  p-value: < 2.2e-16

The resulting model is \[ y = 43{,}720.2 + 40{,}179.5 \, x + \varepsilon, \] where \(x\) is the median income for each area, \(y\) is the median house price for the area, and \(\varepsilon \sim \mathcal{N}(0, 74{,}180^2)\).

Regression Diagnostics

Regression diagnostics are discussed in sections 9 and 11 of the lecture notes. Here are some measures we have discussed (others are possible, too).

Coefficient of Multiple Determination

The \(R^2\) value states how much of the fluctuations in the output can be explained by the inputs, as a ratio between 0 and 1. Larger values indicate better model fit.

From the summary output above we see that our model has \(R^2 = 0.4139\). This means that the variance of the residuals is nearly \(60\%\) of the variance of the median house prices. Hopefully a better model can be found in task 2.

Residual plots

This is a plot of the residuals against the fitted values:

plot(fitted(m), resid(m), cex = .1)
abline(h = 0)

If the model fits well, the residuals in this plot should form a horizontal band of constant width, centred around 0. Here it seems that the variance of the residuals around \(\hat y = 20{,}000\) might be larger than near the boundaries, and possibly the mean of the residuals is negative for small fitted values. From the plot, model fit is not very good.

Q-Q plot

A Q-Q plot compares the empirical quantiles of the residuals to the quantiles of the fitted normal distribution. A straight line indicates good model fit.

qqnorm(resid(m), main = NULL)
qqline(resid(m))

From the plot it seems clear that the residuals are not normally distributed.

Confidence Interval

Methods for computing confidence intervals are discussed in section 7.1 of the lecture notes. Here we use the summary(m) output to get all required information, but different ways of computing confidence intervals can of course also be used.

summary(m)
##
## Call:
## lm(formula = medianHouseValue ~ medianIncome, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -515117  -51344  -13978   36193  370777
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   43720.2     1919.1   22.78   <2e-16 ***
## medianIncome  40179.5      482.1   83.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74180 on 9836 degrees of freedom
## Multiple R-squared:  0.4139, Adjusted R-squared:  0.4139
## F-statistic:  6947 on 1 and 9836 DF,  p-value: < 2.2e-16
alpha <- 0.05
n <- nrow(train)
p <- 1
t <- qt(1 - alpha/2, n - p - 1)
cat("[", 43720.2 - 1919.1*t, ", ", 43720.2 + 1919.1*t, "]\n", sep = "")
## [39958.37, 47482.03]

\[\phantom{+}\]

Marking Criteria

  • The data has been loaded correctly.
  • The resulting model is correct and explicitly stated (including the error variance).
  • There is some discussion of the model fit, including at least one of the measures discussed in the lecture notes.
  • The confidence interval is computed correctly.

\[\phantom{+}\]

Task 2 (12 marks)

This task is very open-ended, and I expect that different students will come up with different solutions. Here I will just give some ideas and examples.

Baseline Model

As a baseline, we fit the model which uses all input variables, without any transformations or interaction terms:

m0 <- lm(medianHouseValue ~ ., data = train)
summary(m0)
##
## Call:
## lm(formula = medianHouseValue ~ ., data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -542334  -38926   -8820   29096  365582
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -3.345e+06  7.877e+04 -42.467  < 2e-16 ***
## longitude        -3.981e+04  8.983e+02 -44.312  < 2e-16 ***
## latitude         -3.934e+04  8.438e+02 -46.616  < 2e-16 ***
## housingMedianAge  9.895e+02  5.606e+01  17.650  < 2e-16 ***
## totalRooms       -7.969e+00  1.030e+00  -7.740 1.09e-14 ***
## totalBedrooms     8.901e+01  9.238e+00   9.635  < 2e-16 ***
## population       -3.693e+01  1.527e+00 -24.178  < 2e-16 ***
## households        6.672e+01  9.984e+00   6.683 2.47e-11 ***
## medianIncome      3.930e+04  5.332e+02  73.699  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60790 on 9829 degrees of freedom
## Multiple R-squared:  0.6067, Adjusted R-squared:  0.6064
## F-statistic:  1895 on 8 and 9829 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) value is \(0.6064\), which is a significant improvement from the model in task 1 (where we got 0.4139). We also check the residual plot for this model:

plot(fitted(m0), resid(m0), cex = .1)
abline(h = 0)