This document has comments, solutions and marking criteria for the practical of the module MATH3714/5714M. I would expect submitted solutions to be shorter, and to have more discussion/explanations but less R code than these notes.

We start our analysis by loading the training data set into R:

`train <- read.csv("https://teaching.seehuhn.de/data/housing/train.csv")`

Fitting a model which describes `medianHouseValue` as a function of `medianIncome` is straightforward:

```
m <- lm(medianHouseValue ~ medianIncome, data = train)
summary(m)
```

```
## 
## Call:
## lm(formula = medianHouseValue ~ medianIncome, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -515117  -51344  -13978   36193  370777 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   43720.2     1919.1   22.78   <2e-16 ***
## medianIncome  40179.5      482.1   83.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74180 on 9836 degrees of freedom
## Multiple R-squared:  0.4139, Adjusted R-squared:  0.4139 
## F-statistic:  6947 on 1 and 9836 DF,  p-value: < 2.2e-16
```

The resulting model is \[ y = 43{,}720.2 + 40{,}179.5 \, x + \varepsilon, \] where \(x\) is the median income for each area, \(y\) is the median house price for the area, and \(\varepsilon \sim \mathcal{N}(0, 74{,}180^2)\).
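As a quick sanity check, the fitted equation can be evaluated by hand; the income value 5 used here is just an arbitrary illustrative example:

```r
# Predicted median house value for an area with medianIncome = 5,
# plugging into the fitted equation from the summary output above
y_hat <- 43720.2 + 40179.5 * 5
y_hat  # 244617.7
```

The same value can be obtained from the model object with `predict(m, newdata = data.frame(medianIncome = 5))`.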

Regression diagnostics are discussed in sections 9 and 11 of the lecture notes. Here are some measures we have discussed (others are possible, too).

The \(R^2\) value states how much of the fluctuations in the output can be explained by the inputs, as a ratio between 0 and 1. Larger values indicate better model fit.

From the summary output above we see that our model has \(R^2 = 0.4139\). This means that the variance of the residuals is nearly \(60\%\) of the variance of the median house prices. Hopefully a better model can be found in task 2.
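The claim about the residual variance follows directly from the definition of \(R^2\): the fraction of unexplained variance is \(1 - R^2\).

```r
# fraction of the variance of medianHouseValue left unexplained by the model
r_squared <- 0.4139
1 - r_squared  # 0.5861, i.e. nearly 60%
```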

This is a plot of the residuals against the fitted values:

```
plot(fitted(m), resid(m), cex = .1)
abline(h = 0)
```

If the model fits well, the residuals in this plot should form a horizontal band of constant width, centred around 0. Here it seems that the variance of the residuals around \(\hat y = 200{,}000\) might be larger than near the boundaries, and possibly the mean of the residuals is negative for small fitted values. Judging from the plot, the model fit is not very good.

A Q-Q plot compares the empirical quantiles of the residuals to the quantiles of the fitted normal distribution. A straight line indicates good model fit.

```
qqnorm(resid(m), main = NULL)
qqline(resid(m))
```

From the plot it seems clear that the residuals are not normally distributed.
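If desired, the visual impression can be backed up with a formal normality test; a sketch using a Shapiro–Wilk test on a random subsample of the residuals (R's implementation accepts at most 5000 values), assuming the model object `m` from above:

```r
set.seed(1)  # for a reproducible subsample
shapiro.test(sample(resid(m), 5000))
```

A very small p-value here would confirm the departure from normality visible in the Q-Q plot.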

Methods for computing confidence intervals are discussed in section 7.1 of the lecture notes. Here we use the `summary(m)` output to get all required information, but different ways of computing confidence intervals can of course also be used.

`summary(m)`

```
## 
## Call:
## lm(formula = medianHouseValue ~ medianIncome, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -515117  -51344  -13978   36193  370777 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   43720.2     1919.1   22.78   <2e-16 ***
## medianIncome  40179.5      482.1   83.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74180 on 9836 degrees of freedom
## Multiple R-squared:  0.4139, Adjusted R-squared:  0.4139 
## F-statistic:  6947 on 1 and 9836 DF,  p-value: < 2.2e-16
```

```
alpha <- 0.05
n <- nrow(train)                 # number of observations
p <- 1                           # number of predictors
t <- qt(1 - alpha/2, n - p - 1)  # critical value of the t-distribution
cat("[", 43720.2 - 1919.1*t, ", ", 43720.2 + 1919.1*t, "]\n", sep = "")
```

`## [39958.37, 47482.03]`
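Alternatively, R's built-in `confint()` computes the same interval directly from the model object; this is one of the "different ways" mentioned above:

```r
# 95% confidence interval for the intercept, taken from the model object
confint(m, "(Intercept)", level = 0.95)
```

Up to rounding of the coefficient and standard error, this agrees with the interval computed by hand above.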


**Marking Criteria**

- The data has been loaded correctly.
- The resulting model is correct and explicitly stated (including the error variance).
- There is some discussion of the model fit, including at least one of the measures discussed in the lecture notes.
- The confidence interval is computed correctly.


This task is very open-ended, and I expect that different students will come up with different solutions. Here I will just give some ideas and examples.

As a baseline, we fit the model which uses all input variables, without any transformations or interaction terms:

```
m0 <- lm(medianHouseValue ~ ., data = train)
summary(m0)
```

```
## 
## Call:
## lm(formula = medianHouseValue ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -542334  -38926   -8820   29096  365582 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -3.345e+06  7.877e+04 -42.467  < 2e-16 ***
## longitude        -3.981e+04  8.983e+02 -44.312  < 2e-16 ***
## latitude         -3.934e+04  8.438e+02 -46.616  < 2e-16 ***
## housingMedianAge  9.895e+02  5.606e+01  17.650  < 2e-16 ***
## totalRooms       -7.969e+00  1.030e+00  -7.740 1.09e-14 ***
## totalBedrooms     8.901e+01  9.238e+00   9.635  < 2e-16 ***
## population       -3.693e+01  1.527e+00 -24.178  < 2e-16 ***
## households        6.672e+01  9.984e+00   6.683 2.47e-11 ***
## medianIncome      3.930e+04  5.332e+02  73.699  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60790 on 9829 degrees of freedom
## Multiple R-squared:  0.6067, Adjusted R-squared:  0.6064 
## F-statistic:  1895 on 8 and 9829 DF,  p-value: < 2.2e-16
```

The adjusted \(R^2\) value is \(0.6064\), a substantial improvement over the model in task 1 (where we got 0.4139). We also check the residual plot for this model:

```
plot(fitted(m0), resid(m0), cex = .1)
abline(h = 0)
```