##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
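A minimal sketch of the call behind this summary (assuming the fitted model is stored as mod, the name used later for fitted values and residuals):
mod <- lm(mpg ~ hp, data = mtcars)
summary(mod)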
require(ggplot2)
require(dplyr) # provides the %>% pipe used below
mtcars %>% ggplot(aes(x = hp, y = mpg)) +
  geom_point(color = 'royalblue', size = 2) +
  geom_smooth(
    formula = 'y ~ x',
    method = 'lm',
    color = "red",
    size = 0.5,
    se = FALSE) +
  xlab("Horsepower") +
  ylab("Miles per Gallon") +
  ggtitle("Miles per Gallon vs. Horsepower")
Our regression line is represented in red.
mtcars %>% ggplot(aes(x = hp, y = mpg)) +
  geom_point(aes(color = 'Actual Values'), size = 2) +
  geom_smooth(
    formula = 'y ~ x',
    method = 'lm',
    color = "grey30",
    size = 0.5,
    se = FALSE) +
  geom_segment(
    aes(xend = hp, yend = mod$fitted.values, col = 'Residuals'),
    size = 0.2) +
  # theme_minimal() +
  scale_color_manual("", values = c('royalblue', 'red')) +
  xlab("Horsepower") +
  ylab("Miles per Gallon") +
  ggtitle("Fitted (Modeled) Values and Residuals")
We can measure the ‘fit’ of our model by assessing the residuals.
OLS works by minimizing the sum of squared residuals (\(SS_R\)):
\[SS_R = \sum\limits_{i=1}^{n}(y_i-\hat{y_i})^2\]
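For the model above, \(SS_R\) can be computed directly from the residuals (a sketch, assuming mod is the fitted lm object):
sum(residuals(mod)^2) # equivalently: sum((mtcars$mpg - fitted(mod))^2)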
The highest residual is 8.2360, for the Toyota Corolla:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.7121 -2.1122 -0.8854 0.0000 1.5819 8.2360
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
The predicted value, based on our model, says that our mpg should be 25.66402. This is 8.2360 below the observed value.
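A sketch of the prediction (the Corolla's hp is 65; passing an unnamed row is why the output below is labeled '1'):
predict(mod, newdata = data.frame(hp = 65))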
## 1
## 25.66402
Large positive residuals indicate that our model underpredicts.
The lowest residual is -5.7121, for the Cadillac Fleetwood:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.7121 -2.1122 -0.8854 0.0000 1.5819 8.2360
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
The predicted value, based on our model, says that our mpg should be 16.11206. This is 5.7121 above the observed value.
## 1
## 16.11206
Large negative residuals indicate that our model overpredicts!
Residuals are expected: the goal of OLS is to minimize the total squared residuals in the model, not necessarily to create the 'best predictor'.
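A sketch of the two-predictor model summarized below (the name mod2 is an assumption):
mod2 <- lm(mpg ~ hp + wt, data = mtcars)
summary(mod2)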
##
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
Should we always transform our data to a normal distribution?
How do we select independent variables?
Multicollinearity:
## hp wt
## 1.766625 1.766625
Our independent variables must be exactly that: independent of one another!
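The values above are variance inflation factors (VIF). One common way to compute them is car::vif(); a sketch, since the document doesn't show which function produced them:
require(car)
vif(lm(mpg ~ hp + wt, data = mtcars))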
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
Great \(R^2\), but none of our independent variables are significant!
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
Rule of thumb for VIF: under 10 is a liberal threshold, under 5 is conservative, and under 4 is strict.
We will create a model predicting median household income as a function of educational attainment (the percentage of the population 25+ with a bachelor's degree or higher) and average household size in the Portland metro area.
suppressPackageStartupMessages(require(sf))
suppressPackageStartupMessages(require(tidycensus))
suppressPackageStartupMessages(require(tmap))
suppressMessages(tmap_mode('plot'))
cbg <- get_acs(
  geography = 'block group',
  state = "OR",
  county = c("Multnomah", "Clackamas", "Washington"),
  variables = c('B99121_001', 'B99152_001', 'B19013_001', 'B25010_001'),
  output = 'wide',
  geometry = TRUE) %>%
  mutate(
    edu = B99152_001E / B99121_001E,
    mhi = B19013_001E,
    hhs = B25010_001E
  )
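A sketch of the model behind the summary below (reusing the name mod, whose residuals we map shortly):
mod <- lm(mhi ~ edu + hhs, data = cbg)
summary(mod)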
##
## Call:
## lm(formula = mhi ~ edu + hhs, data = cbg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59439 -13112 -165 12403 84948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62005 4318 -14.36 <2e-16 ***
## edu 132531 3514 37.71 <2e-16 ***
## hhs 35265 1433 24.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20330 on 1025 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.6085, Adjusted R-squared: 0.6078
## F-statistic: 796.6 on 2 and 1025 DF, p-value: < 2.2e-16
## edu hhs
## 1.117701 1.117701
Multicollinearity is under control!
Our residuals don’t appear to have any strong pattern to them!
The same plot, with ggplot2:
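The figure itself isn't reproduced here; a minimal sketch of a fitted-versus-residual plot along these lines (assuming mod is the lm object above):
data.frame(fitted = fitted(mod), resid = resid(mod)) %>%
  ggplot(aes(x = fitted, y = resid)) +
  geom_point(color = 'royalblue') +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  xlab("Fitted MHI") +
  ylab("Residual")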
Spatial Autocorrelation!
suppressPackageStartupMessages(require(spdep))
suppressPackageStartupMessages(require(spatialreg)) # This is new to me!
cbg_nb <- poly2nb(pl = cbg, queen = TRUE)
cbg_W <- nb2listw(cbg_nb, style = "W")
cbg_moran_global <- moran.mc(x = cbg$mhi, listw = cbg_W, nsim = 999, na.action = na.omit)
cbg_moran_global
##
## Monte-Carlo simulation of Moran I
##
## data: cbg$mhi
## weights: cbg_W
## omitted: 230, 428, 432, 459, 471, 615, 640, 709, 782, 847, 942, 976
## number of simulations + 1: 1000
##
## statistic = 0.50705, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
Our dependent variable (median household income) has positive spatial autocorrelation!
# Add in the residuals.
# We have to take into account that lm() has dropped entries where any value is NA!
cbg$resid_lm[complete.cases(cbg %>% st_drop_geometry())] <- mod$residuals
tm_shape(cbg) +
  tm_polygons(
    col = 'resid_lm',
    lwd = 0.2,
    border.col = 'grey80',
    title = "Error Quintiles (USD)",
    palette = 'RdBu',
    n = 5,
    midpoint = 0,
    contrast = c(0, 1),
    style = 'quantile') +
  tm_layout(main.title = 'Median Household Income - OLS Residuals')
suppressMessages(tmap_mode('view'))
tmap_options(basemaps = c("Stamen.Toner", tmap_options()$basemaps))
tm_shape(cbg) +
  tm_polygons(
    col = 'resid_lm',
    lwd = 0.2,
    border.col = 'grey80',
    title = "Error Quintiles (USD)",
    palette = 'RdBu',
    n = 5,
    alpha = 0.6,
    midpoint = 0,
    contrast = c(0, 1),
    style = 'quantile') +
  tm_tiles("Stamen.TonerLabels") +
  tm_layout(main.title = 'Median Household Income - OLS Residuals')
Low values represent areas where our model overpredicted MHI, whereas high values are areas where our model underpredicted MHI.
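A sketch of the Moran's I test on those residuals, mirroring the earlier moran.mc() call:
moran.mc(x = cbg$resid_lm, listw = cbg_W, nsim = 999, na.action = na.omit)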
##
## Monte-Carlo simulation of Moran I
##
## data: cbg$resid_lm
## weights: cbg_W
## omitted: 230, 428, 432, 459, 471, 530, 615, 640, 709, 782, 847, 942, 976
## number of simulations + 1: 1000
##
## statistic = 0.29454, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
Even our residuals show significant clustering!
What did we just do?
We’re looking at a spatial representation of a statistical model’s accuracy!
If only we had a way to account for spatial autocorrelation in regression models…
\[Y = \beta_0 + X_n\beta_n + \epsilon\]
where…
\(Y\) = Dependent Variable
\(\beta_0\) = Intercept
\(X_n\) = Independent Variable \(n\)
\(\beta_n\) = Coefficient \(n\)
\(\epsilon\) = Error term - assumed to be uncorrelated between two locations!
\[Y = \beta_0 + X_n\beta_n + \rho WY + \epsilon\]
where…
\(\rho WY\): \(\rho\) is the autoregressive coefficient and \(WY\) is the spatial lag of the dependent variable
\[Y = \beta_0 + X_n\beta_n + \lambda W\epsilon_A + \epsilon_R\]
where…
\(\lambda\) is a measure of the strength of autocorrelation in the error
\(W\) is the spatial weights matrix linking pairs of locations
\(\epsilon_A\) is the autoregressive portion of the error
\(\epsilon_R\) is the random portion of the error (there’s always error!)
“Does educational attainment have an impact on median household income?”
cbg %>% ggplot(aes(x = edu, y = mhi)) +
  geom_point() +
  geom_smooth(
    method = 'lm',
    formula = 'y ~ x') +
  xlab("Educational Attainment") +
  ylab("Median Household Income") +
  ggtitle("Correlation of Variables")
## [1] 0.6138181
cbg %>% ggplot(aes(x = hhs, y = mhi)) +
  geom_point() +
  geom_smooth(
    method = 'lm',
    formula = 'y ~ x') +
  xlab("Household Size") +
  ylab("Median Household Income") +
  ggtitle("Correlation of Variables")
## [1] 0.2554926
cbg %>% ggplot(aes(x = hhs, y = edu)) +
  geom_point() +
  geom_smooth(
    method = 'lm',
    formula = 'y ~ x') +
  xlab("Household Size") +
  ylab("Educational Attainment") +
  ggtitle("Correlation of Variables")
## [1] -0.3201357
We can create a variable in our dataset for log(mhi):
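A sketch of that transformation:
cbg <- cbg %>% mutate(log_mhi = log(mhi))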
##
## Monte-Carlo simulation of Moran I
##
## data: cbg$mhi
## weights: cbg_W
## omitted: 230, 428, 432, 459, 471, 615, 640, 709, 782, 847, 942, 976
## number of simulations + 1: 1000
##
## statistic = 0.50705, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
##
## Monte-Carlo simulation of Moran I
##
## data: cbg$log_mhi
## weights: cbg_W
## omitted: 230, 428, 432, 459, 471, 615, 640, 709, 782, 847, 942, 976
## number of simulations + 1: 1000
##
## statistic = 0.48017, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
We can choose between these specifications using the Lagrange multiplier diagnostics in spdep.
First, we need to build a non-spatial OLS model!
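A sketch, reusing the name mod:
mod <- lm(log_mhi ~ edu + hhs, data = cbg)
summary(mod)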
##
## Call:
## lm(formula = log_mhi ~ edu + hhs, data = cbg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28276 -0.16608 0.02716 0.18399 0.70359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.36067 0.05724 163.54 <2e-16 ***
## edu 1.66916 0.04659 35.83 <2e-16 ***
## hhs 0.46610 0.01900 24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2695 on 1025 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.5891, Adjusted R-squared: 0.5883
## F-statistic: 734.9 on 2 and 1025 DF, p-value: < 2.2e-16
Calculating VIF:
## edu hhs
## 1.117701 1.117701
Looks good!
Now we run the test!
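A sketch of the call behind the five printouts below (lm.LMtests() in spdep runs all of them with test = 'all'; note the model observations must line up with the weights object, so the NA handling may need care):
spdep::lm.LMtests(mod, listw = cbg_W, test = 'all')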
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log_mhi ~ edu + hhs, data = cbg)
## weights: cbg_W
##
## LMerr = 243.66, df = 1, p-value < 2.2e-16
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log_mhi ~ edu + hhs, data = cbg)
## weights: cbg_W
##
## LMlag = 199.98, df = 1, p-value < 2.2e-16
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log_mhi ~ edu + hhs, data = cbg)
## weights: cbg_W
##
## RLMerr = 61.063, df = 1, p-value = 5.551e-15
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log_mhi ~ edu + hhs, data = cbg)
## weights: cbg_W
##
## RLMlag = 17.385, df = 1, p-value = 3.052e-05
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log_mhi ~ edu + hhs, data = cbg)
## weights: cbg_W
##
## SARMA = 261.04, df = 2, p-value < 2.2e-16
The output is easier to scan when wrapped in summary():
## Lagrange multiplier diagnostics for spatial dependence
## data:
## model: lm(formula = log_mhi ~ edu + hhs, data = cbg)
## weights: cbg_W
##
## statistic parameter p.value
## LMerr 243.659 1 < 2.2e-16 ***
## LMlag 199.981 1 < 2.2e-16 ***
## RLMerr 61.063 1 5.551e-15 ***
## RLMlag 17.385 1 3.052e-05 ***
## SARMA 261.044 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the Error and Lag LM values are significant.
However, the robust Lagrange multiplier statistic for the error model (RLMerr = 61.063) is much larger than that of the lag model (RLMlag = 17.385)!
We have concluded that a Spatial Error Model is the most appropriate for our data!
Almost as simple as a standard OLS model:
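A sketch of the fit; the Call echoed below confirms the function and arguments:
mod_err <- spatialreg::errorsarlm(
  formula = log_mhi ~ edu + hhs,
  data = cbg,
  listw = cbg_W,
  na.action = na.omit
)
summary(mod_err, Nagelkerke = TRUE)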
##
## Call:spatialreg::errorsarlm(formula = formula, data = data, listw = listw,
## na.action = na.action, Durbin = Durbin, etype = etype, method = method,
## quiet = quiet, zero.policy = zero.policy, interval = interval,
## tol.solve = tol.solve, trs = trs, control = control)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.084497 -0.141398 0.015277 0.162513 0.722962
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.652549 0.061550 156.826 < 2.2e-16
## edu 1.698418 0.062326 27.250 < 2.2e-16
## hhs 0.344817 0.019618 17.576 < 2.2e-16
##
## Lambda: 0.5942, LR test value: 202.14, p-value: < 2.22e-16
## Asymptotic standard error: 0.035131
## z-value: 16.914, p-value: < 2.22e-16
## Wald statistic: 286.08, p-value: < 2.22e-16
##
## Log likelihood: -8.23079 for error model
## ML residual variance (sigma squared): 0.055488, (sigma: 0.23556)
## Nagelkerke pseudo-R-squared: 0.66248
## Number of observations: 1028
## Number of parameters estimated: 5
## AIC: 26.462, (AIC for lm: 226.6)
That’s a lot of information, let’s break it down…
The coefficients (with asymptotic standard errors):

Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.652549 0.061550 156.826 < 2.2e-16
edu 1.698418 0.062326 27.250 < 2.2e-16
hhs 0.344817 0.019618 17.576 < 2.2e-16

The spatial parameter:

Lambda: 0.5942, LR test value: 202.14, p-value: < 2.22e-16
Asymptotic standard error: 0.035131
z-value: 16.914, p-value: < 2.22e-16
Wald statistic: 286.08, p-value: < 2.22e-16

The model fit:

Log likelihood: -8.23079 for error model
ML residual variance (sigma squared): 0.055488, (sigma: 0.23556)
Nagelkerke pseudo-R-squared: 0.66248
Number of observations: 1028
Number of parameters estimated: 5
AIC: 26.462, (AIC for lm: 226.6)
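For comparison, a sketch of the spatial lag model whose summary follows:
mod_lag <- spatialreg::lagsarlm(
  formula = log_mhi ~ edu + hhs,
  data = cbg,
  listw = cbg_W,
  na.action = na.omit
)
summary(mod_lag, Nagelkerke = TRUE)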
##
## Call:spatialreg::lagsarlm(formula = formula, data = data, listw = listw,
## na.action = na.action, Durbin = Durbin, type = type, method = method,
## quiet = quiet, zero.policy = zero.policy, interval = interval,
## tol.solve = tol.solve, trs = trs, control = control)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.214816 -0.147020 0.016532 0.167160 0.779488
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.055071 0.322998 15.650 < 2.2e-16
## edu 1.246928 0.056205 22.186 < 2.2e-16
## hhs 0.373731 0.018033 20.724 < 2.2e-16
##
## Rho: 0.42043, LR test value: 166.91, p-value: < 2.22e-16
## Asymptotic standard error: 0.03101
## z-value: 13.558, p-value: < 2.22e-16
## Wald statistic: 183.82, p-value: < 2.22e-16
##
## Log likelihood: -25.84458 for lag model
## ML residual variance (sigma squared): 0.059636, (sigma: 0.2442)
## Nagelkerke pseudo-R-squared: 0.65071
## Number of observations: 1028
## Number of parameters estimated: 5
## AIC: 61.689, (AIC for lm: 226.6)
## LM test for residual autocorrelation
## test value: 15.423, p-value: 8.5939e-05
spatialreg::impacts() can tell us a more complete story of the spatial lag coefficients. First we need to alter the weights matrix to remove zero-linked weights (lagsarlm did this for us):
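A sketch of that workflow; the subset object names are assumptions, though cbg_W_sub reappears in the Moran test later:
cbg_sub <- cbg[complete.cases(st_drop_geometry(cbg)), ] # the rows lagsarlm() kept
cbg_nb_sub <- poly2nb(cbg_sub, queen = TRUE)
cbg_W_sub <- nb2listw(cbg_nb_sub, style = 'W', zero.policy = TRUE)
spatialreg::impacts(mod_lag, listw = cbg_W_sub)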
## Impact measures (lag, exact):
## Direct Indirect Total
## edu 1.2904099 0.8610496 2.1514595
## hhs 0.3867635 0.2580750 0.6448385
The previous lag exercise was just that: an exercise. We can compare it to the Error model to see which is best:
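A sketch of the comparison (AIC() works on sarlm objects because they carry a log-likelihood):
AIC(mod_lag, mod_err)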
## df AIC
## mod_lag 5 61.68916
## mod_err 5 26.46158
The Lagrange Multiplier test led us down the correct path! It told us before we began modeling that the Spatial Error Model would have better performance.
RMSE is a standard accuracy metric for geographic modeling:
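A sketch of the computation; for the error model this matches the ML sigma reported in the summary:
sqrt(mean(mod_err$residuals^2))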
## [1] 0.2355584
This is like saying "our model is accurate to within ±0.24 log(MHI)".
Remember, we saw significant clustering of values in the OLS model residuals:
##
## Monte-Carlo simulation of Moran I
##
## data: cbg$resid_lm
## weights: cbg_W
## omitted: 230, 428, 432, 459, 471, 530, 615, 640, 709, 782, 847, 942, 976
## number of simulations + 1: 1000
##
## statistic = 0.29454, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
We can also test for this in the Error model:
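A sketch of the call (cbg_W_sub covers only the complete cases, so na.action is not needed):
moran.mc(x = mod_err$residuals, listw = cbg_W_sub, nsim = 999)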
##
## Monte-Carlo simulation of Moran I
##
## data: mod_err$residuals
## weights: cbg_W_sub
## number of simulations + 1: 1000
##
## statistic = -0.037667, observed rank = 21, p-value = 0.979
## alternative hypothesis: greater
We fail to reject \(H_0\) - the spatial patterns in the error are gone!
complete_records <- complete.cases(cbg %>% st_drop_geometry()) # define this first
# Place the SEM residuals into the cbg object where the record is complete.
# Indexing by complete_records keeps the residuals aligned with the right rows;
# ifelse() would recycle the shorter residual vector and misalign it.
cbg$resid_err <- NA
cbg$resid_err[complete_records] <- mod_err$residuals
# Build an OLS model to compare against
mod_log <- lm(log(mhi) ~ edu + hhs, data = cbg)
# Place the lm() residuals into the cbg object where the record is complete
cbg$lm_err <- NA
cbg$lm_err[complete_records] <- mod_log$residuals
tm_shape(cbg) +
  tm_polygons(
    col = c('resid_err', 'lm_err'),
    lwd = 0.2,
    border.col = 'grey80',
    title = c("SEM Error Quintiles (USD)", "OLS Error Quintiles (USD)"),
    palette = 'RdBu',
    n = 5,
    alpha = 0.6,
    midpoint = 0,
    contrast = c(0, 1),
    style = 'quantile') +
  tm_tiles("Stamen.TonerLabels")
Spatial Error Model Residuals
OLS Residuals
## [1] "R-squared: 0.0316"
Looks random, which is a good sign!
Alternatively, we have a more robust test for heteroskedasticity: the Breusch-Pagan test.
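A sketch of the test on the error model (bptest.sarlm() has lived in spdep and, in newer versions, spatialreg; see the help page referenced below):
bptest.sarlm(mod_err)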
##
## studentized Breusch-Pagan test
##
## data:
## BP = 25.334, df = 2, p-value = 3.153e-06
We reject \(H_0\): heteroskedasticity is present! This is different from our 'eyeballing' on the last slide.
See ?bptest.sarlm() for more info. There is still potential local clustering, but we have resolved enough to account for spatial autocorrelation in our regression model!
“Error terms across different spatial units are correlated.”
Adapted from “Spatial Regression with GeoDa”
“Variable y in place i is affected by the independent variables in both place i and j.”
Adapted from “Spatial Regression with GeoDa”
… we looked at everything in relation to everything else?
“Variable y in place i is affected by the independent variables in both place i and j, as well as the lag for the independent variables in both place i and j.”
https://www.jstor.org/stable/2983884?seq=1#metadata_info_tab_contents
When we know there are:
Basically…
IT DEPENDS
mod_durbin <- spatialreg::lagsarlm(
  formula = log_mhi ~ edu + hhs,
  data = cbg, # the entire object
  listw = cbg_W,
  na.action = na.omit,
  type = "Durbin"
)
All we have to do is add type = "Durbin" to lagsarlm()!
##
## Call:spatialreg::lagsarlm(formula = formula, data = data, listw = listw,
## na.action = na.action, Durbin = Durbin, type = type, method = method,
## quiet = quiet, zero.policy = zero.policy, interval = interval,
## tol.solve = tol.solve, trs = trs, control = control)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0791339 -0.1341632 0.0081202 0.1595892 0.8007955
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.227153 0.353636 11.9534 < 2.2e-16
## edu 1.759111 0.073700 23.8686 < 2.2e-16
## hhs 0.318782 0.019940 15.9868 < 2.2e-16
## lag.edu -0.839124 0.114354 -7.3379 2.169e-13
## lag.hhs 0.032655 0.038075 0.8576 0.3911
##
## Rho: 0.51058, LR test value: 159.94, p-value: < 2.22e-16
## Asymptotic standard error: 0.03848
## z-value: 13.269, p-value: < 2.22e-16
## Wald statistic: 176.06, p-value: < 2.22e-16
##
## Log likelihood: 17.32267 for mixed model
## ML residual variance (sigma squared): 0.053896, (sigma: 0.23215)
## Nagelkerke pseudo-R-squared: 0.67885
## Number of observations: 1028
## Number of parameters estimated: 7
## AIC: -20.645, (AIC for lm: 137.29)
## LM test for residual autocorrelation
## test value: 39.571, p-value: 3.1631e-10
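A sketch of the three-way comparison behind the table below:
AIC(mod_lag, mod_err, mod_durbin)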
## df AIC
## mod_lag 5 61.68916
## mod_err 5 26.46158
## mod_durbin 7 -20.64534
The Spatial Durbin Model may be our best bet, even when accounting for the added complexity!
lapply(
  X = list(mod_err, mod_lag, mod_durbin),
  FUN = function(i) data.frame(model = i$type, `NK R2` = summary(i, Nagelkerke = TRUE)$NK)
) %>%
  bind_rows()
## model NK.R2
## 1 error 0.6624771
## 2 lag 0.6507103
## 3 mixed 0.6788467
Again, our Nagelkerke pseudo-R-squared suggests that the Durbin model explains the most variance in median household income.
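The impact measures below come from the same impacts() call, applied to the Durbin ('mixed') model (a sketch):
spatialreg::impacts(mod_durbin, listw = cbg_W_sub)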
## Impact measures (mixed, exact):
## Direct Indirect Total
## edu 1.7655436 0.1142066 1.8797502
## hhs 0.3400725 0.3779974 0.7180699
# Place the Durbin residuals where the SEM residuals are non-missing,
# indexing (rather than using ifelse) to keep the rows aligned
cbg$resid_dur <- NA
cbg$resid_dur[!is.na(cbg$resid_err)] <- mod_durbin$residuals
tm_shape(cbg) +
  tm_polygons(
    col = 'resid_dur',
    lwd = 0.2,
    border.col = 'grey80',
    title = "Durbin Error Quintiles (USD)",
    palette = 'RdBu',
    n = 5,
    alpha = 0.6,
    midpoint = 0,
    contrast = c(0, 1),
    style = 'quantile') +
  tm_tiles("Stamen.TonerLabels")
Spatial Durbin Model Residuals
Spatial Error Model Residuals
Send me a DM on Slack telling me what you’d like to present on! Include the following, as best you can: