This notebook extends the analysis for Table VI from Snow’s 1856 “Cholera and the water supply …” using a difference-in-differences framework. This notebook builds on the ideas in the notebook “Snow1856_TableVI_Intro” but is stand-alone can be run separately.
With population data by sub-district and by water supply company (originally published in Simon 1856, “Report on the last two cholera-epidemics of London”), Snow was able to compare mortality at the sub-district level more closely than he could in his 1855 “On the mode of communication …”. In Table VI Snow used estimates of mortality rates for Southwark & Vauxhall versus Lambeth, combined with population fractions by sub-district, to calculate population-weighted predicted mortality by sub-district. Snow’s goal was to show that differences in water supply was the predominant factor - more important than crowding or other factors - in accounting for differences across sub-districts.
Snow, however, did not have the statistical tools and methodology available to us today, and his argument had neither the clarity nor the rigor we would demand today. This notebook uses a more appropriate statistical approach: regression that uses the population by sub-district and water company to simultaneously estimate and test differences across the water companies (and across time). The conclusion (not surprisingly) overwhelmingly supports Snow’s contention that dirty versus clean water is the most important factor in accounting for differences across sub-districts (and across time). As a digression I examine earlier efforts in the literature to address and extend Snow’s Table VI, discussing why those efforts fall short.
For a brief introduction to Snow’s work, see:
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. The results are also saved in a self-contained html document with the suffix .nb.html. If you want pure r code (for example to run outside RStudio) you can easily extract code with the command knit(‘notebook.Rmd’,tangle=TRUE) which will save a file ‘notebook.R’ under your working directory.
Try executing the chunk below by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
# Copyright (c) 2019, Thomas Coleman
#
# ------- Licensed under BSD 2-Clause "Simplified" License -------
#
# Results and discussion in "Causality in the Time of Cholera: John Snow as a Prototype
# for Causal Inference (Working Paper)" available at SSRN: https://papers.ssrn.com/abstract=3262234
rm(list=ls()) # starts a fresh workspace
#
library(knitr)
options(scipen=5)
# The following libraries are used for the Negative Binomial regression and the robust standard error analysis
#install.packages("sandwich")
#install.packages("lmtest")
library("MASS")
library("sandwich")
library("lmtest")
# Read in plot functions - see the R code or the notebook 'Snow1855_DiDRegression2_ErrorAnalysis.rmd' for a description of those function
source('SnowPlotFns.r')
# Read in the data from Snow 1855 "On the mode of communication of cholera"
tableviii <- read.csv(file="Snow1855_TableVIII.csv", header=TRUE, sep=",", skip=5,comment.char="#")
tablexii <- read.csv(file="Snow1855_TableXII.csv", header=TRUE, sep=",", skip=5,comment.char="#")
# Read in the data from John Snow 1856, "Cholera and the water supply in the south district of London in 1854",
# These data were copied from the 1936 book "Snow on cholera, being a reprint of two papers" edited by Frost
# Table V by District (for running Poisson & Neg Binomial count regressions)
# Table VI by sub-district (for running Koch & Denike's tests)
# Table I "Showing the results of the Author's personal Inquiry into Twenty-One Sub-Districts"
# Table II "Showing the results of Inquiry made by Mr. Whiting in Eleven Sub-Districts"
# (My "tablei_1856" combines Snow's Tables I & II)
tablei_1856 <- read.csv(file="Snow1856_TableI.csv",
header=TRUE, sep=",", skip=5,comment.char="#")
tableV_1856 <- read.csv(file="Snow1856_TableV.csv",
header=TRUE, sep=",", skip=5,comment.char="#")
tableVI_1856 <- read.csv(file="Snow1856_TableVI.csv",
header=TRUE, sep=",", skip=5,comment.char="#")
In Snow 1856 we have estimated houses and estimated population assigned to the two water supply companies. But there are actually three sources of water: Southwark & Vauxhall Co, Lambeth Co, and “Other” (pump-wells, Thames, ditches). There is considerable variation across sub-districts not only in the proportion supplied by Southwark versus Lambeth, but also in the total proportion supplied by water companies versus “other”.
tablei_1856$combined_pop <- tablei_1856$southwark_pop + tablei_1856$lambeth_pop
tablei_1856[c("subDistrict","pop1851","southwark_pop","lambeth_pop","combined_pop")]
The estimates published in Snow 1856 Tables I and II (displayed above) show that for St. Saviour, Southwark 82.9% was supplied by the Southwark water company, 4.6% by Lambeth, and 87.4% by the two combined. For St. Mary, Newington 21.3% was supplied by the Southwark Company, 39.1% by Lambeth, and only 60.3% by the two combined.
Note in passing that there are problems with the population estimates (which Snow highlighted). Some sub-districts that were in fact not supplied by the Lambeth Company (such as St. Saviour, Southwark) have incorrectly-reported Lambeth population. Some sub-districts (such as St. Mary Magdalen) have a larger combine Southwark + Lambeth population than the 1851 census population.
To account for the variation in observed sub-district mortality we need to measure the population differences for all three sources; Snow in Table VI only accounted for the first two. To do this we can model the mortality rate for a sub-district as the population-weighted average:
\(R_{subdis,yr} = R_S * P_S + R_L^{54} * P_L * I_{yr=1854} + R_L * P_L + R_O * P_O + \delta_{54} * I_{yr=1854} + \epsilon\)
To highlight and more easily test the difference between Southwark and Lambeth we can note that \(P_S = 1 - P_L - P_O\), use \(R_S\) as the overall constant, and write
\(R_{subdis,yr} = R_S + (R_L^{54}-R_S) * P_L * I_{yr=1854} + (R_L-R_S) * P_L + (R_O-R_S) * P_O + \delta_{54} * I_{yr=1854} + \epsilon\)
so that the estimated coefficients (\((R_L-R_S)\)) are the difference from the excluded (Southwark) category.
Just as for the notebook “Snow1855_DiDRegression1” we must stack the data from Snow 1855 Tables VIII and XII and create appropriate indicator variables. Here we also need to merge in the population estimates from Snow 1856 Tables I & II.
x1 <- subset(tableviii,supplier == "SouthwarkVauxhall" | supplier == "SouthwarkVauxhall_Lambeth")
x1849 <- x1[c("subDistrict","pop1851","supplier","lambethdegree")]
x1 <- subset(tablexii,supplier == "SouthwarkVauxhall" | supplier == "SouthwarkVauxhall_Lambeth")
x1849$deaths <- x1$deaths1849
x1849$rate <- 10000 * x1$deaths1849 / x1849$pop1851
x1849$seq <- c(seq(1,length(x1849$deaths)))
#x1849$dum1854 <- 0
xyear <- factor(c(rep(1849,28),rep(1854,28)))
x1849$year <- xyear[1:28]
x1854 <- x1849
#x1849$lambethdegree <- "dirty"
x1854$deaths <- x1$deaths1854
x1854$rate <- 10000 * x1$deaths1854 / x1849$pop1851
x1854$seq <- c(seq(1,length(x1849$deaths)))
#x1854$dum1854 <- 1
x1854$year <- xyear[29:56]
# First, re-generate "redgata" after pasting on population fractions.
# For 1849 and for Southwark-only sub-districts
# Lambeth population as percent of total (1851) population. Zero out the non-Lambeth sub-districts ("first 12")
# I want three variables:
# 1) Lambeth for both 1849 & 1854
# 2) Other for both 1849 & 1854
# these two will measure against Southwark for dirty water
# 3) Lambeth for 1854 - this should measure Lambeth for clean water
x1849$otherperc <- 0
x1854$otherperc <- 0
x1849$lambethperc <- 0
x1854$lambethperc <- 0
x1849$lambethperc54 <- 0
x1854$lambethperc54 <- 0
# This calculates lambeth as percent of TOTAL population. I think this is right for
# measuring the "Lambeth effect" because it gives the proportion of the population treated (with clean Lambeth water)
# which is the variation from 1849 to 1854
x1849$otherperc <- 1 - (tablei_1856[1:28,]$southwark_pop + tablei_1856[1:28,]$lambeth_pop ) /
tablei_1856[1:28,]$pop1851
x1854$otherperc <- x1849$otherperc
# Variable for only 1954 Lambeth
x1854[13:28,]$lambethperc54 <- tablei_1856[13:28,]$lambeth_pop / tablei_1856[13:28,]$pop1851
x1849$lambethperc <- x1854$lambethperc54
x1854$lambethperc <- x1849$lambethperc
# population per house seems to come in as factor - convert to numeric
x1849$pop_per_house <- as.numeric(as.character(tablei_1856$pop_per_house[1:28]))
x1854$pop_per_house <- as.numeric(as.character(tablei_1856$pop_per_house[1:28]))
regdata <- rbind(x1849,x1854)
regdata
With the dataframe “regdata” we can now run various regressions. To start just the simple OLS linear regression.
# Linear with no population density
linpropn <- lm(rate ~ lambethperc54 + lambethperc + otherperc + year, data=regdata )
linpropnrobustse <- coeftest(linpropn, vcov = vcovHC(linpropn))
summary(linpropn)
Call:
lm(formula = rate ~ lambethperc54 + lambethperc + otherperc +
year, data = regdata)
Residuals:
Min 1Q Median 3Q Max
-62.086 -18.678 -2.177 19.836 73.334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 161.788 9.075 17.827 < 2e-16 ***
lambethperc54 -126.532 27.008 -4.685 2.12e-05 ***
lambethperc -26.641 19.837 -1.343 0.185
otherperc -119.265 14.679 -8.125 9.27e-11 ***
year1854 11.834 11.043 1.072 0.289
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.73 on 51 degrees of freedom
Multiple R-squared: 0.678, Adjusted R-squared: 0.6528
F-statistic: 26.85 on 4 and 51 DF, p-value: 5.142e-12
The “Lambeth” or clean water effect is the coefficient “lambethperc54” = -126.5 - the mortality per 10,000 for Lambeth company customers, as a difference from Southwark & Vauxhall company customers. The z value of -4.7 indicates that this is statistically significant. The value itself is very large, telling us that in 1854 the mortality rate of Lambeth customers was -127 per 10,000 lower than for Southwark & Vauxhall customers - a very large difference given that the base Southwark & Vauxhall mortality was 162.
The adjusted R-squared tests whether variations in the population proportions and differences in water supply account for much of the observed variation in mortality, exactly what Snow was trying to test. The high R-squared shows that indeed differences in mortality between water suppliers accounts for much of the observed variation.
The population estimates published in 1856 also included housing density estimates. One popular and very reasonable hypothesis was that crowding led to higher cholera mortality. We can test the effect of differences in crowding, within this region of south London, by including housing density as a regressor.
# Linear with population density
linpropn_den <- lm(rate ~ lambethperc54 + lambethperc + otherperc + year
+ pop_per_house , data=regdata)
linpropn_denrobustse <- coeftest(linpropn_den, vcov = vcovHC(linpropn_den))
summary(linpropn_den)
Call:
lm(formula = rate ~ lambethperc54 + lambethperc + otherperc +
year + pop_per_house, data = regdata)
Residuals:
Min 1Q Median 3Q Max
-51.80 -20.13 -2.43 17.38 74.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.716 42.080 1.871 0.0673 .
lambethperc54 -126.532 26.229 -4.824 0.00001360 ***
lambethperc -32.290 19.467 -1.659 0.1034
otherperc -96.663 18.126 -5.333 0.00000233 ***
year1854 11.834 10.724 1.103 0.2751
pop_per_house 11.679 5.785 2.019 0.0489 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29.84 on 50 degrees of freedom
Multiple R-squared: 0.7023, Adjusted R-squared: 0.6725
F-statistic: 23.59 on 5 and 50 DF, p-value: 4.262e-12
Doing so has virtually effect on the Lambeth coefficient - it is still -126.5. Housing density is marginally significant but adds little to the overall explanatory power of the regression - the Adjusted R-Squared increases from 0.653 to 0.673.
We can also include fixed effects for each sub-district. Including housing density asks “does the difference in density across sub-districts help explain mortality, or reduce the importance of clean water for Lambeth customers?” Including fixed effects asks a dramatically more general question: “are there any differences across sub-districts that help explain mortality or reduce the importance of clean water?” The answer is basically “No”.
# Linear with fixed effects
linpropn_FE <- lm(rate ~ lambethperc54 + lambethperc + otherperc + year
+ subDistrict, data=regdata )
linpropn_FErobustse <- coeftest(linpropn_FE, vcov = vcovHC(linpropn_FE))
summary(linpropn_FE)
Call:
lm(formula = rate ~ lambethperc54 + lambethperc + otherperc +
year + subDistrict, data = regdata)
Residuals:
Min 1Q Median 3Q Max
-33.59 -10.09 0.00 10.09 33.59
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 129.8562 117.2235 1.108 0.2781
lambethperc54 -126.5322 20.1596 -6.277 0.00000121 ***
lambethperc 7.4768 187.4058 0.040 0.9685
otherperc -5.9294 312.1370 -0.019 0.9850
year1854 11.8339 8.2426 1.436 0.1630
subDistrictBorough Road 71.2452 38.0904 1.870 0.0727 .
subDistrictBrixton -54.5466 76.2662 -0.715 0.4809
subDistrictCamberwell 0.7515 33.1530 0.023 0.9821
subDistrictChristchurch, Southwark 25.4153 40.1630 0.633 0.5324
subDistrictClapham -46.2542 69.5678 -0.665 0.5120
subDistrictKennington (1st) -27.1488 29.9296 -0.907 0.3727
subDistrictKennington (2nd) -39.1963 34.0302 -1.152 0.2599
subDistrictKent Road -1.3322 53.3028 -0.025 0.9803
subDistrictLambeth Church (1st) -16.1376 39.8275 -0.405 0.6887
subDistrictLambeth Church (2nd) 35.8242 34.2336 1.046 0.3050
subDistrictLeather Market 15.6610 114.2659 0.137 0.8920
subDistrictLondon Road -0.5430 66.8909 -0.008 0.9936
subDistrictPeckham -63.2201 105.0052 -0.602 0.5523
subDistrictPutney -113.8284 193.0655 -0.590 0.5606
subDistrictRotherhithe 44.1273 29.1176 1.515 0.1417
subDistrictSt. George, Camberwell -17.1792 70.6785 -0.243 0.8099
subDistrictSt. James, Bermondsey 24.3172 199.3923 0.122 0.9039
subDistrictSt. John, Horsleydown 14.9186 65.2125 0.229 0.8208
subDistrictSt. Mary Magdalen 43.3061 191.8700 0.226 0.8232
subDistrictSt. Mary, Newington -27.8881 82.4176 -0.338 0.7378
subDistrictSt. Olave, Southwark 62.0648 146.2666 0.424 0.6748
subDistrictSt. Peter, Walworth 24.8751 19.1664 1.298 0.2057
subDistrictSt. Saviour, Southwark 30.8852 80.1788 0.385 0.7032
subDistrictTrinity, Newington 13.4145 19.7972 0.678 0.5040
subDistrictWandsworth -49.3044 165.1611 -0.299 0.7677
subDistrictWaterloo Road (1st) NA NA NA NA
subDistrictWaterloo Road (2nd) NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.93 on 26 degrees of freedom
Multiple R-squared: 0.9086, Adjusted R-squared: 0.8066
F-statistic: 8.907 on 29 and 26 DF, p-value: 0.0000001375
The Adjusted R-Squared increases from 0.653 to 0.807, but the value of the Lambeth effect is unchanged and in fact is now estimated even more precisely: t value -6.3. Yes there are differences in mortality across sub-districts that are not explained only by differences in water supplier, but water supply still has a dramatically large and precisely-estimated effect.
The OLS regression above is not really appropriate because observed variables are counts - positive and integer-valued rather than the continuous rates used in the above regessions. For the observed counts the regression residuals cannot be continuous normal variables. We need to turn to Poisson or Negative Binomial regression.
The default for counts (Poisson and Negative Binomial regressions) is to estimate in log form, as discussed below. Here, howerver, we will use a linear relation between the rates and regressors:
\(R_{subdis,yr} = R_S + (R_L^{54}-R_S) * P_L * I_{yr=1854} + (R_L-R_S) * P_L + (R_O-R_S) * P_O + \delta_{54} * I_{yr=1854}\)
The rate is the count divided by the population:
\(R_{subdis,yr} = Count_{subdis,yr} / Pop_{subdis,yr}\)
To transform from the rate equation to counts we need to multiply all the regressors by the population:
\(Count_{subdis,yr} = R_S*Pop_{subdis,yr} + (R_L^{54}-R_S) * \tilde{P}_L * I_{yr=1854} + (R_L-R_S) * \tilde{P}_L + (R_O-R_S) * \tilde{P}_O + \delta_{54} * \tilde{I}_{yr=1854} + \epsilon\)
where I am using \(\tilde{P_L} ...\) to mean \(P_L * Pop_{subdis,yr}\). (This also means running the regression with no intercept and using \(Pop_{subdis,yr}\) as a regressor.)
The following code chunk runs a Poisson and Negative Binomial regression but displays only the Negative Binomial, since as usual the Poisson regression shows overdispersion (does not fit the data well) so that we cannot rely on the Poisson standard errors.
# For running linear count regressions, we need to weight all the regressors by the population (since we can't use the "offset()" method)
regdata_lin <- regdata
regdata_lin$yearind <- 1.0*(regdata$year == "1854")
regdata_lin$pop1851 <- regdata_lin$pop1851 / 10000
regdata_lin[,c("yearind","otherperc","lambethperc","lambethperc54","pop_per_house")] <-
regdata_lin[,c("yearind","otherperc","lambethperc","lambethperc54","pop_per_house")] * regdata_lin$pop1851
# Poisson regression with linear (identity) link function
pois1propn_lin <- glm(deaths ~ pop1851 + lambethperc54 + lambethperc + otherperc + yearind
- 1 , family=poisson(link="identity"), data=regdata_lin)
pois1propn_linrobustse <- coeftest(pois1propn_lin, vcov = vcovHC(pois1propn_lin))
#summary(pois1propn_lin)
#logLik(pois1propn_lin)
# Poisson FE regression with linear (identity) link function
pois2propn_lin <- glm(deaths ~ pop1851 + lambethperc54 + lambethperc + otherperc + yearind
+subDistrict - 1 , family=poisson(link="identity"), data=regdata_lin)
pois2propn_linrobustse <- coeftest(pois2propn_lin, vcov = vcovHC(pois2propn_lin))
#summary(pois2propn_lin)
#logLik(pois2propn_lin)
# Negative Binomial with linear (identity) link function
nb1propn_lin <- glm.nb(deaths ~ pop1851 + lambethperc54 + lambethperc + otherperc + yearind
- 1, link = identity, data = regdata_lin)
nb1propn_linrobustse <- coeftest(nb1propn_lin, vcov = vcovHC(nb1propn_lin))
#summary(nb1propn_lin)
#logLik(nb1propn_lin)
# Negative Binomial with linear (identity) link function & population density
nb1propn_linden <- glm.nb(deaths ~ pop1851 + lambethperc54 + lambethperc + otherperc + yearind
+ pop_per_house - 1, link = identity, data = regdata_lin)
nb1propn_lindenrobustse <- coeftest(nb1propn_linden, vcov = vcovHC(nb1propn_linden))
#summary(nb1propn_linden)
#logLik(nb1propn_linden)
# Finally, FEs
nb2propn_lin <- glm.nb(deaths ~ pop1851 + lambethperc54 + lambethperc + otherperc + yearind
+ subDistrict - 1, link = identity, data=regdata_lin)
nb2propn_linrobustse <- coeftest(nb2propn_lin, vcov = vcovHC(nb2propn_lin))
#summary(nb2propn_lin)
#logLik(nb2propn_lin)
summary(nb1propn_linden)
Call:
glm.nb(formula = deaths ~ pop1851 + lambethperc54 + lambethperc +
otherperc + yearind + pop_per_house - 1, data = regdata_lin,
link = identity, init.theta = 15.5054521)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4701 -0.8190 -0.1024 0.5172 2.6013
Coefficients:
Estimate Std. Error z value Pr(>|z|)
pop1851 100.586 36.057 2.790 0.00528 **
lambethperc54 -131.311 23.348 -5.624 0.0000000186 ***
lambethperc -49.446 23.158 -2.135 0.03275 *
otherperc -127.752 15.198 -8.406 < 2e-16 ***
yearind 12.379 9.570 1.294 0.19582
pop_per_house 10.493 4.916 2.134 0.03280 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(15.5055) family taken to be 1)
Null deviance: Inf on 56 degrees of freedom
Residual deviance: 60.315 on 50 degrees of freedom
AIC: 601.17
Number of Fisher Scoring iterations: 1
Theta: 15.51
Std. Err.: 3.37
2 x log-likelihood: -587.17
This linear Negative Binomial regression gives results similar to the linear OLS regression above.
The linear count regressions incorporate the assumption that the random variables are counts, but linearity is not necessarily a good assumption. It imposes the assumption that the treatment effect reduces mortality rates by a fixed amount independent of the base mortality level. A proportional or ratio expression would be more reasonable. Unfortunately the standard estimation routines are based on a linear-in-logs form:
\(ln(R_{subdis,yr}) = R_S + (R_L^{54}-R_S) * P_L * I_{yr=1854} + (R_L-R_S) * P_L + (R_O-R_S) * P_O + \delta_{54} * I_{yr=1854} + \epsilon\)
It is thus difficult to impose exactly the linear relation between the rate and the population fractions. (We would need a non-linear Poisson and Negative Binomial fitting routine.) In spite of this we can run the count regressions and we find essentially the same as the linear regressions: water supply is dramatically important and the effect on mortality dramatically large.
The following code chunk runs a variety of Poisson and Negative Binomial regressions but displays only the Negative Binomial with housing density.
# Poisson with no FE and no housing density
pois1propn <- glm(deaths ~ lambethperc54 + lambethperc + otherperc + year
+ offset(log(pop1851)), family=poisson, data=regdata)
pois1propnrobustse <- coeftest(pois1propn, vcov = vcovHC(pois1propn))
#summary(pois1propn)
#logLik(pois1propn)
# Poisson with housing density
pois1propn_den <- glm(deaths ~ lambethperc54 + lambethperc + otherperc + year
+ pop_per_house + offset(log(pop1851)), family=poisson, data=regdata)
pois1propn_denrobustse <- coeftest(pois1propn_den, vcov = vcovHC(pois1propn_den))
#summary(pois1propn_den)
#logLik(pois1propn_den)
# Poisson with different rates by sub-district (fixed effects)
pois2propn <- glm(deaths ~ lambethperc54 + lambethperc + otherperc + year
+ subDistrict + offset(log(pop1851)), family=poisson, data=regdata)
pois2propnrobustse <- coeftest(pois2propn, vcov = vcovHC(pois2propn))
#summary(pois2propn)
#logLik(pois2propn)
# Negative Binomial with no FE and no housing density
nb1propn <- glm.nb(deaths ~ lambethperc54 + lambethperc + otherperc + year
+ offset(log(pop1851)), data=regdata)
nb1propnrobustse <- coeftest(nb1propn, vcov = vcovHC(nb1propn))
#summary(nb1propn)
#logLik(nb1propn)
# Also run with housing density
nb1propn_den <- glm.nb(deaths ~ lambethperc54 + lambethperc + otherperc + year
+ pop_per_house + offset(log(pop1851)), data=regdata)
nb1propn_denrobustse <- coeftest(nb1propn_den, vcov = vcovHC(nb1propn_den))
#summary(nb1propn_den)
#logLik(nb1propn_den)
# Finally, FEs
nb2propn <- glm.nb(deaths ~ lambethperc54 + lambethperc + otherperc + year
+ subDistrict + offset(log(pop1851)), data=regdata)
nb2propnrobustse <- coeftest(nb2propn, vcov = vcovHC(nb2propn))
#summary(nb2propn)
#logLik(nb2propn)
summary(nb1propn_den)
Call:
glm.nb(formula = deaths ~ lambethperc54 + lambethperc + otherperc +
year + pop_per_house + offset(log(pop1851)), data = regdata,
init.theta = 11.54058123, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.97050 -0.70669 -0.06242 0.58061 2.01883
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.57160 0.43052 -10.619 < 2e-16 ***
lambethperc54 -1.38906 0.26946 -5.155 0.000000254 ***
lambethperc -0.22167 0.19759 -1.122 0.262
otherperc -0.98755 0.18964 -5.207 0.000000191 ***
year1854 0.12547 0.10999 1.141 0.254
pop_per_house 0.06625 0.05925 1.118 0.263
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(11.5406) family taken to be 1)
Null deviance: 154.987 on 55 degrees of freedom
Residual deviance: 61.328 on 50 degrees of freedom
AIC: 617.31
Number of Fisher Scoring iterations: 1
Theta: 11.54
Std. Err.: 2.48
2 x log-likelihood: -603.315
The result for the Negative Binomial with housing density is, again, that the effect of clean water (the Lambeth effect) is large and statistically significant (coefficient -1.389, ratio effect 4.01, z ratio -5.16). The regression fits the data reasonably well when measured by the Residual Deviance.
I use these tables to summarize the diff-in-diff regressions, but do not display them here. You can if you wish.
First, a code chunk for the linear regressions (but also including the Negative Binomial with population density)
# First, create the dataframes for prediction. "predict" can take in a dataframe but for some reason it only
# seems to work with a dataframe from a regression (at least I couldn't get it to work any other way).
# The idea is to take a "model.dataframe" from one of the regressions (just the first three rows) and
# then manually insert the population proportions, years, etc. that we want to predict:
# - Southwark rate, 1854, average housing density (1st row, Lambeth and "Other" population proportions 0)
# - Lambeth rate, 1854, average housing density (2nd row Lambethperc=1 and lambethperc54=1)
# - "Other" rate, 1854, average housing density (3rd row, otherperc=1)
# The idea is to take the dataframe from a fitted model, then over-write the indicators for "Lambeth", "Other"
# We need population, and if we put in 10,000 then the counts will be per 10,000 people which is the way we
# want to quote rates.
regdata_pred <- model.frame(nb1propn_linden)[1:3,]
regdata_pred$lambethperc <- c(0,1,0) # Population proportion for Lambeth
regdata_pred$lambethperc54 <- c(0,1,0) # Population proportion for Lambeth 1954
regdata_pred$otherperc <- c(0,0,1) # Population proportion for "Other"
regdata_pred$pop_per_house <- mean(regdata$pop_per_house) # set population per house to the mean
regdata_pred$year <- "1854" # Year to 1854
regdata_pred$`offset(log(pop1851))` <- log(10000) # Force population to be 10,000
regdata_pred$pop1851 <- 10000 # Force population to be 10,000
# Slight difference for predicting with linear count models, where we need to enter the
# "10,000" population as "1" (since we estimated parameters per 10,000)
regdata_predlin <- regdata_pred
regdata_predlin$pop1851 <- 1
regdata_predlin$yearind <- 1
# --------------------- #
# Create table with regression results (diff-in-diffs estimate) using robust SEs
# for all except the NB1 (Negative Binomial without sub-district fixed effects)
regtabledid1856lin <- matrix(0,nrow=16,ncol=9)
colnames(regtabledid1856lin) <- c("Lin OLS","Lin Poisson","Lin Poiss FE",
"Lin Neg Binom","Lin NB, House","Lin NB, FE",
"Log NB","Log NB H den","log NB FE")
rownames(regtabledid1856lin) <- c("Lambeth 1854 effect","SE","z value",
"lambeth perc","z value","other perc","z value","1854","z value",
"pop density","z value","Adj Rsq / resid dev", "Resid Dev p-value",
"Southwark 1854 rate","Lambeth 1854 rate","Other rate")
# Populate the table from the regressions
# First, define rows to put data into
xteffect <- c(1,2,3)
xresidev <- 12
xresidevpv <- 13
xlambeth <- c(4,5)
xother <- c(6,7)
xyear <- c(8,9)
xhden <- c(10,11)
xmortality <- c(14,15,16)
# Linear no population density
xrow <- 1
regtabledid1856lin[xteffect,xrow] <- summary(linpropn)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(linpropn)$adj.r.squared
regtabledid1856lin[xlambeth,xrow] <- summary(linpropn)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(linpropn)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(linpropn)$coefficients[5,c(1,3)]
regtabledid1856lin[xmortality,xrow] <- predict(linpropn,newdata=regdata_pred)
# Poisson, linear form
xrow <- 2
regtabledid1856lin[xteffect,xrow] <- summary(pois1propn_lin)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(pois1propn_lin)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(pois1propn_lin)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(pois1propn_lin)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(pois1propn_lin)$coefficients[5,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(pois1propn_lin)$deviance,summary(pois1propn_lin)$df.residual))
regtabledid1856lin[xmortality,xrow] <- predict(pois1propn_lin,newdata=regdata_predlin)
# Negative Binomial FE, linear form
xrow <- 3
regtabledid1856lin[xteffect,xrow] <- summary(pois2propn_lin)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(pois2propn_lin)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(pois2propn_lin)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(pois2propn_lin)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(pois2propn_lin)$coefficients[5,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(pois2propn_lin)$deviance,summary(pois2propn_lin)$df.residual))
# Negative Binomial, linear form
xrow <- 4
regtabledid1856lin[xteffect,xrow] <- summary(nb1propn_lin)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(nb1propn_lin)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(nb1propn_lin)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(nb1propn_lin)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(nb1propn_lin)$coefficients[5,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(nb1propn_lin)$deviance,summary(nb1propn_lin)$df.residual))
regtabledid1856lin[xmortality,xrow] <- predict(nb1propn_lin,newdata=regdata_predlin)
# Negative Binomial population density, linear form
xrow <- 5
regtabledid1856lin[xteffect,xrow] <- summary(nb1propn_linden)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(nb1propn_linden)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(nb1propn_linden)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(nb1propn_linden)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(nb1propn_linden)$coefficients[5,c(1,3)]
regtabledid1856lin[xhden,xrow] <- summary(nb1propn_linden)$coefficients[6,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(nb1propn_linden)$deviance,summary(nb1propn_linden)$df.residual))
regtabledid1856lin[xmortality,xrow] <- predict(nb1propn_linden,newdata=regdata_predlin)
# Negative Binomial FE, logarithmic form
xrow <- 6
regtabledid1856lin[xteffect,xrow] <- summary(nb2propn_lin)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(nb2propn_lin)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(nb2propn_lin)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(nb2propn_lin)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(nb2propn_lin)$coefficients[5,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(nb2propn_lin)$deviance,summary(nb2propn_lin)$df.residual))
# Negative Binomial, logarithmic form
xrow <- 7
regtabledid1856lin[xteffect,xrow] <- summary(nb1propn)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(nb1propn)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(nb1propn)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(nb1propn)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(nb1propn)$coefficients[5,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(nb1propn)$deviance,summary(nb1propn)$df.residual))
regtabledid1856lin[xmortality,xrow] <- exp(predict(nb1propn,newdata=regdata_pred))
# Negative Binomial population density, logarithmic form
xrow <- 8
regtabledid1856lin[xteffect,xrow] <- summary(nb1propn_den)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(nb1propn_den)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(nb1propn_den)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(nb1propn_den)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(nb1propn_den)$coefficients[5,c(1,3)]
regtabledid1856lin[xhden,xrow] <- summary(nb1propn_den)$coefficients[6,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(nb1propn_den)$deviance,summary(nb1propn_den)$df.residual))
regtabledid1856lin[xmortality,xrow] <- exp(predict(nb1propn_den,newdata=regdata_pred))
# Negative Binomial FE, logarithmic form
xrow <- 9
regtabledid1856lin[xteffect,xrow] <- summary(nb2propn)$coefficients[2,c(1,2,3)]
regtabledid1856lin[xresidev,xrow] <- summary(nb2propn)$deviance
regtabledid1856lin[xlambeth,xrow] <- summary(nb2propn)$coefficients[3,c(1,3)]
regtabledid1856lin[xother,xrow] <- summary(nb2propn)$coefficients[4,c(1,3)]
regtabledid1856lin[xyear,xrow] <- summary(nb2propn)$coefficients[5,c(1,3)]
regtabledid1856lin[xresidevpv,xrow] <- (1-pchisq(summary(nb2propn)$deviance,summary(nb2propn)$df.residual))
kable(regtabledid1856lin[c(1,3,9,10,12,13),c(2,4,5)],digits=3, caption = "Summary - Diff-in-Diffs Comparison Using Population Proportions",format='pandoc')
Lin Poisson | Lin Neg Binom | Lin NB, House | |
---|---|---|---|
Lambeth 1854 effect | -141.375 | -118.456 | -131.311 |
z value | -20.005 | -4.983 | -5.624 |
z value | 5.222 | 0.573 | 1.294 |
pop density | 0.000 | 0.000 | 10.493 |
Adj Rsq / resid dev | 712.188 | 60.115 | 60.315 |
Resid Dev p-value | 0.000 | 0.179 | 0.151 |
Note the very low predicted Lambeth mortality rate for the “Linear Negative Binomial” model. I believe this is an artifact of the linear functional form and the effect of housing density. Linearity forces the same additive effect on mortality rate, which is probably not appropriate. Southwark rates are on the order of 170 and Lambeth on the order of 20. With the estimated density coefficient roughly 10, if housing density is one percentage point lower that would be a modest effect on the Southwark rate (from 170 to 160) but a huge effect on Lambeth (from 20 to 10). Linearity is probably not a good assumption - the effect is more likely to be multiplicative.
The problem shows up particularly because of the positive correlation between density and proportion of Lambeth customers: It appears that sub-districts with a higher proportion of Lambeth customers are also higher housing density (correlation 0.71).
The numbers quoted in the table are the predicted rate for “average” housing density, 6.83 people per house. Reporting the predicted rate for “average” density probably pushes the mortality for Lambeth customers below what it should be. Using a larger housing density would give a more appropriate prediction. Note that the linear Negative Binomial regression run without housing density predicts a rate of 15.3. The logarithmic form of the Negative Binomial, which we think should be better behaved because it is based on multiplicative effects, predicts 38.9 for no housing density, and 36.8 with (average) housing density.
Below is a code chunk for the various permutations of count regressions with population proportions, six regressions in total:
# Create table with regression results (diff-in-diffs estimate) using robust SEs
# for all except the NB1 (Negative Binomial without sub-district fixed effects)
regtabledid1856 <- matrix(0,nrow=21,ncol=6)
colnames(regtabledid1856) <- c("Poiss","Poiss pop","PoissFE","NB","NB pop","NB FE")
rownames(regtabledid1856) <- c("Lambeth 1854 effect","SE","z value","p-value",
"robust SE","z value","ratio effect",
"lambeth perc","SE","z value","p-value","other perc","SE","z value","p-value",
"theta","resid dev","pop den","SE","z value","p-value")
# Populate the table from the regressions
# Poiss no population density
regtabledid1856[c(1,2,3,4),1] <- summary(pois1propn)$coefficients[2,c(1,2,3,4)]
regtabledid1856[c(5,6),1] <- pois1propnrobustse[2,c(2,3)]
regtabledid1856[7,1] <- exp(regtabledid1856[1,1])
regtabledid1856[c(8,9,10,11),1] <- summary(pois1propn)$coefficients[3,c(1,2,3,4)]
regtabledid1856[c(12,13,14,15),1] <- summary(pois1propn)$coefficients[4,c(1,2,3,4)]
regtabledid1856[17,1] <- summary(pois1propn)$deviance
# Poiss Population density
regtabledid1856[c(1,2,3,4),2] <- summary(pois1propn_den)$coefficients[2,c(1,2,3,4)]
regtabledid1856[c(5,6),2] <- pois1propn_denrobustse[2,c(2,3)]
regtabledid1856[7,2] <- exp(regtabledid1856[1,2])
regtabledid1856[c(8,9,10,11),2] <- summary(pois1propn_den)$coefficients[3,c(1,2,3,4)]
regtabledid1856[c(12,13,14,15),2] <- summary(pois1propn_den)$coefficients[4,c(1,2,3,4)]
regtabledid1856[17,2] <- summary(pois1propn_den)$deviance
regtabledid1856[c(18,19,20,21),2] <- summary(pois1propn_den)$coefficients[5,c(1,2,3,4)]
# Poiss Fixed Effects
regtabledid1856[c(1,2,3,4),3] <- summary(pois2propn)$coefficients[2,c(1,2,3,4)]
regtabledid1856[c(5,6),3] <- pois2propnrobustse[2,c(2,3)]
regtabledid1856[7,3] <- exp(regtabledid1856[1,3])
regtabledid1856[c(8,9,10,11),3] <- summary(pois2propn)$coefficients[3,c(1,2,3,4)]
regtabledid1856[c(12,13,14,15),3] <- summary(pois2propn)$coefficients[4,c(1,2,3,4)]
regtabledid1856[17,3] <- summary(pois2propn)$deviance
# NB no population density
regtabledid1856[c(1,2,3,4),4] <- summary(nb1propn)$coefficients[2,c(1,2,3,4)]
regtabledid1856[c(5,6),4] <- nb1propnrobustse[2,c(2,3)]
regtabledid1856[7,4] <- exp(regtabledid1856[1,3])
regtabledid1856[c(8,9,10,11),4] <- summary(nb1propn)$coefficients[3,c(1,2,3,4)]
regtabledid1856[c(12,13,14,15),4] <- summary(nb1propn)$coefficients[4,c(1,2,3,4)]
regtabledid1856[16,4] <- summary(nb1propn)$theta
regtabledid1856[17,4] <- summary(nb1propn)$deviance
# NB population density & water company percent
regtabledid1856[c(1,2,3,4),5] <- summary(nb1propn_den)$coefficients[2,c(1,2,3,4)]
regtabledid1856[c(5,6),5] <- nb1propn_denrobustse[2,c(2,3)]
regtabledid1856[7,5] <- exp(regtabledid1856[1,5])
regtabledid1856[c(8,9,10,11),5] <- summary(nb1propn_den)$coefficients[3,c(1,2,3,4)]
regtabledid1856[c(12,13,14,15),5] <- summary(nb1propn_den)$coefficients[4,c(1,2,3,4)]
regtabledid1856[16,5] <- summary(nb1propn_den)$theta
regtabledid1856[17,5] <- summary(nb1propn_den)$deviance
regtabledid1856[c(18,19,20,21),5] <- summary(nb1propn_den)$coefficients[5,c(1,2,3,4)]
# NB FEs
regtabledid1856[c(1,2,3,4),6] <- summary(nb2propn)$coefficients[2,c(1,2,3,4)]
regtabledid1856[c(5,6),6] <- nb2propnrobustse[2,c(2,3)]
regtabledid1856[7,6] <- exp(regtabledid1856[1,6])
regtabledid1856[c(8,9,10,11),6] <- summary(nb2propn)$coefficients[3,c(1,2,3,4)]
regtabledid1856[c(12,13,14,15),6] <- summary(nb2propn)$coefficients[4,c(1,2,3,4)]
regtabledid1856[16,6] <- summary(nb2propn)$theta
regtabledid1856[17,6] <- summary(nb2propn)$deviance
The Poisson regressions do not account for enough of the variation - the data seem to show that there is more variation than would be accounted for by constant mortality rates (constant for each supplier - Southwark & Vauxhall Company, Lambeth Company, and “Other”), even if we allow each sub-district to vary. The data seem to show that there is some random variation in mortality across the years and sub-districts, but that even so there is a large treatment effect for Lambeth Company’s clean water supply in 1854.
xfamily <- preperrdata(pois1propn_lin,single="continuous",link="linear") # this function modifies global data
#pdf(paste("../paper/figures/errbar_","pois1_didcontlin","cpdf",sep=""))
plot3(x1849,x1854,"SouthwarkVauxhall",paste("First-12 Southwark-only ",xfamily," 1849vs1854 "))
#dev.off()
#pdf(paste("../paper/figures/errbar_","pois1_didcontlin","dpdf",sep=""))
plot3(x1849,x1854,"SouthwarkVauxhall_Lambeth",paste("Next-16 Jointly-Supplied ",xfamily," 1849vs1854 "))
#dev.off()
First, we can look at the most basic Poisson regression (done in logarithmic form), including a treatment effect, a year effect, and a Southwark-versus-Lambeth effect for 1849. (This uses the plot functions in “SnowPlotFns.r”.) The red circles are the 1849 rates. The blue diamonds are 1854 rates but adjusted for the 1854 effect and the treatment effect - in other words made comparable to the 1849 rates. The empty circles are the 1849 predicted rates, with approximate 95% error bars.
This Poisson regression does not fit the data - many observations (33 out of 56 or 59%) are outside the 95% confidence bands. There is simply more variation across sub-districts than could be accounted for with a constant rate for each of the three suppliers.
xfamily <- preperrdata(pois2propn_lin,"continuous",link="linear") # this function modifies global data
#pdf(paste("../paper/figures/errbar_","pois2_didcontlin","cpdf",sep=""))
plot3(x1849,x1854,"SouthwarkVauxhall",paste("First-12 Southwark-only FE",xfamily," 1849vs1854 "))
#dev.off()
#pdf(paste("../paper/figures/errbar_","pois2_didcontlin","dpdf",sep=""))
plot3(x1849,x1854,"SouthwarkVauxhall_Lambeth",paste("Next-16 Jointly-Supplied FE",xfamily," 1849vs1854 "))
#dev.off()
A very reasonable alternative is that rates vary across sub-districts - each sub-district has it’s own rate. This is a fixed effects model, a fixed-coefficient model for rate variation across sub-districts. This explanation does not fit the data well, because there is still too much variation, in this case variation across years but within sub-districts. There are roughly 18 out of 56 or 32% of observations outside the 95% confidence bands.
xfamily <- preperrdata(nb1propn_lin,"continuous",link="linear") # this function modifies global data
#pdf(paste("../paper/figures/errbar_","nb1_didcontlin","cpdf",sep=""))
plot3(x1849,x1854,"SouthwarkVauxhall",paste("First-12 Southwark-only ",xfamily," 1849vs1854 "))
#dev.off()
#pdf(paste("../paper/figures/errbar_","nb1_didcontlin","dpdf",sep=""))
plot3(x1849,x1854,"SouthwarkVauxhall_Lambeth",paste("Next-16 Jointly-Supplied ",xfamily," 1849vs1854 "))
#dev.off()
A third alternative is that mortality rates vary across sub-districts and years in a random manner. A Negative Binomial model is a Gamma-mixture model of Poissons - individuals each have a Poisson intensity for mortality, but the Poisson intensities are distributed across the population according to a Gamma distribution, with the variance of the Gamma parameter to be estimated.
This figure shows the results, and this model fits the data reasonably well. There are 3 out of 56 or 5.45 of the observations outside the approximate 95% confidence bands.
Using the population by sub-district to measure the fraction of customers supplied by the Southwark & Vauxhall Company, the Lambeth Company, and Other (wells etc.) strongly supports Snow’s contention both that the actual and predicted mortality “bear a close relation” and that nature of the water supply exerts an “overwhelming influence”.