This notebook performs simple calculations using Snow’s 1855 data:
A simple difference-in-differences table comparing mortality rates across regions and across time. (Snow compared counts in Table XII but for some reasons did not then convert to mortality rates.)
A quasi randomized comparison of mortality rates per household which Snow showed in his Table IX. This compares Southwark-supplied versus Lambeth-supplied customers but at a highly aggregated level (the full south London area).
A simple extension of Snow’s Table IX using imputed housing and population for comparing only the sub-districts jointly supplied by Southwark & Vauxhall and Lambeth.
For a brief introduction to Snow’s work, see:
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. The results are also saved in a self-contained html document with the suffix .nb.html. If you want pure r code (for example to run outside RStudio) you can easily extract code with the command knit(‘notebook.Rmd’,tangle=TRUE) which will save a file ‘notebook.R’ under your working directory.
Try executing the chunk below by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
# Copyright (c) 2019, Thomas Coleman
#
# ------- Licensed under BSD 2-Clause "Simplified" License -------
#
# Results and discussion in "Causality in the Time of Cholera: John Snow as a Prototype
# for Causal Inference (Working Paper)" available at SSRN: https://papers.ssrn.com/abstract=3262234
rm(list=ls()) # starts a fresh workspace
#
library(knitr)
options(scipen=5)
# Read in the data
tablevii <- read.csv(file="Snow1855_TableVII.csv", header=TRUE, sep=",", skip=5,comment.char="#")
tableviii <- read.csv(file="Snow1855_TableVIII.csv", header=TRUE, sep=",", skip=5,comment.char="#")
tableix <- read.csv(file="Snow1855_TableIX.csv", header=TRUE, sep=",", skip=5,comment.char="#")
tablexii <- read.csv(file="Snow1855_TableXII.csv", header=TRUE, sep=",", skip=5,comment.char="#")
In South London, south of the Thames around Battersea and Southwark, customers were supplied by two water companies: the Southwark & Vauxhall Water Company and the Lambeth Water Company. Snow recognized that a fortuitous set of circumstances - the Lambeth Water Company moving its water source upstream to a clean location in 1852 while the Southwark & Vauxhall Company continued to draw contaminated water from low down in the Thames - provided precisely the variation needed to test for the effect of dirty versus clean water.
In Table XII Snow sub-totals deaths by Southwark-only (“first 12”), jointly-supplied (“next 16”, some customers jointly by Southwark & Vauxhall, others by Lambeth) and Lambeth-only (“last 4”) sub-districts and then compares them (p. 89). But Snow’s comparison of counts mixes differences in populations with differences in underlying mortality, and the comparison is more clear comparing rates. The next chunk of code compares rates. (This is one instance where I think Snow missed a good opportunity to make his point more clearly and strongly.)
The code here subsets and sums across sub-districts to get the three “regions”:
Population is from Table VIII, counts from Table XII
The ‘treatment’ of changed water supply was applied in 1852, partially to joint “next 16” ‘SouthwarkVauxhall_Lambeth’ customers and not at all to “first 12” ‘SouthwarkVauxhall’ customers. We cannot use ‘Lambeth’ customers’ because they weren’t supplied by Lambeth in 1849.
# Subset and sum populations
popSouthwarkVauxhall <- sum(subset(tableviii,supplier=="SouthwarkVauxhall")[,2])
popSouthwarkVauxhall_Lambeth <- sum(subset(tableviii,supplier=="SouthwarkVauxhall_Lambeth")[,2])
popLambeth <- sum(subset(tableviii,supplier=="Lambeth")[,2])
pop1851 <- c(popSouthwarkVauxhall,popSouthwarkVauxhall_Lambeth,popLambeth)
# Extend Table XII with two extra columns for mortality rates
tablexii$deaths1849rate <- 0
tablexii$deaths1854rate <- 0
tablexii$deaths1849rate <- 10000 * tablexii$deaths1849 / c(tableviii$pop1851[1:32],pop1851)
tablexii$deaths1854rate <- 10000 * tablexii$deaths1854 / c(tableviii$pop1851[1:32],pop1851)
# Create array to hold counts
comparecounts1849_1854 <- array(0,dim=c(3,2))
colnames(comparecounts1849_1854) <- c("1849","1854")
rownames(comparecounts1849_1854) <- c("first 12 Southwark-only","next 16 Jointly-supplied","last 4 Lambeth")
# Create array to hold rates
comparerates1849_1854 <- array(0,dim=c(3,2))
colnames(comparerates1849_1854) <- c("1849","1854")
rownames(comparerates1849_1854) <- c("first 12 Southwark-only","next 16 Jointly-supplied","last 4 Lambeth")
# Create array to hold population
comparepop1849_1854 <- array(0,dim=c(3,2))
colnames(comparepop1849_1854) <- c("1849","1854")
rownames(comparepop1849_1854) <- c("first 12 Southwark-only","next 16 Jointly-supplied","last 4 Lambeth")
# Fill in arrays for counts and rates and population
x1 <- subset(tablexii,supplier=="first12")
comparecounts1849_1854[1,] <- data.matrix(x1[2:3])
comparerates1849_1854[1,] <- data.matrix(10000*x1[2:3]/pop1851[1])
comparepop1849_1854[1,] <- pop1851[1]
x1 <- subset(tablexii,supplier=="next16")
comparecounts1849_1854[2,] <- data.matrix(x1[2:3])
comparerates1849_1854[2,] <- data.matrix(10000*x1[2:3]/pop1851[2])
comparepop1849_1854[2,] <- pop1851[2]
x1 <- subset(tablexii,supplier=="last4")
comparecounts1849_1854[3,] <- data.matrix(x1[2:3])
comparerates1849_1854[3,] <- data.matrix(10000*x1[2:3]/pop1851[3])
comparepop1849_1854[3,] <- pop1851[3]
#
kable(comparecounts1849_1854, caption = "Mortality (number of deaths)",format='pandoc')
1849 | 1854 | |
---|---|---|
first 12 Southwark-only | 2261 | 2458 |
next 16 Jointly-supplied | 3905 | 2547 |
last 4 Lambeth | 162 | 37 |
kable(comparepop1849_1854, caption = "Population",format='pandoc')
1849 | 1854 | |
---|---|---|
first 12 Southwark-only | 167654 | 167654 |
next 16 Jointly-supplied | 300149 | 300149 |
last 4 Lambeth | 19133 | 19133 |
kable(comparerates1849_1854, digits=2, caption = "Mortality (rate per 10,000)",format='pandoc')
1849 | 1854 | |
---|---|---|
first 12 Southwark-only | 134.86 | 146.61 |
next 16 Jointly-supplied | 130.10 | 84.86 |
last 4 Lambeth | 84.67 | 19.34 |
The tables above are valuable but we can extend Snow’s analysis and considerably sharpen our analysis by framing the comparison as a difference-in-differences: formally controlling for the change over time and differences across regions.
Our goal is to measure the impact of clean water (the “treatment”) on mortality. We have variation across both time (1849 versus 1854) and across regions (the “first 12” Southwark-supplied versus the “next 16” jointly-supplied sub-districts). Considering the table below, the difference in rates across columns (across time) controls for differences in the 1849 versus 1854 epidemic rates. The difference across rows (“first 12” versus “next 16” sub-districts) controls for differences across regions.
Sub-districts | 1849 Mortality (rate) | 1854 Mortality (rate) | Diff 1854 less 1849 |
---|---|---|---|
First 12 Southwark-supplied | \(\mu\) | \(\mu + \delta 54\) | \(\delta 54\) |
Next 16 jointly-supplied | \(\mu + \gamma\) | \(\mu + \delta 54 + \gamma + \beta\) | \(\delta 54 + \beta\) |
Diff next-16 less first-12 | \(\gamma\) | \(\gamma + \beta\) | \(\beta\) |
Basically, we are saying that the mortality rate written as an equation is:
\(Rate = \mu + \delta 54*I(54) + \gamma*I(joint) + \beta*I(54)*I(joint)\)
The bottom-right corner of the table (\(\beta\)) shows the effect of treatment, controlled for time (1849 versus 1854) and region (“first 12” versus joint “next 16”) effects. The following code produces this table in both level (straight mortality per 10,000) and in logarithmic. For various reasons we often want to use logs, which measures differences in ration terms. (remember that changes in logs is roughly percent change - exp(ch logs) = ratio change)
# Do diff-in-diffs (in levels) for Southwark-only vs Jointly-supplied
diff_in_diffs_Level <- array(0,dim=c(3,3))
diff_in_diffs_Level[1:2,1:2] <- comparerates1849_1854[1:2,]
# Difference across columns
diff_in_diffs_Level[1:2,3] <- diff_in_diffs_Level[1:2,2] - diff_in_diffs_Level[1:2,1]
diff_in_diffs_Level[3,] <- diff_in_diffs_Level[2,] - diff_in_diffs_Level[1,]
colnames(diff_in_diffs_Level) <- c("level 1849","level 1854","col diffs")
rownames(diff_in_diffs_Level) <- c("first 12 Southwark-only","next 16 Jointly-supplied","row diffs")
# Now do it in logs
diff_in_diffs_Log <- array(0,dim=c(3,3))
diff_in_diffs_Log[1:2,1:2] <- log(comparerates1849_1854[1:2,] / 10000)
# Difference across columns
diff_in_diffs_Log[1:2,3] <- diff_in_diffs_Log[1:2,2] - diff_in_diffs_Log[1:2,1]
diff_in_diffs_Log[3,] <- diff_in_diffs_Log[2,] - diff_in_diffs_Log[1,]
colnames(diff_in_diffs_Log) <- c("log 1849","log 1854","col diffs")
rownames(diff_in_diffs_Log) <- c("first 12 Southwark-only","next 16 Jointly-supplied","row diffs")
#
kable(diff_in_diffs_Level,digits=2,caption="Diff-in-Diffs, Levels (rate per 10,000)",format='pandoc')
level 1849 | level 1854 | col diffs | |
---|---|---|---|
first 12 Southwark-only | 134.86 | 146.61 | 11.75 |
next 16 Jointly-supplied | 130.10 | 84.86 | -45.24 |
row diffs | -4.76 | -61.75 | -56.99 |
kable(diff_in_diffs_Log,digits=4,caption="Diff-in-Diffs, Logs",format='pandoc')
log 1849 | log 1854 | col diffs | |
---|---|---|---|
first 12 Southwark-only | -4.3061 | -4.2226 | 0.0835 |
next 16 Jointly-supplied | -4.3420 | -4.7694 | -0.4273 |
row diffs | -0.0359 | -0.5468 | -0.5109 |
Clean water effect in ratio terms - Expressed as decrease (Southwark -> Lambeth) it is 0.6 and as increase (Lambeth -> Southwark) it is: 1.667.
In additional notebooks I extend this analysis. One set of notebooks (Snow1855_DiDRegression1.Rmd and Snow1855_DiDRegression2_ErrorAnalysis.Rmd) discusses count regressions for diff-in-diffs. This provides the statistical framework for incorporating the sub-district variability we see in Snow’s Table XII. A second notebook (Snow1856_TableVI_DiD.Rmd) builds on the sub-district population data published in Snow’s 1856 “Cholera and the water supply in the south district of London in 1854” for more detailed analysis of the 1849 versus 1854 data.
Snow recognized the value of randomization for answering scientific questions, and he recognized that the mixing of Southwark & Vauxhall customers with Lambeth customers in South London provided a “Grand Experiment” to test the effect of dirty versus clean water:
The experiment too, was on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and, in most cases, without their knowledge; one group being supplied water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity.
For a direct comparison of Southwark & Vauxhall versus Lambeth customers, Snow needed more than just the aggregate death counts from the Registrar-General: “All that was required [for direct comparison] was to learn the supply of water to each individual house where a fatal attack of cholera might occur” (Snow, 1855, p. 75) Snow was the father of “shoe-leather epidemiology” and he visited every household to determine the water supply himself. (This is a wonderful story and Snow tells it on page 76 and following. He had assistance from Mr. John Joseph Whiting, L.A.C for the “first 12” Southwark-only sub-districts since some houses were supplied by pump-wells instead of the Southwark & Vauxhall Company.)
In Table VIII Snow reports the deaths assigned to water supply (Southwark & Vauxhall Company, Lambeth Company, Other (e.g. pump-wells)) for the first seven weeks of the 1854 epidemic (ending 26th August). To appropriately compare mortality across the suppliers Snow needed the population-at-risk. The best he could do (in his 1855 publication) was houses for the whole region: a report to Parliament showed the number of houses supplied by each company, not split out by sub-district (see Snow 1855 p. 78).
These data allowed a direct comparison between Southwark versus Lambeth customers, but only at the aggregate level, summing across all sub-districts. Snow reports this in Table IX.
tableix
We can actually do a little better. We can impute the approximate number of houses (and population) separately for the “first 12” Southwark-supplied sub-districts versus the “next 16” jointly-supplied sub-districts. Customers (houses) are mixed within the jointly-supplied sub-districts, so focusing the comparison on those 16 sub-districts is a more precise comparison of mortality between the two suppliers.
First, we need to approximate the total number of houses in the different regions, which we do by assuming that the housing density (people per house) is constant across all sub-districts. The overall density is 7.36. We then distribute the total houses (66153) across the “first-12”, “next-12” and “last-4”:
tothouses <- (tableix[1,2]+tableix[2,2])
popperhouse <- (popSouthwarkVauxhall+popSouthwarkVauxhall_Lambeth+popLambeth) / tothouses
houses <- pop1851 / popperhouse
houses <- as.matrix(houses,nrow=1,ncol=3)
rownames(houses) <- c("first-12","next-16 joint","last-4")
colnames(houses) <- "imputed houses"
kable(houses,digits=2, format.args=list(big.mark=","))
imputed houses | |
---|---|
first-12 | 22,776.74 |
next-16 joint | 40,776.93 |
last-4 | 2,599.33 |
The trick we are going to use now is that the “first 12” sub-districts are supplied by only Southwark & Vauxhall, so all those 22777 houses are supplied by Southwark & Vauxhall. Similarly for the 2599.33 houses in the “last-4” supplied by Lambeth. We can then back out the houses in the jointly-supplied “next-16”. Finally we can apply the housing density to approximate population.
In terms of a table we might describe this as:
Supplier | first-12 sub-dist | next-16 sub-dist | last-4 sub-dist | All sub-dist |
---|---|---|---|---|
Southwark&Vauxhall supply | imputed from below | imputed from sides | - | original data |
Lambeth supply | - | imputed from sides | imputed from below | original data |
Both companies | calc from pop | calc from pop | calc from pop | original data |
The equations we are using are:
These two pieces of data are supplied by Snow in Table IX. We need to use our approximation for:
housesbyarea <- array(0,dim=c(4,4))
rownames(housesbyarea) <- c("Southwark supply","Lambeth supply","Southwark+Lambeth","Total")
colnames(housesbyarea) <- c("first-12","next-16 joint","last-4","Total")
housesbyarea[1,1] <- houses[1] # In the first-12 area all houses are Southwark: houses(supplier=Southwark,region=Southwark only)
housesbyarea[2,3] <- houses[3] # In the last-4 area all houses are Lambeth: houses(supplier=Lambeth,region=Lambeth only)
housesbyarea[1,2] <- tableix[1,2] - houses[1] # Southwark houses not in the first-12 are in the joint area
# And similarly for houses supplied by Lambeth in the joint area
housesbyarea[2,2] <- tableix[2,2] - houses[3] # Lambeth houses not in the last-4 are in the joint area
# The last two rows are the approximate houses by region (duplicated in cols 3 & 4)
housesbyarea[3,1:3] <- houses
housesbyarea[4,1:3] <- houses
housesbyarea[,4] <- rowSums(housesbyarea)
kable(housesbyarea,digits=2,format.args=list(big.mark=","),caption="Imputed Houses",format='pandoc')
first-12 | next-16 joint | last-4 | Total | |
---|---|---|---|---|
Southwark supply | 22,776.74 | 17,269.26 | 0.00 | 40,046 |
Lambeth supply | 0.00 | 23,507.67 | 2,599.33 | 26,107 |
Southwark+Lambeth | 22,776.74 | 40,776.93 | 2,599.33 | 66,153 |
Total | 22,776.74 | 40,776.93 | 2,599.33 | 66,153 |
kable(popperhouse * housesbyarea,digits=2,format.args=list(big.mark=","),caption="Imputed Population",format='pandoc')
first-12 | next-16 joint | last-4 | Total | |
---|---|---|---|---|
Southwark supply | 167,654 | 127,114.8 | 0 | 294,768.8 |
Lambeth supply | 0 | 173,034.2 | 19,133 | 192,167.2 |
Southwark+Lambeth | 167,654 | 300,149.0 | 19,133 | 486,936.0 |
Total | 167,654 | 300,149.0 | 19,133 | 486,936.0 |
Now we can calculate the mortality rates for the various cells of the tables we have just discussed: by supplier and by region. From Table VIII (the seven weeks ending 26th August) we sum for the sub-districts we focus on: first-12 (Southwark-only supply), next-16 (jointly-supplied), and last-4 (Lambeth-only supply).
# --- TABLE VIII - 7 weeks ending 26th August ---
deathsbyarea <- array(0,dim=c(4,4))
rownames(deathsbyarea) <- c("Southwark supply","Lambeth supply","Southwark+Lambeth","Total")
colnames(deathsbyarea) <- c("first-12","next-16 joint","last-4","Total")
# We subset Table VIII for SouthwarkVauxhall. Col 3 is deaths overall, col 4 supplier=Vauxhall, col 5 supplier = Lambeth
x1 <- subset(tableviii,supplier=="SouthwarkVauxhall")
x2 <- colSums(x1[,3:5]) # The columns for x2 are col 1 overall, col 2 supplier = Vauxhall, col 3 supplier = Lambeth
deathsbyarea[1,1] <- x2[2]
deathsbyarea[3,1] <- x2[2]+x2[3]
deathsbyarea[4,1] <- x2[1]
x1 <- subset(tableviii,supplier=="SouthwarkVauxhall_Lambeth")
x2 <- colSums(x1[,3:5]) # The columns for x2 are col 1 overall, col 2 supplier = Vauxhall, col 3 supplier = Lambeth
deathsbyarea[1,2] <- x2[2]
deathsbyarea[2,2] <- x2[3]
deathsbyarea[3,2] <- x2[2]+x2[3]
deathsbyarea[4,2] <- x2[1]
x1 <- subset(tableviii,supplier=="Lambeth")
x2 <- colSums(x1[,3:5]) # The columns for x2 are col 1 overall, col 2 supplier = Vauxhall, col 3 supplier = Lambeth
deathsbyarea[2,3] <- x2[3]
deathsbyarea[3,3] <- x2[2]+x2[3]
deathsbyarea[4,3] <- x2[1]
deathsbyarea[,4] <- rowSums(deathsbyarea)
kable(deathsbyarea,digits=0,format.args=list(big.mark=","),caption="Deaths 7 weeks ending 26th August",format='pandoc')
first-12 | next-16 joint | last-4 | Total | |
---|---|---|---|---|
Southwark supply | 738 | 525 | 0 | 1,263 |
Lambeth supply | 0 | 94 | 4 | 98 |
Southwark+Lambeth | 738 | 619 | 4 | 1,361 |
Total | 844 | 652 | 18 | 1,514 |
Finally, we can calculate the mortality rate per 10,000 households and per 10,000 persons.
deathrates <- 10000*deathsbyarea/housesbyarea
# Calculate as rate per person (using density)
deathratespop <- deathrates / popperhouse
kable(deathrates,digits=2,format.args=list(big.mark=","),caption="Mortality per 10,000 houses, 7 weeks ending 26th August",format='pandoc')
first-12 | next-16 joint | last-4 | Total | |
---|---|---|---|---|
Southwark supply | 324.01 | 304.01 | NaN | 315.39 |
Lambeth supply | NaN | 39.99 | 15.39 | 37.54 |
Southwark+Lambeth | 324.01 | 151.80 | 15.39 | 205.74 |
Total | 370.55 | 159.89 | 69.25 | 228.86 |
kable(deathratespop,digits=2,format.args=list(big.mark=","),caption="Mortality per 10,000 people, 7 weeks ending 26th August",format='pandoc')
first-12 | next-16 joint | last-4 | Total | |
---|---|---|---|---|
Southwark supply | 44.02 | 41.30 | NaN | 42.85 |
Lambeth supply | NaN | 5.43 | 2.09 | 5.10 |
Southwark+Lambeth | 44.02 | 20.62 | 2.09 | 27.95 |
Total | 50.34 | 21.72 | 9.41 | 31.09 |
We want to focus on the second column, the “next-16” jointly-supplied sub-districts. Those sub-districts mix together Southwark & Vauxhall customers with Lambeth customers, thus coming closest to true randomization. Note that the final column of the “Houses” table reproduces Snow’s final column of Table IX, but corrects a minor error in rounding (37.54 versus Snow’s reported 37).
The result simply reinforces Snow’s conclusion: focusing on the “next-16” jointly-supplied sub-districts the difference between Southwark & Vauxhall customers versus Lambeth customers is still dramatically large: a factor of 7.6.
We could do the same analysis for Table VII (the first 4 weeks of the 1854 outbreak). The code chunk below performs those calculations but does not print the results.
In another notebook I examine the population data by sub-district that Snow published in his 1856 “Cholera and the water supply in the south district of London in 1854”. With that data we can compare between Southwark & Vauxhall versus Lambeth Companies at the sub-district level. We see with those data that our approximations and imputations for the houses are not perfect, and in particular we have to incorporate the “Other” category of water supply (e.g. pump-wells) because there are some sub-districts with very few water company customers.
# --- TABLE VII - 4 weeks ending 26th August ---
deathsbyarea_4 <- array(0,dim=c(4,4))
rownames(deathsbyarea_4) <- c("Southwark supply","Lambeth supply","Southwark+Lambeth","Total")
colnames(deathsbyarea_4) <- c("first-12","next-16 joint","last-4","Total")
# We subset Table VII (first 4 weeks) for SouthwarkVauxhall. Col 3 is deaths overall, col 4 supplier=Vauxhall, col 5 supplier = Lambeth
x1 <- subset(tablevii,supplier=="SouthwarkVauxhall")
x2 <- colSums(x1[,3:5]) # The columns for x2 are col 1 overall, col 2 supplier = Vauxhall, col 3 supplier = Lambeth
deathsbyarea_4[1,1] <- x2[2]
deathsbyarea_4[3,1] <- x2[2]+x2[3]
deathsbyarea_4[4,1] <- x2[1]
x1 <- subset(tablevii,supplier=="SouthwarkVauxhall_Lambeth")
x2 <- colSums(x1[,3:5]) # The columns for x2 are col 1 overall, col 2 supplier = Vauxhall, col 3 supplier = Lambeth
deathsbyarea_4[1,2] <- x2[2]
deathsbyarea_4[2,2] <- x2[3]
deathsbyarea_4[3,2] <- x2[2]+x2[3]
deathsbyarea_4[4,2] <- x2[1]
x1 <- subset(tablevii,supplier=="Lambeth")
x2 <- colSums(x1[,3:5]) # The columns for x2 are col 1 overall, col 2 supplier = Vauxhall, col 3 supplier = Lambeth
deathsbyarea_4[2,3] <- x2[3]
deathsbyarea_4[3,3] <- x2[2]+x2[3]
deathsbyarea_4[4,3] <- x2[1]
deathsbyarea_4[,4] <- rowSums(deathsbyarea_4)
#kable(deathsbyarea_4,digits=0,format.args=list(big.mark=","),caption="Deaths 4 weeks",format='pandoc')
deathrates_4 <- 10000*deathsbyarea_4/housesbyarea
# Calculate as rate per person (using density)
deathratespop_4 <- deathrates_4 / popperhouse
#kable(deathrates,digits=2,format.args=list(big.mark=","),caption="Mortality per 10,000 houses, 7 weeks ending 26th August",format='pandoc')
#kable(deathratespop,digits=2,format.args=list(big.mark=","),caption="Mortality per 10,000 people, 7 weeks ending 26th August",format='pandoc')