Examining Snow’s 1856 Table VI - Predicted Mortality with Population Data

See “Causality in the Time of Cholera” working paper at https://papers.ssrn.com/abstract=3262234 and my John Snow project website

This notebook is licensed under the BSD 2-Clause License

Introduction

This notebook starts the examination of Table VI from Snow’s 1856 “Cholera and the water supply in the south district of London in 1854” and extends the analysis using a difference-in-differences framework.

With population data by sub-district and by water supply company (originally published in Simon 1856, “Report on the last two cholera-epidemics of London”), Snow was able to compare mortality at the sub-district level more closely than he could in his 1855 “On the mode of communication …”. In Table VI Snow used estimates of mortality rates for Southwark & Vauxhall versus Lambeth, combined with population fractions by sub-district, to calculate population-weighted predicted mortality by sub-district. Snow’s goal was to show that differences in water supply was the predominant factor - more important than crowding or other factors - in accounting for differences across sub-districts.

Snow, however, did not have the statistical tools and methodology available to us today, and his argument had neither the clarity nor the rigor we would demand today. This notebook first examines Snow’s Table VI (as published) and discusses the error structure across sub-districts in some detail. This helps to explain why a simple approach (such as a paired t-test) is not appropriate for these count data.

For a brief introduction to Snow’s work, see:

  • Snow’s original 1855 monograph (it is masterful): Snow, John. 1855. On the Mode of Communication of Cholera. 2nd ed. London: John Churchill. http://archive.org/details/b28985266.
  • The best popular exposition I have found: Johnson, Steven. 2007. The Ghost Map: The Story of London’s Most Terrifying Epidemic–and How It Changed Science, Cities, and the Modern World. Reprint edition. New York: Riverhead Books.
  • Another good popular version: Hempel, Sandra. 2007. The Strange Case of the Broad Street Pump: John Snow and the Mystery of Cholera. First edition. Berkeley: University of California Press.
  • Tufte’s classic discussion of Snow’s mapping (a topic I don’t cover here): Tufte, Edward R. 1997. Visual Explanations: Images and Quantities, Evidence and Narrative. 1st edition. Graphics Press.
  • Biography: Vinten-Johansen, Peter, Howard Brody, Nigel Paneth, Stephen Rachman, and Michael Russell Rip. 2003. Cholera, Chloroform and the Science of Medicine: A Life of John Snow. Oxford; New York: Oxford University Press. Linked on-line resources https://johnsnow.matrix.msu.edu/snowworks.php

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. The results are also saved in a self-contained html document with the suffix .nb.html. If you want pure r code (for example to run outside RStudio) you can easily extract code with the command knit(‘notebook.Rmd’,tangle=TRUE) which will save a file ‘notebook.R’ under your working directory.

Try executing the chunk below by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

# Copyright (c) 2019, Thomas Coleman
#
#  -------  Licensed under BSD 2-Clause "Simplified" License  -------
#
# Results and discussion in "Causality in the Time of Cholera: John Snow as a Prototype 
# for Causal Inference (Working Paper)" available at SSRN: https://papers.ssrn.com/abstract=3262234
rm(list=ls())    # starts a fresh workspace
#
library(knitr)
options(scipen=5)
# The following libraries are used for the Negative Binomial regression and the robust standard error analysis
#install.packages("sandwich")
#install.packages("lmtest")
library("MASS")
library("sandwich") 
library("lmtest") 
# Read in the data from John Snow 1856, "Cholera and the water supply in the south district of London in 1854", 
#   These data were copied from the 1936 book "Snow on cholera, being a reprint of two papers" edited by Frost
# Table V by District (for running Poisson & Neg Binomial count regressions)
# Table VI by sub-district (for running Koch & Denike's tests)
# Table I "Showing the results of the Author's personal Inquiry into Twenty-One Sub-Districts"
# Table II "Showing the results of Inquiry made by Mr. Whiting in Eleven Sub-Districts"
# (My "tablei_1856" combines Snow's Tables I & II)
tablei_1856 <- read.csv(file="Snow1856_TableI.csv",
  header=TRUE, sep=",", skip=5,comment.char="#")
tableV_1856 <- read.csv(file="Snow1856_TableV.csv",
  header=TRUE, sep=",", skip=5,comment.char="#")
tableVI_1856 <- read.csv(file="Snow1856_TableVI.csv",
  header=TRUE, sep=",", skip=5,comment.char="#")

Check Snow’s Calculations for Table VI

There are a few rounding errors in Snow’s Table VI, for items such as “mortality_1854” (the calculated mortality rate per 10,000 for 1854). The input .csv file has Snow’s original entries but we can also recalculate the calculated and predicted entries and compare. The result of the code chunk below is a dataframe “xcompare” that has the differences between Snow’s entry and the (rounded) calculation. Most are zero. The only real error is the projected mortality rate for Southwark, Christchurch which Snow reports as 57 while it should be 51.0.

For references, the calculated entries are:

  • “pop_combined” (estimated population for Southwark & Vauxhall and Lambeth customers combined) - no errors. Note that this population may be substantially less than the overall population, presumably because there are customers not served by a water company
  • “mortality_1854” - mortality rate per 10,000, a few rounding errors
  • “deaths_Southwark_projected_rate160” - number of deaths for Southwark customers projected using population and mortality rate of 160 per 10,000
  • “deaths_Lambeth_projected_rate27” - number of deaths for Lambeth customers projected using population and mortality rate of 27 per 10,000
  • “deaths_projected_combined” - Southwark and Lambeth deaths combined
  • “mortality_projected” - the combined deaths divided by the combined population
# Calculate data and print out differences between Snow's reported data and re-calculated fractions, etc.
# There are a number of minor differences, primarily rounding errors. 
tableVI_1856$pop_combined_calc = tableVI_1856$pop_southwark + tableVI_1856$pop_lambeth 
tableVI_1856$mortality_1854_calc = 10000 * tableVI_1856$deaths_1854 / tableVI_1856$pop1851
tableVI_1856$deaths_Southwark_projected_rate160_calc = 160 * tableVI_1856$pop_southwark / 10000
tableVI_1856$deaths_Lambeth_projected_rate27_calc = 27 * tableVI_1856$pop_lambeth / 10000
tableVI_1856$deaths_projected_combined_calc = tableVI_1856$deaths_Southwark_projected_rate160_calc +
    tableVI_1856$deaths_Lambeth_projected_rate27_calc
tableVI_1856$mortality_projected_calc = 10000 * tableVI_1856$deaths_projected_combined_calc /
    tableVI_1856$pop_combined_calc
xcompare <- tableVI_1856[c("subDistrict","pop_combined","mortality_1854","deaths_Southwark_projected_rate160",
    "deaths_Lambeth_projected_rate27","deaths_projected_combined","mortality_projected")]
xcompare$pop_combined <- round(tableVI_1856$pop_combined - tableVI_1856$pop_combined_calc,digits=0)
xcompare$mortality_1854 <- round(tableVI_1856$mortality_1854 - tableVI_1856$mortality_1854_calc,digits=0)
xcompare$deaths_Southwark_projected_rate160 <- round(tableVI_1856$deaths_Southwark_projected_rate160 - tableVI_1856$deaths_Southwark_projected_rate160_calc,digits=0)
xcompare$deaths_Lambeth_projected_rate27 <- round(tableVI_1856$deaths_Lambeth_projected_rate27 - tableVI_1856$deaths_Lambeth_projected_rate27_calc,digits=0)
xcompare$deaths_projected_combined <- round(tableVI_1856$deaths_projected_combined - tableVI_1856$deaths_projected_combined_calc,digits=0)
xcompare$mortality_projected <- round(tableVI_1856$mortality_projected - tableVI_1856$mortality_projected_calc,digits=0)
Snow’s Table VI with original and re-calculated entries
tableVI_1856
Comparison of Snow’s entries versus re-calculated
xcompare

Snow’s Comparison of Actual vs Predicted Mortality

An (Inappropriate) Paired t-test

Snow’s goal with Table VI was to show that the difference in mortality was driven primarily by the difference in mortality for Southwark & Vauxhall customers versus Lambeth customers. He had the actual mortality by sub-district (“mortality_1854” in the .csv). He calculated a population-weighted mortality rate for each sub-district by calculating the counts for both suppliers and each sub-district and combining them. This is labeled “mortality_projected” (for Snow’s original entry) and “mortality_projectd_calc” for the number calculated and reported to more decimals.

Snow’s argument is that these two, actual and predicted mortality, were close:

“it will be observed that the calculated mortality bears a very close relation to the real mortality in each subdistrict. … and proves the overwhelming influence which the nature of the water supply exerted over the mortality, overbearing every other circumstance which could be expected to affect the progress of the epidemic. … [This] probably supplies a greater amount of statistical evidence than was ever brought to bear on a medical subject.”

tableVI_1856[c("subDistrict","seq_1855","mortality_1854","mortality_projected","mortality_projected_calc")]

We need a more formal method for comparing the actual versus predicted. As a first (but ultimately incorrect) approach let us try a paired t-test: for each sub-district calculate the difference between actual and predicted and test whether the average across all sub-districts is different from zero. We need to do this for mortality rates rather than counts because the populations can be quite different: Snow is comparing the overall sub-district mortality against a prediction based on only Southwark and Lambeth supplied customers, a population that may be lower.

For example for Putney the overall population is 5,280 while the estimated combined Southwark plus Lambeth population is only 74: many houses were supplied by pump-wells or the Thames. For Kennington 1st the total population is 24,261 while the combined Southwark and Lambeth population is only 18,483. The total observed count (total population 24,261) is 305, giving an estimated mortality rate of 125.7 per 10,000. If we apply the rate of 125.7 to the Southwark and Lambeth population (18,483) we would have an expected count of only 232.2. Comparing 305 versus 232.2 is a difference of 72.7 but this reflects only differing population size and not underlying mortality or infection; the rates are identical.

If we go ahead and perform a paired t-test on the differences in rates (even thought, as we will see, it is not appropriate), we find that the differences are not significant.

trates <- t.test(tableVI_1856$mortality_1854_calc[1:31],tableVI_1856$mortality_projected_calc[1:31],paired=TRUE)
print(trates)

    Paired t-test

data:  tableVI_1856$mortality_1854_calc[1:31] and tableVI_1856$mortality_projected_calc[1:31]
t = -1.4284, df = 30, p-value = 0.1635
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.644318   4.358709
sample estimates:
mean of the differences 
               -10.1428 

The t-ratio for the paired differences is -1.43, which is not significant. The reason is that the mean (-10.14) is relatively small compared to the standard deviation for the paired differences (39.53). (Remember that the t-ratio is the ratio of the mean to the standard error, or the standard deviation divided by square-root degrees of freedom, \(\sqrt{30}\).)

Graphs of Error Structure

Graphs help to clarify why the paired t-test is not appropriate. The following shows the difference in mortality (actual less predicted) with approximate 95% confidence bands assuming that the differences are normal and drawn from a common distribution. The paired t-test assumes that the observations are all normal, so the differences are normal, and all with a common variance or standard deviation (39.53).

# Function to plot difference in mortality rates and approximate 95% intervals
plotdiff_stderr <- function(yseq, xdiff,xstderr,title,spread=0) {            
    xplot <- plot(xdiff, yseq,
        xlim=range(c(xdiff, -1.96*xstderr,1.96*xstderr,spread)),
        ylim=rev(range(yseq)), col="red",
        main=title,xlab="Difference in Mortality Rates, approx error bars",ylab="sub-district",
        pch=19)
    # horizontal error bars
    xplot <- arrows(-1.96*xstderr, yseq, 1.96*xstderr, yseq, length=0.05, angle=90, code=3,lty=3)
    xplot
}
yseq <- tableVI_1856[1:31,]$seq_1855
#xexcl_Putney <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,28,29,30,31)
# Calculate difference in mortality rates
xdiff <- tableVI_1856$mortality_1854_calc[1:31]-tableVI_1856$mortality_projected_calc[1:31]
# Calculate the standard error for just a standard paired t-test:
xm <- mean(tableVI_1856$mortality_1854_calc[1:31]-tableVI_1856$mortality_projected_calc[1:31])
xs <- sd(tableVI_1856$mortality_1854_calc[1:31]-tableVI_1856$mortality_projected_calc[1:31])
xstderr_ttest <- xs 
# Calculate the standard error assuming each sub-district is Poisson.
# We will graph this in the next chunk of code, but we need it here to get the x-scale
#   for this graph (so they all match)
# 1854 mortality rate divided by 1851 population (for calculating standard error)
x1 <- (tableVI_1856$mortality_1854_calc[1:31] / tableVI_1856$pop1851[1:31]) / 10000
# Mortality rate for Southwark and Lambeth combined, divided by the population for combined
x2 <- (tableVI_1856$mortality_projected_calc[1:31] / tableVI_1856$pop_combined_calc[1:31]) / 10000
xstderr_poiss <- 10000 * sqrt(x1 + x2)     # Std Err of difference in mortality rates (assuming Poisson)
#pdf(paste("../paper/figures/errbar_diff1856_t","a.pdf",sep=""))
    title <- "Actual minus Snow Predicted, as if paired t-test"
  spread <- c(-1.96*xstderr_poiss,1.96*xstderr_poiss)
    plotdiff_stderr(yseq,xdiff,xstderr_ttest,title,spread)   # Plot for all sub-districts
NULL

#dev.off()

Sub-districts are displayed in the order from Snow’s 1855 publication - shown in the variable “seq_1855” in the Table VI .csv. Error bars are drawn at \(\pm\) 1.96 * Standard Deviation of differences; this is = \(\pm\) 77.49. These give the intuition for the t-test, showing that most of the observed differences are within the 95% error bars and thus relatively small relative to the observed variability across sub-districts.

But the assumption for this graph, that all sub-district pairs have the same variance, is not correct. These are mortality rates, derived from counts based on finite populations. There will be variability simply because of the finite sub-district populations. Some are small - as mentioned above Putney has only 74 people supplied by the two water companies. Putney should have a much larger standard error than, say, Kennington 1st with a combined population of 18,483.

We can calculate approximate standard errors for each sub-district pair. Let us make the very reasonable assumption that the counts for each sub-district are Bernoulli random variables, the sum of which is Binomial and will go towards normal for large populations. The rates (average of the counts) will go to normal (by the Central Limit Theorem) with standard error \(\sqrt{\frac{r*(1-r)}{n}}\) which for Poisson (which approximates the Bernoulli / Binomial) is \(\sqrt{\frac{\lambda}{n}}\). The difference between two rates with different populations (actual and predicted here) will have approximate standard error \(\sqrt{\frac{\lambda1}{n1} - \frac{\lambda2}{n2}}\).

# Calculate difference in death counts - but don't graph this
xdiffc <- tableVI_1856$deaths_1854[1:31]-tableVI_1856$deaths_projected_combined_calc[1:31]
# 1854 deaths 
x1 <- tableVI_1856$deaths_1854[1:31] 
# deaths for Southwark and Lambeth combined
x2 <- tableVI_1856$deaths_projected_combined_calc[1:31] 
xstderrc <-  sqrt(x1 + x2)     # Std Err of difference in deaths (assuming Poisson)
title <- "Actual minus Snow Predicted, Poisson Error Bars"
#plotdiff_stderr(yseq,xdiffc,xstderrc,title)   # Plot for all sub-districts
#pdf(paste("/Users/tcoleman/tom/Economics/Harris/research/snow_cholera/paper/figures/errbar_diff1856_pois","a.pdf",sep=""))
    title <- "Actual minus Snow Predicted, Poisson Error Bars"
    plotdiff_stderr(yseq,xdiff,xstderr_poiss,title)   # Plot for all sub-districts
NULL

This graph shows the approximate Poisson error bars. Because some of the sub-districts (Putney, sequence number 10, for example) have very low water company population the standard error can be large. The difference for Putney is large but it is well within the standard error when we account for the small population of water company customers.

This graph shows why the paired t-test is not appropriate: it does not incorporate the variation due to finite sub-district populations. In the end we need to combine both the cross-sub-district variation shown in the first graph with the finite-population variation shown here.

One possible statistical assumption is that the counts are Negative Binomial, meaning that the sub-districts are Poisson but the Poisson rates are random and drawn from a Gamma distribution. This is not the only possible assumption but it is convenient.

# Calculate difference in mortality rates, but assuming Negative Binomial rather than simple Poisson
#   Poisson, Variance = rate * N => Var / N^2 = rate / N
#   NB, Variance = rate*N + (rate*N)^2 / theta => Var / N^2 = rate/N + rate^2/theta
theta <- 12.8
# Variance for actual mortality rates - variance for NB is as above 
x1 <- ((tableVI_1856$mortality_1854_calc[1:31]/10000) / tableVI_1856$pop1851[1:31] 
    + (tableVI_1856$mortality_1854_calc[1:31]/10000)^2 / theta) 
# Variance for predicted mortality rates - variance for NB is as above
    x2 <- ((tableVI_1856$mortality_projected_calc[1:31]/10000) / tableVI_1856$pop_combined_calc[1:31] 
        + (tableVI_1856$mortality_projected_calc[1:31]/10000)^2 / theta) 
#x2 <- (tableVI_1856$mortality_projected_calc[1:31] / tableVI_1856$pop_combined_calc[1:31]) / 10000
xstderr_nb <- 10000 * sqrt(x1 + x2)     # Std Err of difference in mortality rates (assuming Poisson)
#pdf(paste("../paper/figures/errbar_diff1856_nb","a.pdf",sep=""))
    title <- "Actual minus Snow Predicted, Neg Bin Error Bars"
    plotdiff_stderr(yseq,xdiff,xstderr_nb,title)   # Plot for all sub-districts
NULL

#dev.off()

This graph shows approximate error bars for the Negative Binomial standard error of \(\sqrt{\left(\frac{\lambda1}{n1} + \frac{\lambda1^2}{\theta}\right) - \left(\frac{\lambda2}{n2} + \frac{\lambda2^2}{\theta}\right)}\) (and using a reasonable value of the Negative Binomial parameter - \(\theta\)=12.8).

This is not a formal estimation or testing framework (we turn to that in the next section) but it shows how and why we need to incorporate both cross-sub-district and within-sub-district variation. Even though the above graph is only heuristic and approximate, we can see that all the differences fit within these “error bars”. We will now turn to a more appropriate structure where we will see that yes, Snow was right, and differences in mortality across different water sources was a major source of the variation both across sub-districts and across time within sub-districts.

Table V

Assigning the 623 “Unascertained” Deaths

I now turn from Table VI (data by sub-district but at the combined Southwark-plus-Lambeth level) to Table V (data by District but dis-aggregated by supplier: Southwark & Vauxhall, Lambeth, and “Other”).

For the quasi-randomized comparisons in a later notebook we need deaths assigned to water source or supplier. There are three: Southwark & Vauxhall Company; Lambeth Company; and “Other” (pump-wells, the Thames, ditches). The tables that supply data by supplier are as follows:

  • Snow 1855 Table VII and VIII, Snow 1856 Tables I, II: by sub-district (32 sub-districts) for part of the epidemic: the four weeks ending 5th August and seven seeks ending 26th August. Collected by Snow and Whiting
  • Snow 1856 Table V: by Registration District (11 Districts) for the full 1854 epidemic (ending October) and matching the dates for Snow 1855 Table XII

The background to these data are that in the summer of 1854 during the London epidemic, Snow recognized the importance of measuring deaths by supplier. He spent his free time visiting all the houses with a recorded death and determining the source of water supply (see Snow’s narrative, Snow 1855 p 76 ff - Snow is indeed the father of “shoe leather epidemiology”). Snow - with assistance from Mr. John Joseph Whiting, L.A.C. - ascertained the supply for deaths for the seven weeks ending 26th August 1854. Their results are reported in Snow 1855 Tables VII and VIII and Snow 1856 Tables I, II, and III. Snow also persuaded the Registrar-General to collect data for the rest of the epidemic. These data are reported in Snow 1856 Table IV, and the combined data (for the whole 1854 epidemic) in Table V. (Note - to my knowledge data assigning deaths to water supplier by sub-district were never published for the full 1854 epidemic, at least in a form that matched Snow’s Table XII.)

There were, however, difficulties in determining the source for houses that were supplied by either the Southwark & Vauxhall Company or the Lambeth Company. Homeowners and particularly renters might not remember which company actually supplied the water. (Note that there was little difficulty for houses supplied by “Other” - pump-wells, Thames, etc.) Snow was diligent in determining the source (and developed a chemical test that could determine in the absence of definitive billing or other records). For the seven weeks ending 26th August there were 22 out of 1,514 deaths (1.5%) “unascertained”. The Registrar-General’s reporters were less careful - 601 out of 3,564 deaths (16.9%) unascertained. Note, however, that the location of each of these 623 deaths was known and assigned to District and sub-district.

For the quasi-randomized comparisons discussed in a later notebook we would like to estimate or approximate the assignment of these 623 deaths to the two water companies. Snow himself proposed a simple but reasonable method:

The instances in which the water supply was not specified, or not ascertained, in the returns made by the district registrars must evidently nearly all have been cases in which the house was supplied by one or other of the water companies, for, if the persons received no such supply, and obtained water from a pump well, canal, or ditch, there could be no difficulty in knowing the fact. Moreover, as the two water companies are guided by precisely the same regulations, the difficulty in ascertaining the supply is exactly the same with regard to one as the other; I, therefore, concluded that I could not be wrong in dividing the non-ascertained cases between the two companies in the same proportion as those which were ascertained, and I have done so at the foot of table V (Snow (1856) p. 247)

The following code chunk adjust Southwark and Lambeth deaths and rates in Table V by assigning the “unascertained” on a District-by-District basis using within-district proportions, basically following Snow’s proposal. Note, importantly, that within a District this does not change the total number of deaths but simply re-assigns the “unascertained” to Southwark & Vauxhall and Lambeth.

tableV_1856$mortality <- 10000 * tableV_1856$deaths_total / tableV_1856$pop1851
# Create table to hold Koch & Denike's numbers
tableV_Koch <- tableV_1856
# Adjust Southwark and Lambeth deaths by "unascertained" by within-district proportion of Southwark vs Lambeth
tableV_1856$deaths_southwark_adj <- tableV_1856$deaths_southwark + tableV_1856$deaths_unascertained * tableV_1856$deaths_southwark / (tableV_1856$deaths_southwark + tableV_1856$deaths_lambeth)
tableV_1856$deaths_lambeth_adj <- tableV_1856$deaths_lambeth + tableV_1856$deaths_unascertained * tableV_1856$deaths_lambeth / (tableV_1856$deaths_southwark + tableV_1856$deaths_lambeth)
# Display relevant columns
tableV_1856[c("district","pop1851","deaths_southwark","deaths_lambeth","deaths_pumps","deaths_unascertained","deaths_total","mortality","deaths_southwark_adj","deaths_lambeth_adj")]

Problems with Koch & Denike (2006)

T. Koch and K. Denike (“Rethinking John Snow’s South London study: A Bayesian evaluation and recalculation”, Social Science & Medicine vol 63, no 1, July 2006) undertake a valuable exercise to examine Snow (1856) and Snow’s analysis of the South London data, focused most particularly on the 623 unascertained deaths. They state as their goal the correction of Snow’s analysis: “This paper describes a previously unacknowledged methodological and conceptual problem in Snow’s 1856 argument. We review the context of the South London study, identify the problem and then correct it with an empirical Bayes estimation (EBE) approach.”

Unfortunately they fail in their goal. They fail for two important reasons. First, the statistical test they examine (paired t-test shown in their Figure 5) is both inappropriate and not properly applied. Second, and much more seriously, they seem to misunderstand or misinterpret Snow’s data and analysis and as a result they inappropriately alter the mortality data in Snow 1856 Table V - the data from the Registrar-General used by Snow and others. Although not ill-intentioned this altering of the original mortality data does invalidate their analysis and conclusions.

First consider the paired t-test. As discussed above, this test is not appropriate because it does not incorporate the fact that deaths are counts and mortality rates have standard errors that depend on population sizes. However, if the test were to be used it would have to be applied to rate and not counts (again because of differences in populations sizes, this time between pairs).

First we re-print the results of the paired t-test based on rates:

print(trates)

    Paired t-test

data:  tableVI_1856$mortality_1854_calc[1:31] and tableVI_1856$mortality_projected_calc[1:31]
t = -1.4284, df = 30, p-value = 0.1635
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.644318   4.358709
sample estimates:
mean of the differences 
               -10.1428 

This test shows no significant difference (on average) between the pairs, which would support Snow’s assertion of a “close relation” between predicted and actual.

Koch & Denike, however, run the test using counts:

tcounts <- t.test(tableVI_1856$deaths_1854[1:31],tableVI_1856$deaths_projected_combined_calc[1:31],paired=TRUE)
print(tcounts)

    Paired t-test

data:  tableVI_1856$deaths_1854[1:31] and tableVI_1856$deaths_projected_combined_calc[1:31]
t = 3.4048, df = 30, p-value = 0.0019
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.50944 46.01240
sample estimates:
mean of the differences 
               28.76092 

This test does show a high degree of significance, rejecting the hypothesis that the counts are the same. From this Koch & Denike conclude “his [Snow’s] conclusion was less than convincing”. (Koch & Denike p. 278) But for the reasons discussed (the paired t-test test is not an appropriate test, and should in any case be run with rates), Koch & Denike’s conclusion is incorrect.

The more serious problem with Koch & Denike’s paper is that they alter underlying data in a way that is neither necessary nor justified. The problem Koch & Denike set out to address is the 623 deaths that were not assigned to water company supplier (Southwark & Vauxhall Company versus Lambeth Company). They state “there were 623 houses in which cholera occurred that could not be assigned reflexively to any single district nor to either of the two water supplier areas.” (p. 275) The first part of this statement is simply incorrect while the second part is slightly confusing. There were 623 deaths in houses where the house could not be assigned to water supplier, but all those death (and the houses in which they occurred) were clearly assigned, in the original reports from the Registrar-General, to Registration District and sub-district. The assignments to Registration District are clearly shown in Snow’s Table V. There has never (to my knowledge) been any question about the reliability of the Registrar-General’s assignment of deaths to District or Sub-District – in contrast to assignment to water supplier within sub-district. Throughout, Koch & Denike imply, incorrectly, that the problem with the 623 deaths is spatial location, when in fact it is assignment (within sub-district and District) to water supply.

Koch & Denike re-allocate the 623 deaths across water supplier and Registration Districts. In doing so they move deaths across Districts, deaths that were reliably located – assigned to District by the Registrar-General. The re-assignment across Districts is neither necessary nor justified and corrupts the original data. The re-assignment introduces substantive errors in mortality rates for those districts with either below-average or above-average unassigned deaths.

The following code chunk creates a copy of Table V and inserts Koch & Denike’s re-assigned deaths for Southwark & Vauxhall and Lambeth from their Figure 6. The code then calculates for each District the overall mortality (for all sources) according to Koch & Denike, and displays this next to the true mortality rate per 10,000 persons.

# Populate Koch & Denike's 
tableV_Koch$deaths_unascertained <- 0
tableV_Koch$deaths_southwark[1:10] <- c(467.96,317.63,943.92,447.88,527.36,605.61,307.59,405.02,237.06,6.78)
tableV_Koch$deaths_southwark[12] <- sum(tableV_Koch$deaths_southwark[1:10])
tableV_Koch$deaths_lambeth[1:10] <- c(82.37,1.28,1.32,112.82,66.72,157.72,9.03,38.24,1.26,2.43)
tableV_Koch$deaths_lambeth[12] <- sum(tableV_Koch$deaths_lambeth[1:10])
tableV_Koch$deaths_total <- tableV_Koch$deaths_southwark + tableV_Koch$deaths_lambeth + tableV_Koch$deaths_pumps + tableV_Koch$deaths_unascertained
tableV_Koch$mortality <- 10000 * tableV_Koch$deaths_total / tableV_Koch$pop1851
tableV_Koch$mortality_true <- tableV_1856$mortality
# Display some of the columns
tableV_Koch[c("district","pop1851","deaths_southwark","deaths_lambeth","deaths_pumps","deaths_unascertained","deaths_total","mortality","mortality_true")]

Comparing the actual mortality to Koch & Denike’s altered mortality rates we can see that they are the same at the aggregate South London level (all Districts combined) but they differ substantially for a number of Districts. For example the mortality rate for St. Saviour, Southwark is increased from 137.4 to 156.8 (per 10,000), while Lambeth is reduced from 66.5 to 56.5. All of these changes in overall mortality at the Registration District level are arbitrary and counter to the observed data.

LS0tCnRpdGxlOiAiSm9obiBTbm93IFByb2plY3QgLSAxODU2IFRhYmxlIFZJIEludHJvZHVjdGlvbiIKYXV0aG9yOiAiW1Rob21hcyBDb2xlbWFuXShodHRwOi8vd3d3LmhpbGVydW4ub3JnL2Vjb24pIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCiMgRXhhbWluaW5nIFNub3cncyAxODU2IFRhYmxlIFZJIC0gUHJlZGljdGVkIE1vcnRhbGl0eSB3aXRoIFBvcHVsYXRpb24gRGF0YQoKIyMjIyBTZWUgIkNhdXNhbGl0eSBpbiB0aGUgVGltZSBvZiBDaG9sZXJhIiB3b3JraW5nIHBhcGVyIGF0IGh0dHBzOi8vcGFwZXJzLnNzcm4uY29tL2Fic3RyYWN0PTMyNjIyMzQgYW5kIG15IFtKb2huIFNub3cgcHJvamVjdCB3ZWJzaXRlXShodHRwOi8vd3d3LmhpbGVydW4ub3JnL2Vjb24vcGFwZXJzL3Nub3cpCgojIyMjIFRoaXMgbm90ZWJvb2sgaXMgbGljZW5zZWQgdW5kZXIgdGhlIFtCU0QgMi1DbGF1c2UgTGljZW5zZV0oaHR0cHM6Ly9vcGVuc291cmNlLm9yZy9saWNlbnNlcy9CU0QtMi1DbGF1c2UpCgojIyMgSW50cm9kdWN0aW9uCgpUaGlzIG5vdGVib29rIHN0YXJ0cyB0aGUgZXhhbWluYXRpb24gb2YgVGFibGUgVkkgZnJvbSBTbm93J3MgMTg1NiAiQ2hvbGVyYSBhbmQgdGhlIHdhdGVyIHN1cHBseSBpbiB0aGUgc291dGggZGlzdHJpY3Qgb2YgTG9uZG9uIGluIDE4NTQiIGFuZCBleHRlbmRzIHRoZSBhbmFseXNpcyB1c2luZyBhIGRpZmZlcmVuY2UtaW4tZGlmZmVyZW5jZXMgZnJhbWV3b3JrLiAKCldpdGggcG9wdWxhdGlvbiBkYXRhIGJ5IHN1Yi1kaXN0cmljdCBhbmQgYnkgd2F0ZXIgc3VwcGx5IGNvbXBhbnkgKG9yaWdpbmFsbHkgcHVibGlzaGVkIGluIFNpbW9uIDE4NTYsICJSZXBvcnQgb24gdGhlIGxhc3QgdHdvIGNob2xlcmEtZXBpZGVtaWNzIG9mIExvbmRvbiIpLCBTbm93IHdhcyBhYmxlIHRvIGNvbXBhcmUgbW9ydGFsaXR5IGF0IHRoZSBzdWItZGlzdHJpY3QgbGV2ZWwgbW9yZSBjbG9zZWx5IHRoYW4gaGUgY291bGQgaW4gaGlzIDE4NTUgIk9uIHRoZSBtb2RlIG9mIGNvbW11bmljYXRpb24gLi4uIi4gSW4gVGFibGUgVkkgU25vdyB1c2VkIGVzdGltYXRlcyBvZiBtb3J0YWxpdHkgcmF0ZXMgZm9yIFNvdXRod2FyayAmIFZhdXhoYWxsIHZlcnN1cyBMYW1iZXRoLCBjb21iaW5lZCB3aXRoIHBvcHVsYXRpb24gZnJhY3Rpb25zIGJ5IHN1Yi1kaXN0cmljdCwgdG8gY2FsY3VsYXRlIHBvcHVsYXRpb24td2VpZ2h0ZWQgcHJlZGljdGVkIG1vcnRhbGl0eSBieSBzdWItZGlzdHJpY3QuIFNub3cncyBnb2FsIHdhcyB0byBzaG93IHRoYXQgZGlmZmVyZW5jZXMgaW4gd2F0ZXIgc3VwcGx5IHdhcyB0aGUgcHJlZG9taW5hbnQgZmFjdG9yIC0gbW9yZSBpbXBvcnRhbnQgdGhhbiBjcm93ZGluZyBvciBvdGhlciBmYWN0b3JzIC0gaW4gYWNjb3VudGluZyBmb3IgZGlmZmVyZW5jZXMgYWNyb3NzIHN1Yi1kaXN0cmljdHMuIAoKU25vdywgaG93ZXZlciwgZGlkIG5vdCBoYXZlIHRoZSBzdGF0aXN0aWNhbCB0b29scyBhbmQgbWV0aG9kb2xvZ3kgYXZhaWxhYmxlIHRvIHVzIHRvZGF5LCBhbmQgaGlzIGFyZ3VtZW50IGhhZCBuZWl0aGVyIHRoZSBjbGFyaXR5IG5vciB0aGUgcmlnb3Igd2Ugd291bGQgZGVtYW5kIHRvZGF5LiBUaGlzIG5vdGVib29rIGZpcnN0IGV4YW1pbmVzIFNub3cncyBUYWJsZSBWSSAoYXMgcHVibGlzaGVkKSBhbmQgZGlzY3Vzc2VzIHRoZSBlcnJvciBzdHJ1Y3R1cmUgYWNyb3NzIHN1Yi1kaXN0cmljdHMgaW4gc29tZSBkZXRhaWwuIFRoaXMgaGVscHMgdG8gZXhwbGFpbiB3aHkgYSBzaW1wbGUgYXBwcm9hY2ggKHN1Y2ggYXMgYSBwYWlyZWQgKnQqLXRlc3QpIGlzIG5vdCBhcHByb3ByaWF0ZSBmb3IgdGhlc2UgY291bnQgZGF0YS4gCgoKRm9yIGEgYnJpZWYgaW50cm9kdWN0aW9uIHRvIFNub3cncyB3b3JrLCBzZWU6CgorICoqU25vdydzIG9yaWdpbmFsIDE4NTUgbW9ub2dyYXBoKiogKGl0IGlzIG1hc3RlcmZ1bCk6IFNub3csIEpvaG4uIDE4NTUuICpPbiB0aGUgTW9kZSBvZiBDb21tdW5pY2F0aW9uIG9mIENob2xlcmEqLiAybmQgZWQuIExvbmRvbjogSm9obiBDaHVyY2hpbGwuIGh0dHA6Ly9hcmNoaXZlLm9yZy9kZXRhaWxzL2IyODk4NTI2Ni4KKyAqKlRoZSBiZXN0IHBvcHVsYXIgZXhwb3NpdGlvbiBJIGhhdmUgZm91bmQqKjogSm9obnNvbiwgU3RldmVuLiAyMDA3LiAqVGhlIEdob3N0IE1hcDogVGhlIFN0b3J5IG9mIExvbmRvbuKAmXMgTW9zdCBUZXJyaWZ5aW5nIEVwaWRlbWljLS1hbmQgSG93IEl0IENoYW5nZWQgU2NpZW5jZSwgQ2l0aWVzLCBhbmQgdGhlIE1vZGVybiBXb3JsZCouIFJlcHJpbnQgZWRpdGlvbi4gTmV3IFlvcms6IFJpdmVyaGVhZCBCb29rcy4KKyAqKkFub3RoZXIgZ29vZCBwb3B1bGFyIHZlcnNpb24qKjogSGVtcGVsLCBTYW5kcmEuIDIwMDcuICpUaGUgU3RyYW5nZSBDYXNlIG9mIHRoZSBCcm9hZCBTdHJlZXQgUHVtcDogSm9obiBTbm93IGFuZCB0aGUgTXlzdGVyeSBvZiBDaG9sZXJhKi4gRmlyc3QgZWRpdGlvbi4gQmVya2VsZXk6IFVuaXZlcnNpdHkgb2YgQ2FsaWZvcm5pYSBQcmVzcy4KKyAqKlR1ZnRlJ3MgY2xhc3NpYyBkaXNjdXNzaW9uIG9mIFNub3cncyBtYXBwaW5nKiogKGEgdG9waWMgSSBkb24ndCBjb3ZlciBoZXJlKTogVHVmdGUsIEVkd2FyZCBSLiAxOTk3LiAqVmlzdWFsIEV4cGxhbmF0aW9uczogSW1hZ2VzIGFuZCBRdWFudGl0aWVzLCBFdmlkZW5jZSBhbmQgTmFycmF0aXZlKi4gMXN0IGVkaXRpb24uIEdyYXBoaWNzIFByZXNzLgorICoqQmlvZ3JhcGh5Kio6IFZpbnRlbi1Kb2hhbnNlbiwgUGV0ZXIsIEhvd2FyZCBCcm9keSwgTmlnZWwgUGFuZXRoLCBTdGVwaGVuIFJhY2htYW4sIGFuZCBNaWNoYWVsIFJ1c3NlbGwgUmlwLiAyMDAzLiAqQ2hvbGVyYSwgQ2hsb3JvZm9ybSBhbmQgdGhlIFNjaWVuY2Ugb2YgTWVkaWNpbmU6IEEgTGlmZSBvZiBKb2huIFNub3cqLiBPeGZvcmQ7IE5ldyBZb3JrOiBPeGZvcmQgVW5pdmVyc2l0eSBQcmVzcy4gTGlua2VkIG9uLWxpbmUgcmVzb3VyY2VzIGh0dHBzOi8vam9obn