---
title: "How Reliable Are These Numbers?"
format:
  html:
    code-fold: true
    code-summary: "Show code"
vignette: >
  %\VignetteIndexEntry{How Reliable Are These Numbers?}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = ""
)

if (requireNamespace("micromort", quietly = TRUE)) {
  library(micromort)
} else {
  pkgload::load_all(quiet = TRUE)
}
library(dplyr)

# Shared safe_tar_read with RDS fallback (see inst/vignette_utils.R)
source(file.path(tryCatch(rprojroot::find_root(rprojroot::is_r_package),
  error = function(e) "."), "inst", "vignette_utils.R"))
```

Risk numbers disagree. The WHO and the Institute for Health Metrics and Evaluation (IHME) report malaria deaths as 550,000 and 760,000 respectively, a 38% gap between two estimates of the *same underlying deaths*. Our World in Data's [Deadliest Animals](https://ourworldindata.org/deadliest-animals) chart is visually compelling, but converting its annual death counts to per-encounter micromorts is non-trivial. This vignette documents how we handle that uncertainty.

## 1. Why Risk Numbers Disagree

Three factors drive disagreement between sources:

1. **Numerator uncertainty**: Death attribution varies by coding system (ICD-10 codes, verbal autopsy, hospital records)
2. **Denominator uncertainty**: How many people were *exposed*? A "deaths per year" figure means nothing without knowing the exposure population
3. **Temporal and geographic aggregation**: A global annual average hides enormous regional and seasonal variation

Our inclusion criteria: **traceable numerator** + **defined denominator** + **reproducible calculation**. We reject risks where we cannot identify both the death count *and* the population at risk.

## 2. The Confidence System

Every entry in `atomic_risks()` carries a `confidence` tier:

```{r confidence-table}
#| echo: false
#| tbl-cap: "Confidence tiers with examples from the micromort dataset"
safe_tar_read("vig_reliability_confidence_tiers") |> knitr::kable()
```

### Validation status (new)

Within each confidence tier, we now track how thoroughly the estimate has been cross-checked:

```{r validation-table}
#| echo: false
#| tbl-cap: "Validation status levels"
safe_tar_read("vig_reliability_validation_defs") |> knitr::kable()
```

```{r validation-summary}
#| echo: false
#| tbl-cap: "Current validation status across all entries"
safe_tar_read("vig_reliability_validation_summary") |> knitr::kable()
```

## 3. Geographic and Health Profile Conditioning

Geography is the biggest source of variation in risk data: the same snake bite ranges from 0.5 micromorts (mm) in the US, with antivenom, to 18.5 mm in rural sub-Saharan Africa, a 37x difference. Health profile conditioning can matter even more: a bee sting is 0.03 mm for the general population but 31 mm for someone with a known allergy (roughly 1,000x).
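
The fold differences quoted above are simple ratios of the highest to the lowest estimate. A minimal sketch, using only the values from the paragraph above (`fold_difference()` is a hypothetical helper, not a package function):

```r
# Values quoted in the text, in micromorts
snake_bite <- c(us_with_antivenom = 0.5, rural_sub_saharan_africa = 18.5)
bee_sting  <- c(general_population = 0.03, known_allergy = 31)

# Hypothetical helper: ratio of the highest to the lowest estimate
fold_difference <- function(x) max(x) / min(x)

fold_difference(snake_bite)  # 37
fold_difference(bee_sting)   # about 1033, quoted as 1,000x
```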

For the full analysis of how geography and demographics reshuffle risk rankings, including disease mortality by country (<a href="https://www.healthdata.org/"><abbr title="Institute for Health Metrics and Evaluation">IHME</abbr></a> <a href="https://www.healthdata.org/research-analysis/gbd"><abbr title="Global Burden of Disease study">GBD</abbr></a> data) and age-conditioned confounders (bed falls, anaesthesia), see the [Confounding Variables](confounding.html) vignette.

The `common_risks()` function supports profile-based filtering:

```r
# Default: returns high-income, all-ages estimates
common_risks()

# Geographic and health profile conditioning
common_risks(profile = list(country = "NG"))
common_risks(profile = list(health_profile = "allergic"))
```

## 4. Cross-Validation Methods

We use five methods to assess data reliability:

### Source triangulation

Compare the same risk across independent sources. For wildlife risks, we cross-reference:

- **[OWID](https://ourworldindata.org/)** (Our World in Data) annual death counts (numerator)
- **[CDC](https://www.cdc.gov/)** (Centers for Disease Control and Prevention) injury surveillance (US denominator)
- **[WHO](https://www.who.int/)** (World Health Organization) fact sheets (global denominator)
- **[ISAF](https://www.floridamuseum.ufl.edu/shark-attacks/)** (International Shark Attack File) species-specific data

### Denominator audit

A missing denominator is the most common failure mode. Does the source report **both** a numerator (deaths) **and** a denominator (exposures)?

| Animal | Numerator available? | Denominator available? | Included? |
|--------|---------------------|----------------------|-----------|
| Shark | Yes (ISAF) | Yes (~100M swims/yr) | Yes |
| Dog | Yes (CDC, WHO) | Yes (4.5M bites US) | Yes |
| Mosquito | Yes (WHO: 600k+) | No per-encounter rate | **No** |
| Crocodile | Yes (CrocBITE) | No exposure estimate | **No** |
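
The audit reduces to a logical AND over the two availability columns. The sketch below re-encodes the table above in base R; the column names are illustrative and are not taken from the package.

```r
# The audit table above, re-encoded as a data frame
audit <- data.frame(
  animal          = c("Shark", "Dog", "Mosquito", "Crocodile"),
  has_numerator   = c(TRUE, TRUE, TRUE, TRUE),
  has_denominator = c(TRUE, TRUE, FALSE, FALSE)
)

# An entry is included only when both pieces are available
audit$included <- audit$has_numerator & audit$has_denominator

audit$animal[audit$included]   # "Shark" "Dog"
audit$animal[!audit$included]  # "Mosquito" "Crocodile"
```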

### Temporal stability

Has the number changed significantly across editions of the source? Stable estimates across 5+ years increase confidence.
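
One way to make this check concrete is to compare the spread of an estimate across editions with its mean. The sketch below is an assumption rather than the package's method: `is_temporally_stable()` is a hypothetical helper, the 25% coefficient-of-variation threshold is arbitrary, and the input values are illustrative placeholders, not real data.

```r
# Hypothetical helper, not part of the micromort package.
# Stable = editions span at least `min_span` years and the
# coefficient of variation (sd / mean) stays below `max_cv`.
is_temporally_stable <- function(years, estimates,
                                 min_span = 5, max_cv = 0.25) {
  stopifnot(length(years) == length(estimates), length(years) >= 2)
  span <- max(years) - min(years)
  cv   <- sd(estimates) / mean(estimates)
  span >= min_span && cv <= max_cv
}

# Illustrative placeholder values only
years    <- c(2015, 2017, 2019, 2021)
steady   <- c(1.0, 1.1, 0.9, 1.0)
drifting <- c(1.0, 2.0, 4.0, 8.0)

is_temporally_stable(years, steady)    # TRUE
is_temporally_stable(years, drifting)  # FALSE
```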

### Geographic consistency

Do US, UK, and global estimates agree within an order of magnitude? Large discrepancies suggest unmeasured confounders (see [Confounding Variables](confounding.html)).
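
"Within an order of magnitude" can be read as a highest-to-lowest ratio of at most 10. A minimal sketch under that reading follows; `within_order_of_magnitude()` is a hypothetical helper, and the first set of values is an illustrative placeholder.

```r
# Hypothetical helper: TRUE when the largest estimate is at most
# 10x the smallest one
within_order_of_magnitude <- function(estimates) {
  stopifnot(all(estimates > 0))
  max(estimates) / min(estimates) <= 10
}

# Illustrative placeholder values, not real data (ratio 6.25)
within_order_of_magnitude(c(us = 0.4, uk = 0.9, global = 2.5))  # TRUE

# Snake bite values quoted in section 3 (ratio 37)
within_order_of_magnitude(c(us = 0.5, rural_ssa = 18.5))        # FALSE
```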

### Order-of-magnitude test

Is the number physically plausible? A micromort value that implies more deaths than the population can support is a red flag.
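
A rough version of this test back-calculates the deaths a micromort value implies and compares them with the population at risk. Both helpers below are hypothetical, and all inputs are illustrative placeholders.

```r
# Hypothetical helpers, not part of the micromort package.
# Deaths per year implied by a per-exposure micromort value
implied_deaths <- function(micromorts, exposures_per_year) {
  micromorts * exposures_per_year / 1e6
}

# Red flag when the implied deaths exceed the population at risk
is_plausible <- function(micromorts, exposures_per_year, population) {
  implied_deaths(micromorts, exposures_per_year) <= population
}

# Illustrative placeholders: 2 million exposures per year,
# population at risk of 1 million
implied_deaths(5, 2e6)         # 10
is_plausible(5, 2e6, 1e6)      # TRUE
is_plausible(6e5, 2e6, 1e6)    # FALSE: implies 1.2 million deaths
```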

## 5. Worked Example: Animal Risks from <a href="https://ourworldindata.org/"><abbr title="Our World in Data">OWID</abbr></a>

Our World in Data reports annual deaths by animal. Converting to per-encounter micromorts requires:

$$\text{micromorts} = \frac{\text{deaths per year}}{\text{encounters per year}} \times 10^6$$
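
The formula translates directly into R. In this sketch `micromorts_per_encounter()` is a hypothetical helper rather than a package function. The 100 million encounters echo the approximate shark denominator from the audit table, while the death count of 10 is a placeholder chosen for easy arithmetic, not a sourced figure.

```r
# Hypothetical helper implementing the formula above
micromorts_per_encounter <- function(deaths_per_year, encounters_per_year) {
  stopifnot(all(encounters_per_year > 0))
  deaths_per_year / encounters_per_year * 1e6
}

# Placeholder numerator (10 deaths), ~100M encounters per year
micromorts_per_encounter(deaths_per_year = 10, encounters_per_year = 1e8)
# 0.1 micromorts per encounter
```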

```{r owid-conversion}
#| echo: false
#| tbl-cap: "Converting Our World in Data (OWID) annual counts to per-encounter micromorts"
safe_tar_read("vig_reliability_owid_conversion") |> knitr::kable()
```

Mosquito, crocodile, and elephant fail our inclusion criteria: there is no defensible per-encounter denominator. Mosquito bites are ubiquitous in endemic regions, making a per-bite risk meaningless. We cite <a href="https://ourworldindata.org/"><abbr title="Our World in Data">OWID</abbr></a> for context but do not include these as micromort entries.

## 6. Estimate Ranges

For wildlife entries, we document plausible ranges reflecting source disagreement:

```{r ranges}
#| echo: false
#| tbl-cap: "Estimate ranges for wildlife entries"
safe_tar_read("vig_reliability_ranges") |> knitr::kable(digits = 2)
```

The range reflects uncertainty in both the numerator (death counts vary by year and reporting) and denominator (exposure estimates are often rough). The point estimate is our best central value; the range brackets the plausible minimum and maximum.
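
One way such a range could be derived is to pair the lowest death count with the highest exposure estimate, and vice versa. This is an assumption about the procedure, not a description of how the table above was built; `micromort_range()` is a hypothetical helper and the inputs are illustrative placeholders.

```r
# Hypothetical helper: bracket the estimate by combining the
# extremes of the numerator and denominator ranges
micromort_range <- function(deaths, encounters) {
  to_mm <- function(d, e) d / e * 1e6
  c(
    low  = to_mm(min(deaths), max(encounters)),
    high = to_mm(max(deaths), min(encounters))
  )
}

# Illustrative placeholders: 5 to 15 deaths per year,
# 50 to 200 million encounters per year
micromort_range(deaths = c(5, 15), encounters = c(5e7, 2e8))
# low = 0.025, high = 0.3
```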

## 7. What You Can Contribute

If you find a better source for an existing entry, or want to propose a new risk, open an issue at [github.com/johngavin/micromort](https://github.com/johngavin/micromort/issues) with:

1. **Numerator**: Death count and source citation
2. **Denominator**: Exposure count and source citation
3. **Geography/condition**: Does the estimate apply globally, or to a specific population?
4. **Time period**: When was the data collected?

Entries start at `validation_status = "single_source"` and get upgraded as more sources confirm them.

### References

*   Spiegelhalter D (2009). "Micromorts." Plus Magazine. [plus.maths.org](https://plus.maths.org/os/issue55/features/risk/index)
*   Spiegelhalter D (2012). "Microlives." Plus Magazine. [plus.maths.org](https://plus.maths.org/content/understanding-uncertainty-microlives)
*   [micromorts.rip](https://micromorts.rip/) — curated micromort database
*   [Wikipedia: Micromort](https://en.wikipedia.org/wiki/Micromort)

## Reproducibility

```{r data_reliability-session-info, eval=TRUE}
sessionInfo()
```

```{r build-info}
#| echo: false
#| results: asis
cat(safe_tar_read("vig_build_info") %||% "")
```