---
title: "Confounding Variables in Risk Data"
format:
  html:
    code-fold: true
    code-summary: "Show code"
vignette: >
  %\VignetteIndexEntry{Confounding Variables in Risk Data}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = ""
)

if (requireNamespace("micromort", quietly = TRUE)) {
  library(micromort)
} else {
  pkgload::load_all(quiet = TRUE)
}

# Shared safe_tar_read with RDS fallback (see inst/vignette_utils.R)
source(file.path(tryCatch(rprojroot::find_root(rprojroot::is_r_package),
  error = function(e) "."), "inst", "vignette_utils.R"))
```

Population-average risk statistics can be dangerously misleading when **confounding variables** — unmeasured or ignored factors that correlate with both exposure and outcome — drive most of the variation. This vignette illustrates how conditioning on the right variable can change a risk estimate by orders of magnitude.

For data quality criteria and denominator problems that motivate this analysis, see the [Data Quality section](introduction.html#data-quality-the-denominator-problem) of the introduction. For the conditional risk functions used throughout this package, see [Conditional Risks](introduction.html#conditional-risks-cancer-vaccination-and-risk-hedging).

## 1. Simpson's Paradox in Risk Data

**Simpson's paradox** occurs when a trend that appears in aggregated data reverses or disappears when the data is split by a confounding variable. Risk data is especially vulnerable because:

- **Exposure varies by subgroup**: Not everyone faces the same hazard equally
- **Susceptibility varies by subgroup**: Age, genetics, and occupation change vulnerability
- **Reporting conflates subgroups**: A single "micromorts per year" figure averages across vastly different populations

The result: a population-average micromort value may describe *nobody* accurately.
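
A minimal sketch of the reversal, using invented counts: the activities `A` and `B`, the two age groups, and every number below are hypothetical and do not come from the package data. Within each age group activity A is riskier per exposure, yet the pooled figures make B look riskier because most of B's participants are old.

```r
# Hypothetical counts chosen only to illustrate Simpson's paradox.
strata <- data.frame(
  age       = c("young", "young", "old", "old"),
  activity  = c("A", "B", "A", "B"),
  exposures = c(9e6, 1e6, 1e6, 9e6),
  deaths    = c(18, 1, 50, 360)
)

# Micromorts per exposure within each age group
strata$micromorts <- strata$deaths / strata$exposures * 1e6

# Pooled micromorts per exposure, ignoring age
pooled <- aggregate(cbind(deaths, exposures) ~ activity, data = strata, FUN = sum)
pooled$micromorts <- pooled$deaths / pooled$exposures * 1e6

print(strata)
print(pooled)
```

Stratified, A is riskier in both groups (2 vs 1 and 50 vs 40 micromorts). Pooled, B appears about five times riskier (36.1 vs 6.8 micromorts), purely because of who takes part in each activity.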

## 2. Flagship Example: Bed Falls (Age as Confounder)

Falling out of bed kills ~450 Americans per year ([CPSC](https://www.cpsc.gov/Newsroom/News-Releases/2022/Older-Americans-Are-More-Likely-to-Suffer-Fatalities-from-Falls-and-Fire-CPSC-Report-Highlights-Hidden-Hazards-Around-the-Home)), a population average of **1.36 micromorts/year**. But age is a massive confounder: the [CDC age-stratified data](https://www.cdc.gov/nchs/products/databriefs/db532.htm) reveals a roughly **2,550-fold** difference in per-night risk between the lowest and highest rows:

| Age group | Sex | Fall deaths per 100,000/year | Micromorts per night |
|-----------|-----|------------------------------|----------------------|
| Under 65 | Both | ~0.4 | **0.004** |
| 65--74 | Male | 24.7 | **0.68** |
| 65--74 | Female | 14.2 | **0.39** |
| 85+ | Male | 373.3 | **10.2** |
| 85+ | Female | 319.7 | **8.8** |
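
The per-night column is a unit conversion of the annual rate, sketched below for the four sex-specific rows. The US population of ~330 million is an assumption (the source quotes only the death count and the resulting average), and the variable names are illustrative.

```r
# Fall deaths per 100,000 per year, copied from the table above
fall_rates <- data.frame(
  group = c("65-74 male", "65-74 female", "85+ male", "85+ female"),
  deaths_per_100k_year = c(24.7, 14.2, 373.3, 319.7)
)

# 1 death per 100,000 = 10 deaths per million = 10 micromorts
fall_rates$micromorts_per_year  <- fall_rates$deaths_per_100k_year * 10
fall_rates$micromorts_per_night <- fall_rates$micromorts_per_year / 365

# Population average: ~450 deaths spread over everyone
us_population <- 330e6  # assumption, not stated in the source
population_average <- 450 / us_population * 1e6

print(fall_rates)
print(round(population_average, 2))
```

Rounded to one decimal place, the 85+ rows reproduce the 10.2 and 8.8 micromorts per night shown in the table.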

### What 10 micromorts per night means

For an 85-year-old man, going to bed carries ~10 micromorts — comparable to:

- Riding a motorcycle 60 miles (10 micromorts/trip)
- A single ecstasy dose (13 micromorts/dose)
- A day of skiing (0.7 micromorts/day) repeated **14 times**

For someone under 65, the same activity carries 0.004 micromorts per night — essentially zero. The population average of 1.36/year describes neither group accurately.

### The confounding mechanism

Age confounds the bed fall risk through two pathways:

1. **Fragility**: Older adults have lower bone density, slower reflexes, and higher complication rates from identical falls
2. **Bed type and environment**: Hospital beds, care home beds, and medication-induced drowsiness increase fall frequency in older populations

Neither pathway is captured by the aggregate statistic.

## 3. Further Examples

### 3.1 Bee and wasp stings (allergy as confounder)

Bee and wasp stings kill 72 Americans per year ([CDC MMWR](https://www.cdc.gov/mmwr/volumes/68/wr/mm6829a5.htm)). The population rate is 0.22 micromorts/year. But nearly all fatalities are among the ~1% with venom allergy ([JACI, 2015](https://doi.org/10.1016/j.jaci.2015.07.017)):

| Subgroup | Prevalence | Risk per sting |
|----------|------------|----------------|
| No allergy (~99%) | 327M people | **~0 micromorts** |
| Venom allergy (~1%) | 3.3M people | **~22 micromorts/sting** |

The confounder (allergy status) is binary and creates an extreme bimodal distribution. The population average — 0.22 micromorts/year — is meaningless for both groups.
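
The effect of the denominator can be checked with a short calculation. This sketch assumes a US population of ~330 million and attributes every death to the allergic subgroup (the text says "nearly all"). Note that dividing deaths by allergic people gives a figure per allergic person per year; it matches the table's per-sting value only if each allergic person is stung about once a year, which is an assumption not stated in the source.

```r
deaths_per_year <- 72      # CDC figure quoted above
us_population   <- 330e6   # assumption: approximate US population
allergic_share  <- 0.01    # ~1% with venom allergy, as quoted above

# Everyone in the denominator
population_average <- deaths_per_year / us_population * 1e6

# Only allergic people in the denominator (simplification: all deaths
# are assigned to this subgroup)
allergic_rate <- deaths_per_year / (us_population * allergic_share) * 1e6

round(c(population_average = population_average, allergic = allergic_rate), 2)
```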

### 3.2 Cow trampling (occupation as confounder)

Cattle kill 22 Americans per year ([CDC MMWR](https://www.cdc.gov/mmwr/volumes/64/wr/mm6446a8.htm)). Population rate: 0.07 micromorts/year. But exposure is concentrated among ~2.9 million cattle workers:

| Subgroup | Population | Micromorts/year |
|----------|------------|-----------------|
| General public | ~328M | **~0** |
| Cattle farmers | ~2.9M | **~7.5** |

That is roughly a **100-fold** gap between cattle workers and the population average (~7.5 vs 0.07 micromorts/year). Occupation is the confounder: it determines both exposure frequency (daily cattle handling vs never) and risk magnitude (confined spaces, agitated animals, kick zones).

### 3.3 Lightning strike (outdoor work as confounder)

Lightning kills 28 Americans per year ([NOAA](https://www.weather.gov/safety/lightning-fatalities)). Population rate: 0.08 micromorts/year. But outdoor agricultural workers face ~15x the risk:

| Subgroup | Micromorts/year |
|----------|-----------------|
| Indoor worker | **~0.02** |
| Outdoor recreational | **~0.3** |
| Outdoor agricultural worker | **~1.2** |

The confounders are occupation and behaviour: time spent outdoors, in open fields, near tall objects, and during storm season.

### 3.4 Drowning (age and setting as confounders)

Drowning kills ~4,000 Americans per year ([CDC](https://www.cdc.gov/drowning/data/index.html)). The population rate is ~12 micromorts/year. But the risk distribution is bimodal:

| Subgroup | Drowning deaths per 100,000/year | Micromorts/year |
|----------|----------------------------------|-----------------|
| Children 1--4 | **7.6** | **76** |
| Adults 25--64 | **1.2** | **12** |
| Males (all ages) | **3.5** | **35** |
| Females (all ages) | **0.8** | **8** |

Age and sex are strong confounders. Setting matters too: swimming pools (children), natural water (adults), and bathtubs (elderly, alcohol-related) each have distinct risk profiles that the aggregate hides.

## 4. Recognising Confounders in Risk Data

A confounding variable must satisfy two conditions:

1. **Correlated with exposure**: The confounder determines who is exposed (e.g., farmers are exposed to cattle; office workers are not)
2. **Correlated with outcome**: The confounder affects the probability of death given exposure (e.g., age affects fall mortality; allergy affects sting mortality)

### Warning signs of confounded risk data

| Warning sign | Example |
|-------------|---------|
| Risk applies to "general population" | Cow trampling at 0.07 micromorts/year |
| Denominator is "per year" for an activity not everyone does | Horse riding at 0.5 micromorts/ride conflated with per-year |
| No age stratification | Bed fall deaths without age breakdown |
| No occupational stratification | Lightning deaths without indoor/outdoor split |
| Dramatic differences between sources | Different studies report 10x different values for the same activity |

### What to do

When you encounter a population-average risk:

1. **Ask "conditional on what?"** — identify the most likely confounders (age, sex, occupation, geography, pre-existing conditions)
2. **Seek stratified data** — government agencies (CDC, CPSC, NOAA) often publish age- and sex-stratified breakdowns
3. **Calculate conditional rates** — use `conditional_risk()` from this package to compare hedged vs unhedged scenarios
4. **Report the range, not the average** — a range like "0.004--10.2 micromorts/night depending on age" is more informative than "1.36 micromorts/year"
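
Step 4 can be sketched in a few lines of base R. `risk_range_label()` is a hypothetical helper written for this illustration, not a function of the package, and it deliberately avoids `conditional_risk()`, whose arguments are not shown in this vignette. The input values are the per-night figures from the bed-fall table in section 2.

```r
# Hypothetical helper (not part of the micromort package): describe
# stratified values as a range with its fold spread.
risk_range_label <- function(values, unit) {
  stopifnot(is.numeric(values), length(values) > 0,
            !anyNA(values), all(values > 0))
  rng <- range(values)
  sprintf("%s--%s %s (%s-fold spread)",
          format(rng[1]), format(rng[2]), unit,
          format(round(rng[2] / rng[1]), big.mark = ","))
}

# Micromorts per night by age/sex group, from the table in section 2
bed_fall_per_night <- c(0.004, 0.68, 0.39, 10.2, 8.8)
risk_range_label(bed_fall_per_night, "micromorts/night")
```

The helper rejects zero or missing values because a fold spread is undefined for them; subgroups with effectively zero risk are better reported separately.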

## 5. Geographic Confounding: Snake Bites

Geography is arguably the most powerful confounder in risk data. The same encounter — a snake bite — has vastly different outcomes depending on location:

```{r snake-geography}
#| echo: false
show_target("vig_confounding_snake_geography",
  caption = "Snake bite micromorts by geography. 37x difference US vs rural sub-Saharan Africa driven by antivenom access and hospital proximity. Source: atomic_risks(), WHO/CDC.")
```

The 37x difference between the US and rural sub-Saharan Africa reflects differences in antivenom availability, hospital proximity, and emergency transport infrastructure — not differences in snake venom potency. A population-average snake bite risk that blends these geographies would be misleading for everyone: too high for Americans, too low for rural Africans.

The same pattern applies to dog bites (24x difference driven by the availability of rabies post-exposure prophylaxis, PEP). For more on the systematic framework behind these geographic estimates, see the [Data Reliability](data_reliability.html) vignette.

## 6. Geography of Disease: How Country Reshuffles Daily Risk

The leading causes of death vary dramatically by country — and not just for infectious diseases. Using [IHME Global Burden of Disease 2023](https://ourworldindata.org/what-do-people-die-from-in-different-countries) age-standardised death rates, we can express chronic disease mortality as daily micromorts.

### Disease death rates by country

```{r disease-by-country}
#| echo: false
show_target("vig_confounding_disease_by_country",
  caption = "Daily micromorts by disease cause and country (IHME Global Burden of Disease 2023). Cardiovascular disease (CVD) is among the leading causes everywhere; diarrheal disease shows 68x UK-India gap driven by sanitation infrastructure. Source: atomic_risks() Parts 11a-11d.")
```

```{r disease-dynamic, echo=FALSE}
# Daily micromorts by cause and country; the static fallbacks below are used
# when the cached target is unavailable.
disease_data <- safe_tar_read("vig_confounding_disease_by_country")
if (!is.null(disease_data)) {
  # Look up one activity_id in one country column
  .mm <- function(id, country) {
    disease_data[disease_data$activity_id == id, country][[1]]
  }
  uk_cvd <- .mm("daily_cvd_mortality", "UK")
  in_cvd <- .mm("daily_cvd_mortality", "IN")
  cvd_ratio <- round(in_cvd / uk_cvd, 1)
  uk_diarrheal <- .mm("daily_diarrheal_mortality", "UK")
  in_diarrheal <- .mm("daily_diarrheal_mortality", "IN")
  diarrheal_ratio <- round(in_diarrheal / uk_diarrheal)
  uk_cancer <- .mm("daily_cancer_mortality", "UK")
  in_cancer <- .mm("daily_cancer_mortality", "IN")
} else {
  uk_cvd <- 3.24; in_cvd <- 7.65; cvd_ratio <- 2.4
  uk_diarrheal <- 0.02; in_diarrheal <- 1.28; diarrheal_ratio <- 68
  uk_cancer <- 3.95; in_cancer <- 2.30
}
```

Key findings:

- **[Cardiovascular disease (CVD)](https://www.who.int/health-topics/cardiovascular-diseases)** is among the leading killers everywhere, at `r uk_cvd` mm/day in the UK and `r in_cvd` mm/day in India (`r cvd_ratio`x ratio). For context, a single skydive is 8 mm, equivalent to about `r round(8 / uk_cvd, 1)` days of background <a href="https://www.who.int/health-topics/cardiovascular-diseases"><abbr title="Cardiovascular disease: diseases of the heart and blood vessels">CVD</abbr></a> risk in the UK and `r round(8 / in_cvd, 1)` in India.
- **Diarrheal diseases** show the starkest gap: `r uk_diarrheal` mm/day in the UK vs `r in_diarrheal` mm/day in India — a `r format(diarrheal_ratio, big.mark=",")`x difference driven by clean water and sanitation infrastructure.
- **Cancer** is *higher* in the UK (`r uk_cancer` mm/day) than India (`r in_cancer` mm/day) — a counterintuitive finding. As the [OWID data insight](https://ourworldindata.org/data-insights/richer-countries-dont-just-avoid-infectious-disease-they-also-have-lower-rates-of-chronic-disease-deaths) notes, richer countries avoid both infectious *and* chronic disease deaths, but cancer is the exception where age structure and screening detection inflate high-income rates.
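
The skydive comparison can be reproduced from the static fallback values in the chunk above (the rendered figures may differ if the cached data has been updated). The variable names are illustrative.

```r
# Fallback CVD values from the chunk above (micromorts per day)
cvd_mm_per_day <- c(UK = 3.24, IN = 7.65)
skydive_mm <- 8  # single skydive, as quoted in the text

# Days of background CVD risk that add up to one skydive
days_equivalent <- skydive_mm / cvd_mm_per_day
round(days_equivalent, 1)

# Implied annual death rates per 100,000 (micromorts/day * 365 / 10)
round(cvd_mm_per_day * 365 / 10, 1)
```

With these values one skydive corresponds to about 2.5 days of background CVD risk in the UK and about one day in India.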

### Top-15 ranking: UK vs Nigeria

```{r disease-whatif}
#| echo: false
show_target("vig_confounding_disease_whatif",
  caption = "Top-15 risk ranking: UK vs Nigeria profile (daily micromorts). Chronic diseases dominate both; diarrheal disease enters Nigeria's top-5 but is negligible in UK. Source: common_risks() with country profile filter.")
```

Switching from a UK to a Nigeria profile shows that chronic diseases dominate daily risk far more than any acute activity, and that the gap between countries is structural rather than a matter of individual behaviour.

## 7. Demographic What-If: How Age Reshuffles the Top 10

Population-average rankings can shift dramatically when conditioned on age. The `micromort` package now supports `condition_variable = "age"` for bed falls, elective anaesthesia, and bathing — activities where age is the dominant confounder.

### Bed falls by age group

```{r bed-fall-age}
#| echo: false
show_target("vig_confounding_bed_fall_age",
  caption = "Bed fall micromorts per night by age group. 2,550x range from under-65 (0.004 mm) to 85+ male (10.2 mm). Age is the dominant confounder. Source: atomic_risks() Part 10, CDC Data Brief 532.")
```

```{r bed-fall-dynamic, echo=FALSE}
# Fallback values (documented baseline: CDC Data Brief 532)
bed_85m_fallback <- 10.2
bed_u65_fallback <- 0.004
bed_ratio_fallback <- 2550

bed_age <- safe_tar_read("vig_confounding_bed_fall_age")
if (!is.null(bed_age)) {
  rows_85m <- dplyr::filter(bed_age, condition_value == "85_plus_male")
  rows_u65 <- dplyr::filter(bed_age, condition_value == "under_65")
  if (nrow(rows_85m) != 1L || nrow(rows_u65) != 1L) {
    cli::cli_abort(c(
      "x" = "Expected exactly one row per condition_value in {.code vig_confounding_bed_fall_age}.",
      "i" = "{.val 85_plus_male}: {nrow(rows_85m)} row(s), {.val under_65}: {nrow(rows_u65)} row(s)."
    ))
  }
  bed_85m <- dplyr::pull(rows_85m, micromorts)
  bed_u65 <- dplyr::pull(rows_u65, micromorts)
  bed_ratio <- round(bed_85m / bed_u65)
} else {
  bed_85m <- bed_85m_fallback
  bed_u65 <- bed_u65_fallback
  bed_ratio <- bed_ratio_fallback
}
```

For an 85-year-old man, a bed fall carries **`r bed_85m` micromorts per night**, `r format(bed_ratio, big.mark=",")`x the risk for someone under 65 (`r bed_u65` mm). This single activity, invisible in population-average rankings, carries about the same risk as a 60-mile motorcycle ride (10 micromorts).

### Top-15 ranking: default vs 85+ male

```{r demographic-whatif}
#| echo: false
show_target("vig_confounding_demographic_whatif",
  caption = "Top-15 risk ranking: population average vs 85+ male profile. Bed fall (10.2 mm/night) and anaesthesia (50 mm) enter the top-5 for the elderly; both are invisible in population-average rankings. Source: common_risks() with age profile filter.")
```

```{r whatif-dynamic, echo=FALSE}
# Fallback values (documented baseline: CDC Data Brief 532 / NHSLA)
anaes_default_fallback <- 2
anaes_aged_fallback <- 50
bath_default_fallback <- 0.07
bath_aged_fallback <- 0.5

whatif <- safe_tar_read("vig_confounding_demographic_whatif")
if (!is.null(whatif)) {
  .pull1 <- function(df, act, col, fallback = NA_real_) {
    rows <- dplyr::filter(df, activity == act)
    if (nrow(rows) != 1L) {
      # Row absent from cached RDS (pre-dates new activity names): use fallback
      return(fallback)
    }
    dplyr::pull(rows, !!rlang::sym(col))
  }
  anaes_default <- .pull1(whatif, "General anaesthesia (elective)", "default_mm", anaes_default_fallback)
  anaes_aged   <- .pull1(whatif, "General anaesthesia (elective)", "aged_mm", anaes_aged_fallback)
  bath_default <- .pull1(whatif, "Taking a bath (age-conditioned)", "default_mm", bath_default_fallback)
  bath_aged    <- .pull1(whatif, "Taking a bath (age-conditioned)", "aged_mm", bath_aged_fallback)
} else {
  anaes_default <- anaes_default_fallback
  anaes_aged    <- anaes_aged_fallback
  bath_default  <- bath_default_fallback
  bath_aged     <- bath_aged_fallback
}
```

Key shifts for an 85-year-old male:

- **Bed fall** enters the top 15 (absent from the default ranking entirely)
- **General anaesthesia (elective)** jumps from ~`r round(anaes_default, 0)` mm to `r round(anaes_aged, 0)` mm — a routine procedure becomes high-risk
- **Taking a bath** rises from `r round(bath_default, 2)` mm to `r round(bath_aged, 1)` mm
- Activities with no age conditioning (mountaineering, COVID-19) remain unchanged

This demonstrates why `common_risks(profile = list(age = "85_plus_male"))` gives a more honest risk picture for an elderly user than the population average.
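
The reshuffling can be illustrated without the package, using only figures quoted in this vignette. This is a sketch, not output from `common_risks()`. It assumes that skydiving and skiing have no age conditioning, and it uses the under-65 bed-fall value as a stand-in for the default.

```r
# Illustrative subset; values are figures quoted in this vignette.
risks <- data.frame(
  activity   = c("Skydive", "General anaesthesia (elective)", "Skiing (day)",
                 "Taking a bath", "Bed fall (night)"),
  default_mm = c(8, 2, 0.7, 0.07, 0.004),  # bed fall: under-65 stand-in
  aged_mm    = c(8, 50, 0.7, 0.5, 10.2)    # 85+ male profile
)

# Rank 1 = highest risk under each profile
risks$default_rank <- rank(-risks$default_mm, ties.method = "min")
risks$aged_rank    <- rank(-risks$aged_mm, ties.method = "min")

risks[order(risks$aged_rank), ]
```

In this subset, bed falls move from last place to second and anaesthesia overtakes skydiving, while the unconditioned activities keep their values and simply drop in rank.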

## 8. Implications for the Micromort Package

This package addresses confounding in several ways:

- **Geographic conditioning** via `filter_by_profile(list(geography = "low_income"))` compares high- and low-income variants of the same risk
- **`conditional_risk()`** and **`hedged_portfolio()`** explicitly compare conditioned subgroups ([documentation](../reference/conditional_risk.html))
- **`cancer_risks()`** stratifies by sex, age group, and family history
- **`vaccination_risks()`** stratifies by age group and vaccine type
- **`regional_life_expectancy()`** stratifies by geography, capturing regional confounders
- **Data quality criteria** in the [Introduction](introduction.html#data-quality-the-denominator-problem) exclude risks with unknown denominators that would mask confounding

The general principle: **a micromort value is only as good as its denominator and conditioning variables**.

## References

- Spiegelhalter D (2012). "Using speed of ageing and 'microlives' to communicate the effects of lifetime habits and environment." *BMJ* 345:e8223. [doi:10.1136/bmj.e8223](https://doi.org/10.1136/bmj.e8223)
- CDC MMWR (2019). "Hymenoptera stings." [cdc.gov/mmwr/volumes/68/wr/mm6829a5.htm](https://www.cdc.gov/mmwr/volumes/68/wr/mm6829a5.htm)
- CDC Data Brief 532. "Deaths from unintentional falls." [cdc.gov/nchs/products/databriefs/db532.htm](https://www.cdc.gov/nchs/products/databriefs/db532.htm)
- CPSC (2022). "Older Americans Are More Likely to Suffer Fatalities from Falls." [cpsc.gov](https://www.cpsc.gov/Newsroom/News-Releases/2022/Older-Americans-Are-More-Likely-to-Suffer-Fatalities-from-Falls-and-Fire-CPSC-Report-Highlights-Hidden-Hazards-Around-the-Home)
- NOAA. "Lightning fatalities." [weather.gov/safety/lightning-fatalities](https://www.weather.gov/safety/lightning-fatalities)
- CDC. "Drowning data." [cdc.gov/drowning/data](https://www.cdc.gov/drowning/data/index.html)
- Golden DB et al. (2015). "Stinging insect hypersensitivity." *JACI* 135(6):1429--35. [doi:10.1016/j.jaci.2015.07.017](https://doi.org/10.1016/j.jaci.2015.07.017)

## Reproducibility

```{r confounding-session-info, eval=TRUE}
sessionInfo()
```

```{r build-info}
#| echo: false
#| results: asis
cat(safe_tar_read("vig_build_info") %||% "")
```
