---
title: "03. Exploratory Data Analysis"
format:
  html:
    toc: true
    toc-expand: 2
    toc-location: left
    code-fold: true
    code-summary: "Show code"
vignette: >
  %\VignetteIndexEntry{03. Exploratory Data Analysis}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
#| echo: false
#| results: asis
in_pkgdown <- nzchar(Sys.getenv("IN_PKGDOWN"))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 5
)
if (!in_pkgdown) library(targets)

# Shared vignette utilities (safe_tar_read, show_target, render helpers)
utils_path <- system.file("vignette_utils.R", package = "coMMpass")
if (utils_path == "") {
  utils_path <- if (file.exists("../inst/vignette_utils.R")) "../inst/vignette_utils.R"
  else if (file.exists("inst/vignette_utils.R")) "inst/vignette_utils.R"
  else stop("Cannot find vignette_utils.R")
}
source(utils_path, local = TRUE)
```

```{r pkgdown-banner}
#| results: asis
#| eval: !expr in_pkgdown
#| echo: false
cat("::: {.callout-note}\n## Online documentation\nThis vignette displays pre-computed results. Run the targets pipeline locally for interactive analysis.\n:::\n")
```

## Overview

> See the [Glossary](glossary.html) for term definitions used throughout this project.

This vignette covers:

- Patient demographics, biospecimen characteristics, and RNA-seq quality metrics
- DuckDB queries against the underlying parquet files
- Cytogenetic alteration frequencies and co-occurrence patterns

All summaries are **pre-computed** as targets (`vig_*`). Placeholder messages
appear in place of plots when pre-computed outputs are not yet available.

## Clinical Demographics

```{r clinical-demographics, results='asis'}
#| echo: false
show_target("vig_clinical_demographics_text")
```

### Age at Diagnosis

```{r age-histogram, results='asis'}
#| echo: false
show_target("vig_age_histogram")
```

### Gender Distribution

```{r gender-table, results='asis'}
#| echo: false
show_target("vig_gender_table")
```

### Race Distribution

```{r race-table, results='asis'}
#| echo: false
show_target("vig_race_table")
```

### Vital Status

```{r vital-status, results='asis'}
#| echo: false
show_target("vig_vital_table")
```

## ISS Staging

The International Staging System (ISS) classifies multiple myeloma into three
stages based on serum beta-2 microglobulin and albumin levels. Higher stages
indicate more advanced disease and worse prognosis.
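
The staging rule can be sketched in a few lines of base R. This is an illustration only: the function and argument names are hypothetical and are not part of the coMMpass package. The thresholds are the published ISS cut-offs (beta-2 microglobulin in mg/L, albumin in g/dL).

```r
# Hypothetical helper: assign ISS stage from two serum measurements.
# Stage I:   beta-2 microglobulin < 3.5 mg/L and albumin >= 3.5 g/dL
# Stage III: beta-2 microglobulin >= 5.5 mg/L
# Stage II:  everything else
assign_iss_stage <- function(b2m_mg_l, albumin_g_dl) {
  stage <- rep(NA_character_, length(b2m_mg_l))
  # which() drops NA comparisons, so unresolvable cases stay NA
  stage[which(b2m_mg_l < 3.5 & albumin_g_dl >= 3.5)] <- "I"
  stage[which(b2m_mg_l < 3.5 & albumin_g_dl < 3.5)] <- "II"
  stage[which(b2m_mg_l >= 3.5 & b2m_mg_l < 5.5)] <- "II"
  stage[which(b2m_mg_l >= 5.5)] <- "III"
  factor(stage, levels = c("I", "II", "III"))
}

iss_example <- assign_iss_stage(
  b2m_mg_l     = c(2.0, 2.0, 4.0, 6.0, NA),
  albumin_g_dl = c(4.0, 3.0, 4.0, 4.0, 4.0)
)
```

A missing beta-2 microglobulin value leaves the stage as `NA` rather than defaulting to stage II.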

```{r iss-distribution, results='asis'}
#| echo: false
show_target("vig_iss_distribution_text")
```

### ISS Stage Distribution

```{r iss-table, results='asis'}
#| echo: false
show_target("vig_iss_table")
```

```{r iss-barplot, results='asis'}
#| echo: false
show_target("vig_iss_barplot")
```

### Age by ISS Stage

```{r iss-age, results='asis'}
#| echo: false
show_target("vig_iss_age_table")
```

## Survival Endpoints

```{r survival-endpoints, results='asis'}
#| echo: false
show_target("vig_survival_endpoints")
```

## Biospecimen Overview

```{r biospecimen-overview, results='asis'}
#| echo: false
show_target("vig_biospecimen_overview_text")
```

### Sample Types

```{r sample-type-bar}
#| echo: false
#| results: asis
show_target("vig_sample_type_barplot")
```

### Samples per Patient

```{r samples-per-patient}
#| echo: false
#| results: asis
show_target("vig_samples_per_patient_hist")
```

## RNA-seq Quality

```{r rnaseq-quality, results='asis'}
#| echo: false
show_target("vig_rnaseq_quality_text")
```

### Library Size Distribution

```{r library-size-boxplot}
#| echo: false
#| results: asis
show_target("vig_library_size_boxplot")
```

### Genes Detected per Sample

```{r genes-detected-histogram}
#| echo: false
#| results: asis
show_target("vig_genes_detected_hist")
```

## Sample Clustering (PCA)

Principal component analysis on the top 500 most variable genes (VST-transformed)
reveals sample-level structure. Points are colored by clinical covariates to check
for batch effects or biological groupings.
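
The gene-selection and PCA steps can be sketched with base R on simulated counts. This is a minimal sketch under stated assumptions, not the pipeline's implementation: the count matrix is simulated, and `log2(counts + 1)` stands in for the variance-stabilising transformation so that the example needs no Bioconductor packages.

```r
set.seed(1)
n_genes <- 2000
n_samples <- 30

# Simulated count matrix (genes in rows, samples in columns)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 100, size = 5),
  nrow = n_genes,
  dimnames = list(
    paste0("gene", seq_len(n_genes)),
    paste0("sample", seq_len(n_samples))
  )
)

# Assumption: log2 transform used here as a simple stand-in for VST
log_expr <- log2(counts + 1)

# Rank genes by variance across samples and keep the top 500
gene_var <- apply(log_expr, 1, var)
top_genes <- order(gene_var, decreasing = TRUE)[seq_len(min(500, n_genes))]

# prcomp() expects samples in rows, so transpose the gene subset
pca <- prcomp(t(log_expr[top_genes, ]), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# First two components, ready to join with clinical covariates for plotting
scores <- as.data.frame(pca$x[, 1:2])
```

Joining `scores` to a covariate table by sample name gives the data needed to colour points by a clinical variable.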

```{r pca-biplot}
#| echo: false
#| results: asis
show_target("vig_pca_biplot")
```

```{r pca-biplot-vital}
#| echo: false
#| results: asis
show_target("vig_pca_biplot_vital")
```

## Cytogenetic Landscape

Cytogenetic markers from FISH testing are key to myeloma risk stratification
([IMWG 2014 criteria](https://doi.org/10.1200/JCO.2014.55.1519)).
High-risk alterations include t(4;14), t(14;16), del(17p), and gain(1q).
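
As a rough sketch of how alteration frequencies and pairwise co-occurrence can be computed, the example below uses a simulated binary marker matrix. The column names and simulated prevalences are arbitrary placeholders, not CoMMpass results, and Fisher's exact test is one common choice for co-occurrence testing; the pipeline may use a different test.

```r
set.seed(42)
n <- 200

# Simulated marker calls (1 = alteration present); prevalences are arbitrary
cyto <- data.frame(
  t_4_14  = rbinom(n, 1, 0.15),
  t_14_16 = rbinom(n, 1, 0.05),
  del_17p = rbinom(n, 1, 0.10),
  gain_1q = rbinom(n, 1, 0.35)
)

# Frequency of each alteration
freq <- colMeans(cyto)

# Fisher's exact test for every pair of markers
marker_pairs <- combn(names(cyto), 2, simplify = FALSE)
cooc <- do.call(rbind, lapply(marker_pairs, function(p) {
  # Fix factor levels so the table is always 2 x 2
  tab <- table(
    factor(cyto[[p[1]]], levels = 0:1),
    factor(cyto[[p[2]]], levels = 0:1)
  )
  ft <- fisher.test(tab)
  data.frame(
    marker_a   = p[1],
    marker_b   = p[2],
    n_both     = as.integer(tab["1", "1"]),
    odds_ratio = unname(ft$estimate),
    p_value    = ft$p.value
  )
}))

# Adjust for multiple testing across all pairs (Benjamini-Hochberg)
cooc$p_adj <- p.adjust(cooc$p_value, method = "BH")
```

An odds ratio above 1 indicates co-occurrence and a value below 1 indicates mutual exclusivity; because the markers here are simulated independently, no strong association is expected.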

### Alteration Frequencies

```{r cyto-frequency, results='asis'}
#| echo: false
show_target("vig_cyto_frequency_table")
```

### Oncoprint

```{r cyto-oncoprint, fig.height = 5}
#| echo: false
#| results: asis
show_target("vig_cyto_oncoprint")
```

### Co-occurrence Analysis

```{r cyto-cooccurrence, results='asis'}
#| echo: false
show_target("vig_cyto_cooccurrence_table")
```

```{r cyto-cooccurrence-heatmap}
#| echo: false
#| results: asis
show_target("vig_cyto_cooccurrence_heatmap")
```

Cytogenetic markers are used for [survival stratification](survival-analysis.html) and inform [differential expression](differential-expression.html) analysis.

## DuckDB Query Examples

The package provides two functions for querying the parquet files:

- `query_commpass_parquet()` -- one-shot query that returns a data frame
- `get_commpass_tbl()` -- returns a lazy `dplyr::tbl()` for chained operations
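
For readers unfamiliar with DuckDB, the sketch below shows the general technique of querying a parquet file in place, using only the `DBI` and `duckdb` packages. It does not call the two package functions above, whose arguments are not documented here. The table name, column names, and values are made up for illustration.

```r
library(DBI)

# In-memory DuckDB connection
con <- dbConnect(duckdb::duckdb())

# Small made-up table, written to a temporary parquet file
demo <- data.frame(
  gender = c("female", "male", "male", "female", "male", "female"),
  age_at_diagnosis = c(61, 58, 70, 66, 49, 73)
)
dbWriteTable(con, "clinical_demo", demo)

parquet_path <- normalizePath(
  tempfile(fileext = ".parquet"),
  winslash = "/", mustWork = FALSE
)
dbExecute(con, sprintf("COPY clinical_demo TO '%s' (FORMAT PARQUET)", parquet_path))

# One-shot aggregation directly against the parquet file
by_gender <- dbGetQuery(con, sprintf(
  "SELECT gender, COUNT(*) AS n, AVG(age_at_diagnosis) AS mean_age
   FROM read_parquet('%s')
   GROUP BY gender
   ORDER BY gender",
  parquet_path
))

dbDisconnect(con, shutdown = TRUE)
```

DuckDB reads only the columns the query needs, which is why aggregations like this stay fast on wide parquet files.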

### Example: Simple Query

```{r eda-simple-query-code, results='asis'}
safe_tar_read("code_eda_simple_query")
```

```{r duckdb-demo-simple, results='asis'}
#| echo: false
show_target("vig_duckdb_demo_simple")
```

### Example: Aggregation with DuckDB

```{r eda-aggregation-code, results='asis'}
safe_tar_read("code_eda_aggregation")
```

```{r duckdb-demo-agg, results='asis'}
#| echo: false
show_target("vig_duckdb_demo_agg")
```

## Cross-dataset Integration

```{r integration, results='asis'}
#| echo: false
show_target("vig_integration_text")
```

## Clinical-Informed Normalisation

Standard z-scoring treats all variation equally, but for clinical biomarkers
this compresses the clinically important normal range and expands the
uninformative extreme range. The SCOPE method
([Hussain et al. 2024](https://doi.org/10.1038/s41746-024-01189-3)) uses
clinical reference ranges to normalise biomarkers, preserving clinically
meaningful variation.
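
To make the contrast concrete, the sketch below compares a z-score with a simple linear rescaling against a reference range. This is not the SCOPE formula, which is not reproduced here. It only shows the general idea of anchoring a scale to clinical bounds rather than to cohort statistics. The helper name, the simulated values, and the reference range are illustrative assumptions.

```r
# Hypothetical helper: 0 = lower reference bound, 1 = upper reference bound
reference_scale <- function(x, lower, upper) {
  (x - lower) / (upper - lower)
}

set.seed(7)
# Simulated albumin values (g/dL) with a few extreme observations
albumin <- c(rnorm(95, mean = 4.0, sd = 0.4), 1.5, 1.8, 2.0, 6.5, 7.0)

# Cohort-based z-score: location and scale depend on this sample,
# including its extreme values
z <- (albumin - mean(albumin)) / sd(albumin)

# Reference-based scaling: location and scale are fixed by the clinical range
ref_lower <- 3.5  # illustrative bound, not taken from the pipeline
ref_upper <- 5.0  # illustrative bound, not taken from the pipeline
ref_scaled <- reference_scale(albumin, ref_lower, ref_upper)

# Values inside [0, 1] fall within the reference range
within_range <- ref_scaled >= 0 & ref_scaled <= 1
```

Unlike the z-score, the reference-scaled value of a given measurement does not change when the cohort changes, so it can be compared across datasets.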

### Reference Ranges

```{r reference-ranges}
#| echo: false
#| results: asis
show_target("vig_reference_ranges")
```

### Clinical vs Z-Score Comparison

```{r normalisation-comparison}
#| echo: false
#| results: asis
show_target("vig_normalisation_comparison")
```

## Missingness as Signal

Missing data in clinical datasets is rarely random. In multiple myeloma,
FISH testing is expensive and may be selectively ordered for patients with
suspected high-risk disease. If missingness correlates with survival, the
data are not missing completely at random, and the pattern is consistent
with **missing not at random (MNAR)**: the missingness itself is clinically
informative.

### Variable Missingness

```{r missingness-heatmap, fig.height = 10}
#| echo: false
#| results: asis
show_target("vig_missingness_heatmap")
```

```{r missingness-table}
#| echo: false
#| results: asis
show_target("vig_missingness_summary_table")
```

### Does Missingness Predict Survival?

Each Cox regression tests whether patients with a missing value for a given
variable have different survival outcomes from patients with an observed
value. A hazard ratio (HR) significantly different from 1 suggests
the missingness mechanism is informative (MNAR).
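
A minimal sketch of one such test, using the `survival` package on simulated data, is shown here. The variable names are hypothetical, and the effect of missingness on the hazard is built into the simulation, so the output says nothing about the CoMMpass cohort.

```r
library(survival)

set.seed(123)
n <- 500

# Indicator: 1 if the (hypothetical) FISH result is missing for a patient
fish_missing <- rbinom(n, 1, 0.3)

# Simulated event times with a higher hazard when the value is missing
event_time <- rexp(n, rate = 0.05 * exp(0.7 * fish_missing))
censor_time <- runif(n, min = 5, max = 60)

surv_df <- data.frame(
  time = pmin(event_time, censor_time),
  status = as.integer(event_time <= censor_time),  # 1 = event, 0 = censored
  fish_missing = fish_missing
)

# Cox model with the missingness indicator as the only covariate
fit <- coxph(Surv(time, status) ~ fish_missing, data = surv_df)

hr <- exp(coef(fit))          # hazard ratio for missing vs observed
hr_ci <- exp(confint(fit))    # 95% confidence interval
p_value <- summary(fit)$coefficients[, "Pr(>|z|)"]
```

Repeating this for every variable produces one HR per variable; with many variables, the p-values should be adjusted for multiple testing before interpretation.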

```{r missingness-survival}
#| echo: false
#| results: asis
show_target("vig_missingness_vs_survival_table")
```

## Data Sources

Results in this vignette are derived from the
[MMRF CoMMpass study](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS)
(MMRF-COMMPASS, ~1,143 patients), downloaded via
[TCGAbiolinks](https://bioconductor.org/packages/TCGAbiolinks/).
The pipeline runs with a configurable `sample_limit` (default 200; CI uses 20).

For full citations, data access tiers, and the distinction between
pipeline data and synthetic test data, see the
[Data Sources](data-sources.html) vignette.

## Recent Changes

Recent project commits with lines added, files changed, and change categories.

```{r changelog}
#| echo: false
#| results: asis
show_target("vig_git_changelog")
```

## Reproducibility

<details>
<summary>Git Commit Info (click to expand)</summary>

```{r git-info, results='asis'}
#| echo: false
show_target("vig_git_info")
```

</details>

<details>
<summary>Session Info (click to expand)</summary>

```{r session-info, eval=TRUE}
sessionInfo()
```

</details>
