---
title: "01. Data Acquisition & Status"
format:
  html:
    toc: true
    toc-expand: 2
    toc-location: left
    code-fold: true
    code-summary: "Show code"
vignette: >
  %\VignetteIndexEntry{01. Data Acquisition & Status}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
#| echo: false
#| results: asis
in_pkgdown <- nzchar(Sys.getenv("IN_PKGDOWN"))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 5
)
if (!in_pkgdown) library(targets)

# Shared vignette utilities (safe_tar_read, show_target, render helpers)
utils_path <- system.file("vignette_utils.R", package = "coMMpass")
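if (utils_path == "") {
  # Package not installed: fall back to the source tree, trying the
  # vignette directory first and the package root second.
  candidates <- c("../inst/vignette_utils.R", "inst/vignette_utils.R")
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) stop("Cannot find vignette_utils.R")
  utils_path <- found[1]
}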
source(utils_path, local = TRUE)
```

```{r pkgdown-banner}
#| results: asis
#| eval: !expr in_pkgdown
#| echo: false
cat("::: {.callout-note}\n## Online documentation\nThis vignette displays pre-computed results. Run the targets pipeline locally for interactive analysis.\n:::\n")
```

## Overview

> See the [Glossary](glossary.html) for term definitions (read count, library size, gene detection, outlier, etc.).

- RNA-seq and clinical data from [GDC](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS) via the [`targets` pipeline](https://github.com/JohnGavin/coMMpass-analysis/tree/main/R/tar_plans)
- Open Access RNA-seq from the GDC S3 bucket (`s3://gdc-mmrf-commpass-phs000748-2-open/`)
- Clinical metadata via `TCGAbiolinks::GDCquery_clinic()`

## Data Flow

```{mermaid}
%%| fig-cap: "Simplified data flow from GDC to analysis outputs."
flowchart LR
  subgraph GDC["GDC Open Access"]
    R["RNA-seq<br/>STAR Counts"]
    C["Clinical<br/>Metadata"]
    T["Treatment<br/>Records"]
  end

  subgraph Clean["Data Cleaning"]
    SE["Summarized<br/>Experiment"]
    CD["Clinical<br/>DataFrame"]
    TD["Treatment<br/>DataFrame"]
  end

  subgraph Analysis
    DE["Differential<br/>Expression"]
    KM["Survival<br/>Analysis"]
    PA["Pathway<br/>Enrichment"]
    DAG["Causal<br/>DAGs"]
  end

  R --> SE
  C --> CD
  T --> TD
  SE --> DE
  CD --> KM
  TD --> KM
  DE --> PA
  SE --> KM
  CD --> DAG

  style GDC fill:#e8f5e9,stroke:#4CAF50
  style Clean fill:#e3f2fd,stroke:#2196F3
  style Analysis fill:#fce4ec,stroke:#F44336
```

## Pipeline Configuration

This section shows the current pipeline settings, including the sample limit, random seed, and data paths.

```{r config, results='asis'}
#| echo: false
show_target("vig_config_text")
```

> **Note:** This vignette was built with `sample_limit = `r { x <- safe_tar_read("config"); if (!is.null(x)) x$sample_limit else "?" }`` patients. Numbers below reflect this subset, not the full ~900-patient CoMMpass cohort. For pipeline execution details, see [issue #46 (telemetry vignette)](https://github.com/JohnGavin/coMMpass-analysis/issues/46).
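
To make the effect of `sample_limit` concrete, the sketch below draws a reproducible patient subset in base R. The helper name `limit_samples()`, its arguments, the default seed, and the toy patient IDs are all hypothetical; the pipeline's own subsetting logic may differ.

```r
# Hypothetical helper, not the pipeline's actual implementation.
# Returns at most `sample_limit` IDs, reproducibly for a given seed.
limit_samples <- function(patient_ids, sample_limit, seed = 42L) {
  patient_ids <- unique(patient_ids)
  # No limit, or a limit larger than the cohort: keep everyone
  if (is.null(sample_limit) || sample_limit >= length(patient_ids)) {
    return(patient_ids)
  }
  set.seed(seed)  # note: this resets the global RNG state
  sort(sample(patient_ids, size = sample_limit))
}

toy_ids <- sprintf("MMRF_%04d", 1001:1050)  # made-up IDs for illustration
subset_a <- limit_samples(toy_ids, sample_limit = 20)
subset_b <- limit_samples(toy_ids, sample_limit = 20)

length(subset_a)               # size of the drawn subset
identical(subset_a, subset_b)  # same seed, same subset
```

Fixing the seed inside the helper means that two builds with the same `sample_limit` see the same patients, which is what makes the numbers in a subset-based report comparable between runs.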

## RNA-seq Data {#rnaseq-data}

RNA-seq gene expression data are downloaded from GDC and held in a [SummarizedExperiment](https://bioconductor.org/packages/SummarizedExperiment/) object containing gene-level [STAR-Counts](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/).
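
As a rough illustration of this step, the sketch below shows one way STAR-Counts could be requested through TCGAbiolinks. It is an assumption about the general approach, not the pipeline's code: `fetch_star_counts()` and `data_dir` are hypothetical names, and the pipeline may read from the S3 bucket instead. The function is only defined, not called, because it needs network access and the Bioconductor package.

```r
# Sketch only: hypothetical wrapper around real TCGAbiolinks calls.
# Defined but not executed here (requires network and TCGAbiolinks).
fetch_star_counts <- function(data_dir = "GDCdata") {
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    stop("Install TCGAbiolinks from Bioconductor first.")
  }
  # Describe the files wanted: open-access STAR gene counts for CoMMpass
  query <- TCGAbiolinks::GDCquery(
    project = "MMRF-COMMPASS",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )
  # Download the matching files, then assemble them into one object
  TCGAbiolinks::GDCdownload(query, directory = data_dir)
  TCGAbiolinks::GDCprepare(query, directory = data_dir)
}
```

Calling `se <- fetch_star_counts()` should yield an object whose count matrix can be reached with `SummarizedExperiment::assay(se)`, with genes in rows and samples in columns.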

```{r rnaseq-summary-text, results='asis'}
#| echo: false
show_target("vig_rnaseq_summary_text")
```

```{r rnaseq-sample-table}
#| echo: false
#| results: asis
show_target("vig_rnaseq_sample_stats")
```

```{r count-distribution, results='asis'}
#| echo: false
show_target("vig_count_distribution_plot")
```

## Clinical Data

Clinical metadata from [GDC](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS) provides patient demographics, disease characteristics, and outcomes. See the [data dictionary](data-dictionary.html) for variable definitions and the [glossary](glossary.html) for units.

```{r clinical-columns}
#| echo: false
#| results: asis
show_target("vig_clinical_column_info")
```

### Data Completeness

This section reports missing-data rates for each clinical variable and highlights fields with substantial missingness.

```{r missing-data-text, results='asis'}
#| echo: false
show_target("vig_missing_data_text")
```

```{r missing-data-table}
#| echo: false
#| results: asis
show_target("vig_missing_data_table")
```
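
For readers who want to reproduce this kind of completeness summary on their own data, here is a minimal base R sketch. The toy `clinical` table, its column names, and its values are made up for illustration; they are not taken from the CoMMpass data, and the pipeline's own summary code may be organised differently.

```r
# Toy clinical table; column names and values are invented.
clinical <- data.frame(
  age_at_diagnosis = c(61, 58, NA, 70, 66),
  gender = c("male", "female", "female", NA, "male"),
  iss_stage = c("I", NA, NA, "III", NA),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column, most incomplete first
missing_rate <- sort(colMeans(is.na(clinical)), decreasing = TRUE)

missing_table <- data.frame(
  variable = names(missing_rate),
  n_missing = as.integer(colSums(is.na(clinical))[names(missing_rate)]),
  pct_missing = round(100 * missing_rate, 1),
  row.names = NULL
)
missing_table
```

Sorting by missingness puts the most incomplete variables at the top, which makes it easier to decide which fields are usable in downstream models.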

## Quality Control {#quality-control}

Quality control identifies outlier samples based on [library size](glossary.html) and [gene detection](glossary.html) rate. See [`R/02_quality_control.R`](https://github.com/JohnGavin/coMMpass-analysis/blob/main/R/02_quality_control.R) for filtering criteria.

The per-sample expression summary (Total Counts, Genes Detected) is shown in the [RNA-seq Data](#rnaseq-data) section above. This section adds QC-specific metrics: size factors, count dispersion (MAD), and outlier flags.

**Variable definitions** (see also [Glossary](glossary.html)):

- **Total Counts**: [library size](glossary.html), the sum of all [read counts](glossary.html) mapped to genes in one sample
- **Detected Genes**: number of genes with at least 1 mapped [read](glossary.html) (count > 0)
- **Median Count**: median read count across all genes in a sample (most genes have 0 or low counts, so this is typically 0)
- **MAD Count**: [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) of gene counts within a single sample, a measure of the spread of that sample's count distribution
- **Size Factor**: a sample's library size divided by the median library size across all samples (values near 1.0 are typical; values well below 1 indicate under-sequencing, values well above 1 over-sequencing)
- **Outlier**: flagged `Yes` if the sample falls below the **5th percentile** of either library size or genes detected
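
The definitions above can be sketched in base R on a simulated count matrix. This is an illustration under stated assumptions, not the code in `R/02_quality_control.R`: the simulated data, the column names, the use of a strict `<` comparison against the 5th percentile, and the unscaled MAD (`constant = 1`) are all choices made here for the example.

```r
# Simulated counts: 200 genes (rows) x 40 samples (columns).
set.seed(1)
n_genes <- 200
n_samples <- 40
depth <- runif(n_samples, min = 0.5, max = 1.5)  # made-up sequencing depth
counts <- sapply(depth, function(d) rnbinom(n_genes, mu = 20 * d, size = 0.5))
colnames(counts) <- sprintf("S%02d", seq_len(n_samples))

total_counts <- colSums(counts)         # library size per sample
detected_genes <- colSums(counts > 0)   # genes with at least one read

qc <- data.frame(
  sample = colnames(counts),
  total_counts = total_counts,
  detected_genes = detected_genes,
  median_count = apply(counts, 2, median),
  # constant = 1 gives the raw MAD; R's default rescales it by 1.4826
  mad_count = apply(counts, 2, mad, constant = 1),
  size_factor = total_counts / median(total_counts),
  row.names = NULL
)

# Outlier: below the 5th percentile of library size OR of genes detected
qc$outlier <- ifelse(
  qc$total_counts < quantile(qc$total_counts, 0.05) |
    qc$detected_genes < quantile(qc$detected_genes, 0.05),
  "Yes", "No"
)
head(qc)
```

Because the thresholds are percentiles of the current sample set, some samples fall below them in almost any data set (barring ties at the cutoff). The flag therefore marks samples that are low relative to the rest, not necessarily samples that failed sequencing.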

```{r qc-text, results='asis'}
#| echo: false
show_target("vig_qc_text")
```

```{r qc-table}
#| echo: false
#| results: asis
show_target("vig_qc_table")
```

```{r filtered-summary, results='asis'}
#| echo: false
show_target("vig_filtered_summary")
```

## Data Sources

Results in this vignette are derived from the
[MMRF CoMMpass study](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS)
(MMRF-COMMPASS, ~1,143 patients), downloaded via
[TCGAbiolinks](https://bioconductor.org/packages/TCGAbiolinks/).
The pipeline runs with a configurable `sample_limit` (default 200; CI uses 20).

For full citations, data access tiers, and the distinction between
pipeline data and synthetic test data, see the
[Data Sources](data-sources.html) vignette.

## Recent Changes

This section lists recent project commits with lines added, files changed, and change categories.

```{r changelog}
#| echo: false
#| results: asis
show_target("vig_git_changelog")
```

## Reproducibility

<details>
<summary>Git Commit Info (click to expand)</summary>

```{r git-info, results='asis'}
#| echo: false
show_target("vig_git_info")
```

</details>

<details>
<summary>Session Info (click to expand)</summary>

```{r session-info, eval=TRUE}
sessionInfo()
```

</details>
