---
title: "Data Sources & Provenance"
format:
  html:
    toc: true
    toc-expand: 2
    toc-location: left
    code-fold: true
    code-summary: "Show code"
vignette: >
  %\VignetteIndexEntry{Data Sources & Provenance}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
#| echo: false
#| results: asis
in_pkgdown <- nzchar(Sys.getenv("IN_PKGDOWN"))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 6
)
if (!in_pkgdown) library(targets)

# Shared vignette utilities (safe_tar_read, show_target, render helpers)
utils_path <- system.file("vignette_utils.R", package = "coMMpass")
if (utils_path == "") {
  # Fall back to source-tree locations when the package is not installed
  candidates <- c("../inst/vignette_utils.R", "inst/vignette_utils.R")
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) stop("Cannot find vignette_utils.R")
  utils_path <- found[1]
}
source(utils_path, local = TRUE)
```

```{r pkgdown-banner}
#| results: asis
#| eval: !expr in_pkgdown
#| echo: false
cat("::: {.callout-note}\n## Online documentation\nThis vignette displays pre-computed results. Run the targets pipeline locally for interactive analysis.\n:::\n")
```

## Overview

This vignette documents all data sources used in the coMMpass analysis
pipeline, their access tiers, and the distinction between **pipeline data**
(real patient data from GDC) and **synthetic test data** (generated by
`example_data()`).

## MMRF CoMMpass Study

The [Multiple Myeloma Research Foundation (MMRF)](https://themmrf.org/)
Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of
Genetic Profile (CoMMpass) study is a longitudinal observational study of
~1,143 newly diagnosed multiple myeloma patients.

- **ClinicalTrials.gov:** [NCT01454297](https://clinicaltrials.gov/study/NCT01454297)
- **GDC Project:** [MMRF-COMMPASS](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS)
- **MMRF Researcher Gateway:** <https://research.themmrf.org/> (not currently accessible, so FISH, PFS, and treatment response data are unavailable)
- **dbGaP Controlled Access:** [phs000748](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000748) (not used by this pipeline; requires institutional IRB approval)

### Citation

> Keats JJ, et al. Interim Analysis of the MMRF CoMMpass Trial, a
> Longitudinal Study in Multiple Myeloma Relating Clinical Outcomes to
> Genomic and Immunophenotypic Profiles. *Blood*. 2013;122(21):532.
> [doi:10.1182/blood.V122.21.532.532](https://doi.org/10.1182/blood.V122.21.532.532)

## Data at a Glance

```{r data-at-glance}
#| echo: false
#| results: asis
show_target("vig_data_at_glance")
```

## Data Access Tiers

| Tier | Data | Access | Used by pipeline |
|------|------|--------|------------------|
| **Open Access** | RNA-seq counts (STAR), clinical metadata, treatment records | GDC Data Portal, no login required | **Yes** |
| **Open Access (S3)** | RNA-seq files | `s3://gdc-mmrf-commpass-phs000748-2-open/` | **Yes** |
| **MMRF Gateway** | FISH/cytogenetics, PFS, treatment response | Free registration (pending access) | **No** — 12 targets blocked |
| **Controlled Access** | WGS, WES, protected clinical | dbGaP application ([phs000748](...)) | **No** — requires institutional IRB |

## Pipeline Data Sources (Real)

The `targets` pipeline downloads real patient data from GDC:

| Data | Source | Function |
|------|--------|----------|
| RNA-seq SummarizedExperiment | [GDC STAR-Counts](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/) | `download_rnaseq_data()` via `TCGAbiolinks::GDCdownload()` |
| Clinical metadata (demographics, ISS, vital status) | [GDC clinical endpoint](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS) | `download_clinical_data()` via `TCGAbiolinks::GDCquery_clinic()` |
| Treatment records (7,184 records, 994 patients) | [GDC API](https://api.gdc.cancer.gov/) | `download_clinical_data()` via GDC REST API |
| Biospecimen metadata | [GDC biospecimen endpoint](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS) | `download_clinical_data()` via `TCGAbiolinks::GDCquery_clinic()` |
| MSigDB gene sets (Hallmark + KEGG) | [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/) | Pre-bundled parquet in `inst/extdata/msigdb/` |
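
As a rough illustration of the TCGAbiolinks route named in the table, the sketch below queries the open-access STAR counts for MMRF-COMMPASS and assembles them into a SummarizedExperiment. It is not the body of `download_rnaseq_data()`: the wrapper name `fetch_star_counts_sketch()` and the `directory` default are assumptions made for this example.

```r
# Hypothetical wrapper; not the package's download_rnaseq_data().
fetch_star_counts_sketch <- function(directory = "GDCdata") {
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    stop("Install TCGAbiolinks from Bioconductor to run this sketch.")
  }
  # Describe the open-access STAR count files for the project
  query <- TCGAbiolinks::GDCquery(
    project       = "MMRF-COMMPASS",
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )
  # Download the files, then assemble them into a SummarizedExperiment
  TCGAbiolinks::GDCdownload(query, directory = directory)
  TCGAbiolinks::GDCprepare(query, directory = directory)
}
```

Calling the function needs network access and downloads every matching file, so it is defined here but not run.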

The pipeline runs with a configurable `sample_limit` parameter (default 200
in local builds; CI uses 20). All vignettes load pre-computed results from
the targets store.
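
How the pipeline resolves `sample_limit` is not shown in this vignette. One way to obtain the 200/20 split described above is to switch on an environment variable. In the sketch below, the helper name and the use of the `CI` variable are assumptions.

```r
# Hypothetical helper: choose a sample limit from the environment.
# Assumes the CI system sets CI=true (GitHub Actions does); the
# pipeline's real mechanism may differ.
resolve_sample_limit <- function(local_default = 200L, ci_default = 20L) {
  on_ci <- tolower(Sys.getenv("CI", unset = "false")) == "true"
  if (on_ci) ci_default else local_default
}

sample_limit <- resolve_sample_limit()
```

Keeping both defaults as arguments makes the limit easy to override in a one-off local run without touching the environment.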

### GDC Documentation

- **GDC Data Portal:** <https://portal.gdc.cancer.gov/>
- **GDC API:** <https://api.gdc.cancer.gov/>
- **GDC Data Dictionary:** <https://docs.gdc.cancer.gov/Data_Dictionary/viewer/>
- **mRNA Analysis Pipeline:** <https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/>
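
For readers who want to inspect the GDC API directly, the following base-R sketch builds a request URL for the `/cases` endpoint, filtered to the MMRF-COMMPASS project. The function name is hypothetical, and the `expand` default is an assumption about where treatment records sit in the GDC data model; the package's own request may be built differently.

```r
# Build a GDC /cases request URL for one project (base R only).
# The `expand` default is an assumption, not taken from the package.
gdc_cases_url <- function(project = "MMRF-COMMPASS",
                          expand = "diagnoses.treatments",
                          size = 10L) {
  # GDC filters are JSON objects passed URL-encoded in the query string
  filters <- sprintf(
    '{"op":"in","content":{"field":"project.project_id","value":["%s"]}}',
    project
  )
  paste0(
    "https://api.gdc.cancer.gov/cases",
    "?filters=", utils::URLencode(filters, reserved = TRUE),
    "&expand=", expand,
    "&size=", size,
    "&format=JSON"
  )
}

url <- gdc_cases_url()
```

The resulting string can be pasted into a browser or passed to any HTTP client; no request is sent by the code above.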

## Synthetic Test Data (`example_data()`)

The `example_data()` function returns small **synthetic** datasets for
testing and documentation examples. They are generated pseudo-randomly
under a fixed seed (`set.seed(42)`), so they are reproducible, and they
contain **no real patient data**.

| Dataset | Size | Description |
|---------|------|-------------|
| `rnaseq_se` | 50 genes x 20 samples | SummarizedExperiment with simulated counts |
| `clinical` | 20 patients | Simulated demographics and outcomes |
| `cytogenetic` | 20 patients | Simulated FISH markers and risk groups |
| `treatment` | ~40 rows | Simulated treatment lines (1-3 per patient) |

The datasets are stored in `inst/extdata/example/*.rds` and are created by
`create_all_example_data()`.
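
To give a feel for what seeded synthetic data of these dimensions looks like, here is a minimal base-R sketch. It is illustrative only and is not the code behind `create_all_example_data()`; the gene and sample identifiers, the negative binomial parameters, and the column names are all made up for this example.

```r
# Illustrative only: not the code behind create_all_example_data().
set.seed(42)
n_genes <- 50
n_samples <- 20

# Simulated count matrix (negative binomial draws), genes x samples
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 100, size = 2),
  nrow = n_genes,
  dimnames = list(
    sprintf("GENE%03d", seq_len(n_genes)),
    sprintf("SAMPLE%02d", seq_len(n_samples))
  )
)

# One to three treatment lines per patient, one row per line
lines_per_patient <- sample(1:3, n_samples, replace = TRUE)
treatment <- data.frame(
  patient_id = rep(colnames(counts), times = lines_per_patient),
  line       = sequence(lines_per_patient)
)
```

With one to three lines per patient, the average is two, so 20 patients yield roughly 40 rows, in line with the size quoted in the table.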

**The two data tracks do not mix:** pipeline vignettes use real GDC data
via `tar_read()`; unit tests and `?example_data` examples use synthetic
data only.
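
Pipeline vignettes read targets through helpers in `vignette_utils.R`, whose implementation is not shown here. The sketch below is a guess at the general idea, assuming that a missing store or target should yield `NULL` rather than stop the render. The function name is hypothetical.

```r
# Hypothetical helper: read a target by name, returning NULL on any failure.
read_target_or_null <- function(name, store = "_targets") {
  if (!requireNamespace("targets", quietly = TRUE)) return(NULL)
  tryCatch(
    targets::tar_read_raw(name, store = store),
    error = function(e) NULL
  )
}

# A store that does not exist yields NULL instead of stopping the render
result <- read_target_or_null("vig_data_at_glance", store = tempfile())
```

Returning `NULL` lets the calling chunk decide whether to print a placeholder or skip the output entirely.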

## MSigDB Gene Sets

Pre-bundled MSigDB gene sets (Hallmark and KEGG) are stored as parquet
files in `inst/extdata/msigdb/`. These are public reference gene sets used
for pathway enrichment analysis.
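
The bundled parquet files can be inspected outside the pipeline. The following sketch loads every parquet file in a directory into a named list using `arrow::read_parquet()`. The helper name is hypothetical, and nothing is assumed about the file names or column layout, which are not documented in this vignette.

```r
# Hypothetical reader: load every parquet file in a directory into a named list.
read_gene_set_tables <- function(dir) {
  files <- list.files(dir, pattern = "\\.parquet$", full.names = TRUE)
  if (length(files) == 0) return(list())
  if (!requireNamespace("arrow", quietly = TRUE)) {
    stop("The 'arrow' package is needed to read parquet files.")
  }
  tables <- lapply(files, arrow::read_parquet)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# system.file() returns "" when the package or directory is not found
msigdb_dir <- system.file("extdata", "msigdb", package = "coMMpass")
gene_sets <- if (nzchar(msigdb_dir)) read_gene_set_tables(msigdb_dir) else list()
```

If the package is not installed, `gene_sets` is simply an empty list, so the snippet is safe to run anywhere.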

### Bundled Files

```{r msigdb-stats}
#| echo: false
#| results: asis
show_target("vig_msigdb_stats")
```

### Citation

> Liberzon A, et al. The Molecular Signatures Database (MSigDB) Hallmark
> Gene Set Collection. *Cell Systems*. 2015;1(6):417-425.
> [doi:10.1016/j.cels.2015.12.004](https://doi.org/10.1016/j.cels.2015.12.004)

- **MSigDB:** <https://www.gsea-msigdb.org/gsea/msigdb/>
- **R package:** [msigdbr](https://cran.r-project.org/package=msigdbr)

## Recent Changes

This section lists recent project commits with lines added, files changed, and change categories.

```{r changelog}
#| echo: false
#| results: asis
show_target("vig_git_changelog")
```

## Reproducibility

<details>
<summary>Session Info (click to expand)</summary>

```{r session-info, eval=TRUE}
sessionInfo()
```

</details>
