---
title: "02. Data Dictionary"
format:
  html:
    toc: true
    toc-expand: 2
    toc-location: left
    code-fold: true
    code-summary: "Show code"
vignette: >
  %\VignetteIndexEntry{02. Data Dictionary}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
#| echo: false
#| results: asis
in_pkgdown <- nzchar(Sys.getenv("IN_PKGDOWN"))
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 5
)
if (!in_pkgdown) library(targets)

# Shared vignette utilities (safe_tar_read, show_target, render helpers)
utils_path <- system.file("vignette_utils.R", package = "coMMpass")
if (utils_path == "") {
  utils_path <- if (file.exists("../inst/vignette_utils.R")) "../inst/vignette_utils.R"
  else if (file.exists("inst/vignette_utils.R")) "inst/vignette_utils.R"
  else stop("Cannot find vignette_utils.R")
}
source(utils_path, local = TRUE)
```

```{r pkgdown-banner}
#| results: asis
#| eval: !expr in_pkgdown
#| echo: false
cat("::: {.callout-note}\n## Online documentation\nThis vignette displays pre-computed results. Run the targets pipeline locally for interactive analysis.\n:::\n")
```

## Overview

> See the [Glossary](glossary.html) for term definitions and the [units reference](glossary.html#units-reference) used throughout this project.

- Describes all variables in the MMRF CoMMpass dataset from the [GDC](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS)
- CoMMpass: longitudinal observational study of ~1,000 newly diagnosed multiple myeloma patients
- Variables span clinical demographics, biospecimen metadata, and RNA-seq gene expression

**Data sources:**

- [GDC Portal - MMRF-COMMPASS](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS)
- [GDC Data Dictionary](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/)
- [MMRF CoMMpass Study](https://themmrf.org/research/commpass-study/)

## Complete Data Dictionary

Searchable table of all variables available in the CoMMpass dataset with types, categories, and descriptions.

```{r dictionary-table}
#| echo: false
#| results: asis
show_target("vig_dd_dictionary_table")
```

## Clinical Data

- Patient demographics, disease characteristics, and outcomes
- Downloaded via `TCGAbiolinks::GDCquery_clinic()`
- All time/age variables in **days** (see [units reference](glossary.html#units-reference))
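
The download step named above can be sketched as follows. This is a minimal sketch, assuming `TCGAbiolinks` is installed from Bioconductor and that GDC is reachable over the network. The wrapper name `fetch_commpass_clinical()` is hypothetical and is not part of this package; the pipeline's own download target may differ.

```r
# Hypothetical wrapper (not part of this package): fetch the clinical table
# for a GDC project using TCGAbiolinks::GDCquery_clinic().
fetch_commpass_clinical <- function(project = "MMRF-COMMPASS") {
  # Fail early with a clear message when the Bioconductor package is missing.
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    stop("TCGAbiolinks is required; install it from Bioconductor.")
  }
  # Returns a data frame of clinical fields; requires network access to GDC.
  TCGAbiolinks::GDCquery_clinic(project = project, type = "clinical")
}

# Usage (not run here because it needs network access):
# clinical <- fetch_commpass_clinical()
# str(clinical[, 1:5])
```

Remember that any age or time column in the returned table is expressed in days.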

### Sample Rows

```{r clinical-preview}
#| echo: false
#| results: asis
show_target("vig_clinical_preview")
```

```{r clinical-summary-text}
#| echo: false
#| results: asis
show_target("vig_dd_clinical_text")
```

```{r clinical-summary-table}
#| echo: false
#| results: asis
show_target("vig_dd_clinical_table")
```

## Biospecimen Data

- Tissue samples collected from patients
- Includes sample type, tissue type, and preservation method

```{r biospecimen-summary-text}
#| echo: false
#| results: asis
show_target("vig_dd_biospecimen_text")
```

```{r biospecimen-summary-table}
#| echo: false
#| results: asis
show_target("vig_dd_biospecimen_table")
```

## RNA-seq Data

- RNA-seq gene expression generated using the STAR aligner
- Quantified against GENCODE annotations
- Available in multiple quantification types (raw counts, TPM, FPKM)
- See [units reference](glossary.html#units-reference) for quantification units
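
To make the relationship between the quantification types concrete, the sketch below derives TPM and FPKM from a toy raw-count matrix using the textbook definitions. The gene names, counts, and gene lengths are made up for illustration, and the function names are hypothetical. The GDC pipeline ships its own precomputed values, and its exact denominators may differ from this simplified version.

```r
# Toy raw counts: 3 genes x 2 samples (illustrative values only).
counts <- matrix(
  c(100, 200, 300,
     10,  20,  30),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)
# Assumed gene lengths in base pairs (illustrative values only).
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# TPM: length-normalise first, then scale each sample to sum to one million.
counts_to_tpm <- function(counts, length_bp) {
  rate <- counts / (length_bp / 1000)   # reads per kilobase, row-wise
  t(t(rate) / colSums(rate)) * 1e6      # per-sample scaling, column-wise
}

# FPKM: depth-normalise first (per million reads), then divide by length in kb.
counts_to_fpkm <- function(counts, length_bp) {
  per_million <- colSums(counts) / 1e6
  t(t(counts) / per_million) / (length_bp / 1000)
}

tpm  <- counts_to_tpm(counts, gene_length_bp)
fpkm <- counts_to_fpkm(counts, gene_length_bp)
colSums(tpm)   # every sample sums to one million by construction
```

TPM columns always sum to one million, which makes values comparable across samples; FPKM columns do not share a fixed total.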

```{r rnaseq-dims-text}
#| echo: false
#| results: asis
show_target("vig_dd_rnaseq_text")
```

```{r rnaseq-biotype-table}
#| echo: false
#| results: asis
show_target("vig_dd_rnaseq_biotype_table")
```

```{r rnaseq-sample-metadata}
#| echo: false
#| results: asis
show_target("vig_dd_rnaseq_sample_table")
```

## Treatment Data

- Treatment records extracted from the GDC API (7,184 records, 994 patients)
- Each row = one drug administration for one patient
- Includes therapeutic agent, line of therapy, and timing
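
Because each row is a single drug administration, a regimen has to be rebuilt by grouping rows per patient and line of therapy. The sketch below shows one way to do that in base R on a toy table. The column names (`patient_id`, `therapeutic_agent`, `line_of_therapy`, `days_to_treatment_start`) and all values are assumptions for illustration; check the sample rows below for the actual names.

```r
# Toy treatment table in long format: one row per drug per patient
# (illustrative values and assumed column names).
treatments <- data.frame(
  patient_id = c("P1", "P1", "P1", "P2", "P2"),
  therapeutic_agent = c("Bortezomib", "Lenalidomide", "Dexamethasone",
                        "Bortezomib", "Dexamethasone"),
  line_of_therapy = c(1, 1, 1, 1, 2),
  days_to_treatment_start = c(5, 5, 5, 12, 200),
  stringsAsFactors = FALSE
)

# Number of treatment records per patient.
records_per_patient <- table(treatments$patient_id)

# Collapse drugs into one regimen string per patient and line of therapy.
regimens <- aggregate(
  therapeutic_agent ~ patient_id + line_of_therapy,
  data = treatments,
  FUN = function(x) paste(sort(unique(x)), collapse = " + ")
)
regimens
```

Sorting the agents before pasting keeps the regimen label stable regardless of row order in the source table.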

### Sample Rows

```{r treatment-preview}
#| echo: false
#| results: asis
show_target("vig_treatment_preview")
```

## Column Name Mappings

GDC uses standardized column names. Common aliases used in analysis:

```{r column-mappings}
#| echo: false
#| results: asis
show_target("vig_dd_column_mappings")
```
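
Applying an alias mapping like the one above can be sketched with a named character vector. The aliases used here (`os_time`, `os_status`) and the helper name `rename_with_aliases()` are hypothetical and chosen only for illustration; substitute the mappings listed in the table.

```r
# Hypothetical alias map: names are GDC column names, values are analysis aliases.
alias_map <- c(
  days_to_last_follow_up = "os_time",
  vital_status           = "os_status"
)

# Rename only the columns present in the map; leave all others untouched.
rename_with_aliases <- function(df, map) {
  hits <- names(df) %in% names(map)
  names(df)[hits] <- unname(map[names(df)[hits]])
  df
}

# Toy clinical table (illustrative values only).
clinical_toy <- data.frame(
  days_to_last_follow_up = c(730, 1200),
  vital_status = c("Alive", "Dead"),
  gender = c("female", "male"),
  stringsAsFactors = FALSE
)
renamed <- rename_with_aliases(clinical_toy, alias_map)
names(renamed)
```

Keeping the mapping in one named vector means the GDC names stay recoverable: `names(alias_map)[match("os_time", alias_map)]` returns the original column name.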

## Units Reference

For the complete units reference table (age/time in days, expression quantification types), see the [Glossary units reference](glossary.html#units-reference).

Key rule: **All age and time variables are stored in DAYS by GDC.** Forgetting to convert them to years or months before reporting or modeling is a common source of errors in GDC analyses.
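
A minimal sketch of the conversion, assuming a mean year length of 365.25 days and the GDC-style column names `age_at_diagnosis` and `days_to_last_follow_up`. The helper names and the toy values are hypothetical; confirm the column names against the dictionary table before reusing this.

```r
# Assumed convention: average year length including leap years.
DAYS_PER_YEAR <- 365.25

days_to_years  <- function(days) days / DAYS_PER_YEAR
days_to_months <- function(days) days / (DAYS_PER_YEAR / 12)

# Toy clinical rows (illustrative values; NA is propagated, not dropped).
clinical_days <- data.frame(
  age_at_diagnosis       = c(23376, 18263, NA),
  days_to_last_follow_up = c(730.5, 1461, 365.25)
)

clinical_days$age_years        <- days_to_years(clinical_days$age_at_diagnosis)
clinical_days$follow_up_months <- days_to_months(clinical_days$days_to_last_follow_up)
clinical_days
```

Storing converted values in new columns, rather than overwriting the originals, keeps the day-based source values available for checks.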

## Data Sources & Links

- **GDC Portal**: <https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS>
- **GDC Data Dictionary**: <https://docs.gdc.cancer.gov/Data_Dictionary/viewer/>
- **GDC mRNA Analysis Pipeline**: <https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/>
- **MMRF CoMMpass Study**: <https://themmrf.org/research/commpass-study/>
- **GENCODE**: <https://www.gencodegenes.org/>
- **Ensembl**: <https://www.ensembl.org/>

For exploratory analysis using these variables, see the [EDA vignette](exploratory-analysis.html).

## R Code Examples

### Loading Data

Load CoMMpass data from Parquet files and query it with DuckDB.

```{r dd-load-data-code, results='asis'}
safe_tar_read("code_dd_load_data")
```

### Exploring the Dictionary

Programmatic access to variable documentation and metadata.

```{r dd-explore-dict-code, results='asis'}
safe_tar_read("code_dd_explore_dict")
```

## Data Sources

Results in this vignette are derived from the
[MMRF CoMMpass study](https://portal.gdc.cancer.gov/projects/MMRF-COMMPASS)
(MMRF-COMMPASS, ~1,143 patients), downloaded via
[TCGAbiolinks](https://bioconductor.org/packages/TCGAbiolinks/).
The pipeline runs with a configurable `sample_limit` (default 200; CI uses 20).

For full citations, data access tiers, and the distinction between
pipeline data and synthetic test data, see the
[Data Sources](data-sources.html) vignette.

## Recent Changes

Recent project commits with lines added, files changed, and change categories.

```{r changelog}
#| echo: false
#| results: asis
show_target("vig_git_changelog")
```

## Reproducibility

<details>
<summary>Session Info (click to expand)</summary>

```{r session-info, eval=TRUE}
sessionInfo()
```

</details>
