| Title: | MMRF CoMMpass Data Analysis Pipeline |
|---|---|
| Description: | A reproducible data acquisition and analysis pipeline for the MMRF CoMMpass study using Targets and Nix. |
| Authors: | John Gavin [aut, cre] |
| Maintainer: | John Gavin <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-13 17:22:08 UTC |
| Source: | https://github.com/JohnGavin/coMMpass-analysis |
Orchestrates the download of CoMMpass data from various sources including RNA-seq data from GDC, clinical data, and optionally AWS data.
acquire_commpass_data( download_rnaseq = TRUE, download_clinical = TRUE, download_aws = FALSE, sample_limit = 200, random_sample = TRUE, seed = 42, use_parquet = TRUE )acquire_commpass_data( download_rnaseq = TRUE, download_clinical = TRUE, download_aws = FALSE, sample_limit = 200, random_sample = TRUE, seed = 42, use_parquet = TRUE )
download_rnaseq |
Whether to download RNA-seq data |
download_clinical |
Whether to download clinical data |
download_aws |
Whether to download from AWS |
sample_limit |
Limit number of samples (NULL for all) |
random_sample |
If TRUE, randomly sample patients |
seed |
Random seed for sampling |
use_parquet |
If TRUE, save parquet files |
List of file paths to the downloaded data
Other data-acquisition:
download_clinical_data(),
download_gdc_rnaseq(),
download_s3_subset(),
get_commpass_clinical(),
list_s3_commpass(),
query_commpass_rna()
## Not run: # Download only clinical data results <- acquire_commpass_data( download_rnaseq = FALSE, download_clinical = TRUE, download_aws = FALSE ) ## End(Not run)## Not run: # Download only clinical data results <- acquire_commpass_data( download_rnaseq = FALSE, download_clinical = TRUE, download_aws = FALSE ) ## End(Not run)
Add gene symbol column to DE results table
annotate_de_results(de_results, annotation)annotate_de_results(de_results, annotation)
de_results |
Data frame with Ensembl IDs as rownames |
annotation |
Data frame from annotate_genes() |
DE results with gene_symbol column prepended
Other pathway:
annotate_genes(),
plot_enrichment_barplot(),
plot_enrichment_dotplot(),
plot_gsea_running_score(),
run_gsea(),
run_ora(),
run_pathway_analysis()
Maps Ensembl gene IDs to HGNC symbols and Entrez IDs. Uses msigdbr gene mapping as the primary source (always available with msigdbr). Falls back to org.Hs.eg.db + AnnotationDbi if msigdbr is not installed.
annotate_genes(ensembl_ids)annotate_genes(ensembl_ids)
ensembl_ids |
Character vector of Ensembl gene IDs (versions stripped) |
Data frame with columns: ensembl_gene, gene_symbol, entrez_gene. Unmappable IDs have NA for symbol/entrez.
Other pathway:
annotate_de_results(),
plot_enrichment_barplot(),
plot_enrichment_dotplot(),
plot_gsea_running_score(),
run_gsea(),
run_ora(),
run_pathway_analysis()
Reads clinical data from the targets store and formats it for API delivery. Optionally filters by patient IDs or variables.
api_get_clinical(patient_ids = NULL, variables = NULL, store = "_targets")api_get_clinical(patient_ids = NULL, variables = NULL, store = "_targets")
patient_ids |
Optional character vector of patient IDs to filter |
variables |
Optional character vector of column names to select |
store |
Path to targets store (default: "_targets") |
A data frame of clinical data, or NULL if unavailable
Other api:
api_get_de_results(),
api_get_pathways(),
api_get_survival(),
api_list_datasets(),
api_serve(),
generate_api_endpoint(),
generate_api_index()
Reads differential expression results from the targets store. Returns the DESeq2 results table with gene symbols.
api_get_de_results(padj_threshold = 1, lfc_threshold = 0, store = "_targets")api_get_de_results(padj_threshold = 1, lfc_threshold = 0, store = "_targets")
padj_threshold |
Filter to genes with padj below this threshold (default: 1, i.e., no filter) |
lfc_threshold |
Filter to genes with absolute log2FC above this threshold (default: 0, i.e., no filter) |
store |
Path to targets store |
A data frame of DE results, or NULL if unavailable
Other api:
api_get_clinical(),
api_get_pathways(),
api_get_survival(),
api_list_datasets(),
api_serve(),
generate_api_endpoint(),
generate_api_index()
Reads GSEA results from the targets store.
api_get_pathways(significant_only = FALSE, store = "_targets")api_get_pathways(significant_only = FALSE, store = "_targets")
significant_only |
If TRUE, return only pathways with padj < 0.05 |
store |
Path to targets store |
A data frame of pathway results, or NULL if unavailable
Other api:
api_get_clinical(),
api_get_de_results(),
api_get_survival(),
api_list_datasets(),
api_serve(),
generate_api_endpoint(),
generate_api_index()
Reads prepared survival data from the targets store.
api_get_survival(store = "_targets")api_get_survival(store = "_targets")
store |
Path to targets store |
A data frame of survival data, or NULL if unavailable
Other api:
api_get_clinical(),
api_get_de_results(),
api_get_pathways(),
api_list_datasets(),
api_serve(),
generate_api_endpoint(),
generate_api_index()
Returns metadata about all datasets available through the API, including data type, row/column counts, and download formats.
api_list_datasets()api_list_datasets()
A data frame with columns: dataset, description, format, n_rows, n_cols, last_updated
Other api:
api_get_clinical(),
api_get_de_results(),
api_get_pathways(),
api_get_survival(),
api_serve(),
generate_api_endpoint(),
generate_api_index()
## Not run: api_list_datasets() ## End(Not run)## Not run: api_list_datasets() ## End(Not run)
Starts the plumber API server for programmatic data access. The API reads pre-computed results from the targets store.
api_serve(port = 8080L, host = "127.0.0.1", store = "_targets")api_serve(port = 8080L, host = "127.0.0.1", store = "_targets")
port |
Port number (default: 8080) |
host |
Host address (default: "127.0.0.1") |
store |
Path to targets store (default: "_targets") |
Invisible NULL (starts server)
Other api:
api_get_clinical(),
api_get_de_results(),
api_get_pathways(),
api_get_survival(),
api_list_datasets(),
generate_api_endpoint(),
generate_api_index()
## Not run: # Start the API server api_serve(port = 8080) # Then query from R or curl: # curl http://localhost:8080/datasets # curl http://localhost:8080/data/clinical?variables=submitter_id,gender ## End(Not run)## Not run: # Start the API server api_serve(port = 8080) # Then query from R or curl: # curl http://localhost:8080/datasets # curl http://localhost:8080/data/clinical?variables=submitter_id,gender ## End(Not run)
Computes Fisher's exact test for each pair of cytogenetic markers to identify significant co-occurrence or mutual exclusivity patterns.
calculate_cooccurrence(cyto_data, markers = NULL)calculate_cooccurrence(cyto_data, markers = NULL)
cyto_data |
Data frame from extract_cytogenetic_data() |
markers |
Character vector of marker columns. Default auto-detects. |
Data frame with columns: marker1, marker2, odds_ratio, pvalue, padj (BH-corrected), n_both, n_either, tendency (co-occurrence/exclusive/none)
Other cytogenetics:
compute_riss(),
extract_cytogenetic_data(),
plot_cooccurrence_heatmap(),
plot_cytogenetic_oncoprint(),
plot_expression_by_subtype(),
summarize_cytogenetics()
## Not run: cyto <- arrow::read_parquet("data/raw/clinical/cytogenetic_data.parquet") cooc <- calculate_cooccurrence(cyto) ## End(Not run)## Not run: cyto <- arrow::read_parquet("data/raw/clinical/cytogenetic_data.parquet") cooc <- calculate_cooccurrence(cyto) ## End(Not run)
Calculate QC metrics for RNA-seq data
calculate_qc_metrics(se_data)calculate_qc_metrics(se_data)
se_data |
A SummarizedExperiment object with a "counts" assay |
A data frame with QC metrics per sample
Checks whether the covariates used in a model match any of the DAG-implied minimal sufficient adjustment sets.
check_adjustment( model_covariates, exposure = "cytogenetic_risk", outcome = "overall_survival", dag = NULL )check_adjustment( model_covariates, exposure = "cytogenetic_risk", outcome = "overall_survival", dag = NULL )
model_covariates |
Character vector of covariates used in the model. |
exposure |
Character. Exposure variable. |
outcome |
Character. Outcome variable. |
dag |
A 'dagitty' DAG object. |
A list with '$sufficient' (logical), '$model_covariates', '$adjustment_sets', and '$recommendation'.
Check package dependencies
check_dependencies()check_dependencies()
Standardizes clinical data column names and formats
clean_clinical_data(clinical_raw)clean_clinical_data(clinical_raw)
clinical_raw |
Raw clinical data frame |
Cleaned clinical data frame
Other data-cleaning:
clean_expression_data(),
clean_treatment_data(),
integrate_clinical_expression(),
summarize_treatment()
Standardizes expression data format and adds metadata
clean_expression_data(expr_raw)clean_expression_data(expr_raw)
expr_raw |
Raw expression data (matrix or data frame) |
Cleaned expression matrix with gene names as rownames
Other data-cleaning:
clean_clinical_data(),
clean_treatment_data(),
integrate_clinical_expression(),
summarize_treatment()
Standardizes regimen names, derives regimen class, and creates an ordered response factor from raw CoMMpass treatment response data.
clean_treatment_data(trtresp_raw)clean_treatment_data(trtresp_raw)
trtresp_raw |
Data frame of raw treatment response data with columns including patient/submitter ID, treatment line, regimen description, best response, and transplant status |
Cleaned data frame with standardized columns: 'patient_id', 'treatment_line', 'regimen_name', 'regimen_class', 'best_response' (ordered factor), 'stem_cell_transplant' (logical), 'treatment_start_days' (integer)
Other data-cleaning:
clean_clinical_data(),
clean_expression_data(),
integrate_clinical_expression(),
summarize_treatment()
Encodes domain-knowledge causal assumptions for the CoMMpass multiple myeloma analysis. Nodes represent measured or latent variables; edges represent assumed causal effects.
commpass_dag()commpass_dag()
A 'dagitty::dagitty' DAG object
Compute PCA from transformed expression data
compute_pca(expr_matrix, n_top = 500L, metadata = NULL)compute_pca(expr_matrix, n_top = 500L, metadata = NULL)
expr_matrix |
Numeric matrix (genes x samples) of transformed expression values (e.g., VST or logCPM) |
n_top |
Number of most variable genes to use (default 500) |
metadata |
Optional data frame of sample annotations to merge with PCA coordinates |
List with components: coords (data frame with PC1, PC2, ...), var_explained (numeric vector of variance proportions), n_genes_used
Other differential-expression:
plot_heatmap_de(),
plot_ma(),
plot_pca(),
plot_volcano(),
run_deseq2(),
run_vst(),
summarize_de_methods()
Classifies patients into R-ISS stages based on ISS stage, cytogenetic risk group, and LDH level (Palumbo et al., JCO 2015).
compute_riss(iss_stage, risk_group, ldh = NA_real_, ldh_uln = 250)compute_riss(iss_stage, risk_group, ldh = NA_real_, ldh_uln = 250)
iss_stage |
Character vector: "Stage I", "Stage II", "Stage III" |
risk_group |
Character vector: "High", "Standard", "high", "standard" |
ldh |
Numeric vector: LDH value (U/L), or NA |
ldh_uln |
Numeric scalar: upper limit of normal for LDH (default 250) |
Character vector: "R-ISS I", "R-ISS II", "R-ISS III", or NA
Other cytogenetics:
calculate_cooccurrence(),
extract_cytogenetic_data(),
plot_cooccurrence_heatmap(),
plot_cytogenetic_oncoprint(),
plot_expression_by_subtype(),
summarize_cytogenetics()
compute_riss("Stage I", "Standard", ldh = 200) compute_riss("Stage III", "High", ldh = 300) compute_riss( c("Stage I", "Stage III", "Stage II"), c("Standard", "High", "Standard"), ldh = c(200, 300, NA) )compute_riss("Stage I", "Standard", ldh = 200) compute_riss("Stage III", "High", ldh = 300) compute_riss( c("Stage I", "Stage III", "Stage II"), c("Standard", "High", "Standard"), ldh = c(200, 300, NA) )
Computes Pearson or Spearman correlation between two genes in an expression matrix.
correlate_genes(expr_matrix, gene_x, gene_y, method = c("pearson", "spearman"))correlate_genes(expr_matrix, gene_x, gene_y, method = c("pearson", "spearman"))
expr_matrix |
Numeric matrix (genes x samples) of transformed expression values (e.g., VST) |
gene_x |
Gene name for x-axis |
gene_y |
Gene name for y-axis |
method |
Correlation method: "pearson" (default) or "spearman" |
List with components: gene_x, gene_y, method, estimate, p_value, n_samples, expr_x (numeric vector), expr_y (numeric vector)
Other gene-correlation:
correlate_genes_batch(),
plot_gene_correlation()
Correlates a target gene against a vector of candidate genes and returns a summary data frame sorted by absolute correlation.
correlate_genes_batch( expr_matrix, target_gene, candidate_genes, method = c("pearson", "spearman") )correlate_genes_batch( expr_matrix, target_gene, candidate_genes, method = c("pearson", "spearman") )
expr_matrix |
Numeric matrix (genes x samples) |
target_gene |
Gene name to correlate against |
candidate_genes |
Character vector of gene names to test |
method |
Correlation method: "pearson" or "spearman" |
Data frame with columns: gene, estimate, p_value, padj, n_samples, method
Other gene-correlation:
correlate_genes(),
plot_gene_correlation()
Create project directories
create_project_dirs(base_dir = ".")create_project_dirs(base_dir = ".")
base_dir |
Base directory path (default: ".") |
Generate a nicely formatted summary statistics table for numeric variables
create_summary_table(data, vars = NULL)create_summary_table(data, vars = NULL)
data |
Data frame |
vars |
Character vector of variable names (NULL for all numeric) |
Data frame with summary statistics
Other utilities:
example_data(),
export_h5ad(),
format_file_size(),
format_with_commas(),
gene_report(),
strip_plotly()
Download data from AWS S3 open access bucket
download_aws_data( bucket_name = "gdc-mmrf-commpass-phs000748-2-open", prefix = NULL, data_dir = "data/raw/aws", region = "us-east-1" )download_aws_data( bucket_name = "gdc-mmrf-commpass-phs000748-2-open", prefix = NULL, data_dir = "data/raw/aws", region = "us-east-1" )
bucket_name |
S3 bucket name |
prefix |
File prefix to filter |
data_dir |
Local directory for downloads |
region |
AWS region |
Directory path as character string
Downloads clinical and biospecimen data from the Genomic Data Commons (GDC) for the specified project. Data is saved in parquet and RDS formats.
download_clinical_data( project_id = "MMRF-COMMPASS", data_dir = "data/raw/clinical", use_parquet = TRUE )download_clinical_data( project_id = "MMRF-COMMPASS", data_dir = "data/raw/clinical", use_parquet = TRUE )
project_id |
Project identifier (default: "MMRF-COMMPASS") |
data_dir |
Directory to save data |
use_parquet |
If TRUE, also save parquet files (default TRUE) |
Path to the directory containing the saved data files
Other data-acquisition:
acquire_commpass_data(),
download_gdc_rnaseq(),
download_s3_subset(),
get_commpass_clinical(),
list_s3_commpass(),
query_commpass_rna()
## Not run: # Download clinical data clinical_dir <- download_clinical_data() # Load from parquet clinical <- arrow::read_parquet(file.path(clinical_dir, "clinical_data.parquet")) ## End(Not run)## Not run: # Download clinical data clinical_dir <- download_clinical_data() # Load from parquet clinical <- arrow::read_parquet(file.path(clinical_dir, "clinical_data.parquet")) ## End(Not run)
Downloads RNA-seq gene expression data from the Genomic Data Commons (GDC) for the specified project. Data is saved as a SummarizedExperiment RDS and as parquet files (counts, sample metadata, gene metadata).
download_gdc_rnaseq( project_id = "MMRF-COMMPASS", data_dir = "data/raw/gdc", sample_limit = 200, random_sample = TRUE, seed = 42, use_parquet = TRUE )download_gdc_rnaseq( project_id = "MMRF-COMMPASS", data_dir = "data/raw/gdc", sample_limit = 200, random_sample = TRUE, seed = 42, use_parquet = TRUE )
project_id |
Project identifier (default: "MMRF-COMMPASS") |
data_dir |
Directory to save data |
sample_limit |
Maximum number of samples (NULL for all, default 200) |
random_sample |
If TRUE and sample_limit is set, randomly sample patients using seed for reproducibility (default TRUE) |
seed |
Random seed for reproducible sampling (default 42) |
use_parquet |
If TRUE, also save parquet files alongside RDS (default TRUE) |
Path to the saved RDS file containing the SummarizedExperiment
Other data-acquisition:
acquire_commpass_data(),
download_clinical_data(),
download_s3_subset(),
get_commpass_clinical(),
list_s3_commpass(),
query_commpass_rna()
## Not run: # Download 200 random samples with parquet output rnaseq_file <- download_gdc_rnaseq(sample_limit = 200) # Load the SummarizedExperiment se_data <- readRDS(rnaseq_file) # Or load parquet directly counts <- arrow::read_parquet("data/raw/gdc/rnaseq_counts.parquet") ## End(Not run)## Not run: # Download 200 random samples with parquet output rnaseq_file <- download_gdc_rnaseq(sample_limit = 200) # Load the SummarizedExperiment se_data <- readRDS(rnaseq_file) # Or load parquet directly counts <- arrow::read_parquet("data/raw/gdc/rnaseq_counts.parquet") ## End(Not run)
Downloads a subset of RNA-seq files from the public MMRF CoMMpass S3 bucket. This function is useful for testing and development with a small sample of data before downloading the full dataset. Files are downloaded using anonymous access to the public bucket.
download_s3_subset(s3_paths, dest_dir = "data/raw/rna_seq", n = 3)download_s3_subset(s3_paths, dest_dir = "data/raw/rna_seq", n = 3)
s3_paths |
Character vector of S3 object keys (file paths) to download from |
dest_dir |
Destination directory for downloaded files (default: "data/raw/rna_seq") |
n |
Number of files to download (default: 3) |
Character vector of successfully downloaded file paths
Other data-acquisition:
acquire_commpass_data(),
download_clinical_data(),
download_gdc_rnaseq(),
get_commpass_clinical(),
list_s3_commpass(),
query_commpass_rna()
## Not run: # List available files s3_files <- list_s3_commpass() # Download first 3 RNA-seq files downloaded <- download_s3_subset(s3_files, n = 3) # Check what was downloaded basename(downloaded) ## End(Not run)## Not run: # List available files s3_files <- list_s3_commpass() # Download first 3 RNA-seq files downloaded <- download_s3_subset(s3_files, n = 3) # Check what was downloaded basename(downloaded) ## End(Not run)
Returns small **synthetic** datasets (no real patient data) for interactive exploration and testing. The data exercises the full pipeline: QC, cleaning, survival analysis, and cytogenetic classification.
example_data()example_data()
The SummarizedExperiment is reconstructed at load time from stored plain-R components (matrix + data.frames), so the RDS files have no S4 class dependency.
A named list with three elements:
A [SummarizedExperiment::SummarizedExperiment] with 50 genes x 20 samples. Assay named '"unstranded"' (matches GDC format). rowData contains 'gene_id' (Ensembl-format with version suffix) and 'gene_type'. colData contains 'submitter_id' and 'sample_type'.
A data.frame (20 patients) with columns 'submitter_id', 'vital_status', 'days_to_death', 'days_to_last_follow_up', 'age_at_diagnosis' (in days), 'gender', 'iss_stage', 'heavy_chain' (myeloma isotype), 'light_chain' (Kappa/Lambda), 'ecog_status' (0-4), 'ldh' (U/L), 'b2m' (beta-2 microglobulin, mg/L), 'albumin' (g/dL), 'flc_kappa' (free kappa light chain, mg/L), 'flc_lambda' (free lambda light chain, mg/L), 'hemoglobin' (g/dL), 'creatinine' (mg/dL), 'calcium' (corrected, mg/dL), 'platelets' (10^9/L).
A data.frame (20 patients) with columns 'patient_id', 't_4_14', 't_11_14', 't_14_16', 'del_17p', 'gain_1q', 'risk_group'.
A data.frame of treatment lines (1-3 per patient) with columns 'patient_id', 'treatment_line', 'regimen_name', 'regimen_class', 'best_response' (ordered factor), 'stem_cell_transplant' (logical), 'treatment_start_days' (integer).
Other utilities:
create_summary_table(),
export_h5ad(),
format_file_size(),
format_with_commas(),
gene_report(),
strip_plotly()
d <- example_data() # QC metrics qc <- calculate_qc_metrics(d$rnaseq_se) head(qc) # Clinical cleaning clin <- clean_clinical_data(d$clinical) # Survival data with cytogenetic markers surv <- prepare_survival_data(d$clinical, cyto_file = d$cytogenetic)d <- example_data() # QC metrics qc <- calculate_qc_metrics(d$rnaseq_se) head(qc) # Clinical cleaning clin <- clean_clinical_data(d$clinical) # Survival data with cytogenetic markers surv <- prepare_survival_data(d$clinical, cyto_file = d$cytogenetic)
Converts a [SummarizedExperiment::SummarizedExperiment] to AnnData format and writes an H5AD file for Python/scanpy interoperability.
export_h5ad(se, output_path, assay_name = NULL)export_h5ad(se, output_path, assay_name = NULL)
se |
A [SummarizedExperiment::SummarizedExperiment] object |
output_path |
Path for the output .h5ad file |
assay_name |
Which assay to export (default "unstranded", falls back to "counts" then first available) |
The output path (invisibly)
Other utilities:
create_summary_table(),
example_data(),
format_file_size(),
format_with_commas(),
gene_report(),
strip_plotly()
## Not run: d <- example_data() export_h5ad(d$rnaseq_se, "commpass.h5ad") ## End(Not run)## Not run: d <- example_data() export_h5ad(d$rnaseq_se, "commpass.h5ad") ## End(Not run)
Parses clinical data from GDC to extract translocation and copy number alteration status. GDC clinical data includes FISH results for standard myeloma cytogenetic markers.
extract_cytogenetic_data(clinical_dir)extract_cytogenetic_data(clinical_dir)
clinical_dir |
Path to clinical data directory (from download_clinical_data) |
Path to saved cytogenetic parquet file
Other cytogenetics:
calculate_cooccurrence(),
compute_riss(),
plot_cooccurrence_heatmap(),
plot_cytogenetic_oncoprint(),
plot_expression_by_subtype(),
summarize_cytogenetics()
## Not run: cyto_file <- extract_cytogenetic_data("data/raw/clinical") cyto <- arrow::read_parquet(cyto_file) ## End(Not run)## Not run: cyto_file <- extract_cytogenetic_data("data/raw/clinical") cyto <- arrow::read_parquet(cyto_file) ## End(Not run)
Generates a number-at-risk table at fixed time points from a [run_kaplan_meier()] result. Useful for pre-computing Shiny display data and static API endpoints.
extract_risk_table(km_result, times = c(0, 365, 730, 1095, 1460))extract_risk_table(km_result, times = c(0, 365, 730, 1095, 1460))
km_result |
List returned by [run_kaplan_meier()] |
times |
Numeric vector of time points (in days) at which to evaluate the risk table. Defaults to 'c(0, 365, 730, 1095, 1460)' (0-4 years). |
Data frame with columns: time, n_risk, n_event, n_censor, survival, and optionally strata (if stratified KM)
Other survival:
plot_forest(),
plot_km(),
prepare_survival_data(),
run_cox_regression(),
run_kaplan_meier(),
run_km_by_expression(),
run_km_by_markers()
Filter low-quality samples and genes
filter_low_quality(se_data, min_counts = 10, min_samples = 3)filter_low_quality(se_data, min_counts = 10, min_samples = 3)
se_data |
A SummarizedExperiment object |
min_counts |
Minimum count threshold per gene |
min_samples |
Minimum samples with counts above threshold |
Filtered SummarizedExperiment
Find consensus DE genes across methods
find_consensus_genes(de_results_list, padj_threshold = 0.05, lfc_threshold = 1)find_consensus_genes(de_results_list, padj_threshold = 0.05, lfc_threshold = 1)
de_results_list |
List of DE result objects from different methods |
padj_threshold |
Adjusted p-value threshold (default: 0.05) |
lfc_threshold |
Log2 fold change threshold (default: 1) |
List with consensus gene information
Converts file sizes from bytes to human-readable format with appropriate units
format_file_size(size_bytes, digits = 1)format_file_size(size_bytes, digits = 1)
size_bytes |
Numeric vector of file sizes in bytes |
digits |
Number of decimal places to show (default: 1) |
Character vector with formatted file sizes
Other utilities:
create_summary_table(),
example_data(),
export_h5ad(),
format_with_commas(),
gene_report(),
strip_plotly()
format_file_size(c(1024, 1048576, 5000877192)) # Returns: "1.0 KB", "1.0 MB", "4.7 GB"format_file_size(c(1024, 1048576, 5000877192)) # Returns: "1.0 KB", "1.0 MB", "4.7 GB"
Adds commas as thousands separators to large numbers
format_with_commas(x)format_with_commas(x)
x |
Numeric value or vector |
Character vector with formatted numbers
Other utilities:
create_summary_table(),
example_data(),
export_h5ad(),
format_file_size(),
gene_report(),
strip_plotly()
format_with_commas(1234567) # Returns: "1,234,567"format_with_commas(1234567) # Returns: "1,234,567"
Renders the parameterized 'gene-report.qmd' vignette for a specific gene, producing a self-contained HTML report with expression distribution, survival stratification, cytogenetic context, and gene-gene correlations.
gene_report( gene, output_dir = "results/gene_reports", vst_matrix = NULL, surv_data = NULL, cyto_data = NULL )gene_report( gene, output_dir = "results/gene_reports", vst_matrix = NULL, surv_data = NULL, cyto_data = NULL )
gene |
Gene name (e.g., "CD70", "TP53") |
output_dir |
Directory for the rendered report (default "results/gene_reports/") |
vst_matrix |
Optional pre-loaded VST matrix (genes x samples). If NULL, the function attempts to load from the targets store. |
surv_data |
Optional pre-loaded survival data frame. If NULL, attempts to load from the targets store. |
cyto_data |
Optional pre-loaded cytogenetic data frame |
Path to the rendered HTML file (invisibly)
Other utilities:
create_summary_table(),
example_data(),
export_h5ad(),
format_file_size(),
format_with_commas(),
strip_plotly()
## Not run: gene_report("CD70") gene_report("TP53", output_dir = "results/") ## End(Not run)## Not run: gene_report("CD70") gene_report("TP53", output_dir = "results/") ## End(Not run)
Wraps a data frame in standard API envelope with metadata and serializes to JSON. Used by [plan_api] targets to produce static JSON files.
generate_api_endpoint(data, dataset_name, description)generate_api_endpoint(data, dataset_name, description)
data |
A data frame to serialize, or NULL if data is unavailable |
dataset_name |
Short name for this dataset (e.g. "clinical") |
description |
Human-readable description |
A JSON string (character scalar) with 'metadata' and 'data' fields
Other api:
api_get_clinical(),
api_get_de_results(),
api_get_pathways(),
api_get_survival(),
api_list_datasets(),
api_serve(),
generate_api_index()
## Not run: df <- data.frame(a = 1:3, b = letters[1:3]) json <- generate_api_endpoint(df, "example", "Example data") cat(json) ## End(Not run)## Not run: df <- data.frame(a = 1:3, b = letters[1:3]) json <- generate_api_endpoint(df, "example", "Example data") cat(json) ## End(Not run)
Creates the index.json content listing all available endpoints, their descriptions, and URLs for the static JSON API.
generate_api_index( base_url = "https://JohnGavin.github.io/coMMpass-analysis/api/v1" )generate_api_index( base_url = "https://JohnGavin.github.io/coMMpass-analysis/api/v1" )
base_url |
Base URL for the static API. Defaults to the GitHub Pages URL. |
A list suitable for JSON serialization containing endpoint catalogue, version, and generated timestamp.
Other api:
api_get_clinical(),
api_get_de_results(),
api_get_pathways(),
api_get_survival(),
api_list_datasets(),
api_serve(),
generate_api_endpoint()
## Not run: idx <- generate_api_index() cat(jsonlite::toJSON(idx, pretty = TRUE, auto_unbox = TRUE)) ## End(Not run)## Not run: idx <- generate_api_index() cat(jsonlite::toJSON(idx, pretty = TRUE, auto_unbox = TRUE)) ## End(Not run)
Generate summary report
generate_summary_report( qc_metrics, de_genes, survival, pathways, output_dir = "results/reports" )generate_summary_report( qc_metrics, de_genes, survival, pathways, output_dir = "results/reports" )
qc_metrics |
QC metrics data frame |
de_genes |
DE results with consensus gene information |
survival |
Survival analysis results |
pathways |
Pathway analysis results |
output_dir |
Directory for output report |
Path to generated report
Uses the DAG to identify the minimal sufficient adjustment set for estimating the causal effect of 'exposure' on 'outcome'.
get_adjustment_sets( exposure = "cytogenetic_risk", outcome = "overall_survival", dag = NULL )get_adjustment_sets( exposure = "cytogenetic_risk", outcome = "overall_survival", dag = NULL )
exposure |
Character. The exposure variable name in the DAG. |
outcome |
Character. The outcome variable name in the DAG. |
dag |
A 'dagitty' DAG object. Defaults to [commpass_dag()]. |
A list of character vectors, each a minimal sufficient adjustment set.
Retrieves clinical data for the MMRF CoMMpass study from the Genomic Data Commons (GDC). This includes patient demographics, disease characteristics, treatment information, and outcomes data.
get_commpass_clinical()get_commpass_clinical()
A data frame containing clinical data for CoMMpass patients
Other data-acquisition:
acquire_commpass_data(),
download_clinical_data(),
download_gdc_rnaseq(),
download_s3_subset(),
list_s3_commpass(),
query_commpass_rna()
## Not run: # Get clinical data clinical <- get_commpass_clinical() # View first few rows head(clinical) # Check available columns names(clinical) ## End(Not run)## Not run: # Get clinical data clinical <- get_commpass_clinical() # View first few rows head(clinical) # Check available columns names(clinical) ## End(Not run)
Returns a tibble documenting all known variables in the CoMMpass dataset, including clinical, biospecimen, and RNA-seq data. Each variable includes its category, data type, units, description, typical range, and a link to the GDC data dictionary.
get_commpass_data_dictionary()get_commpass_data_dictionary()
A tibble with columns: variable, category, data_type, units, description, typical_range, gdc_link
Other data-dictionary:
get_variable_docs()
dd <- get_commpass_data_dictionary() # Filter to clinical variables dplyr::filter(dd, category == "clinical")dd <- get_commpass_data_dictionary() # Filter to clinical variables dplyr::filter(dd, category == "clinical")
Returns a lazy 'dplyr::tbl()' backed by DuckDB, reading from parquet files. The connection is managed by the caller and must be disconnected when done.
get_commpass_tbl( data_type = c("clinical", "biospecimen", "rnaseq_counts", "rnaseq_sample_metadata", "rnaseq_gene_metadata"), data_dir = "data/raw", con = NULL )get_commpass_tbl( data_type = c("clinical", "biospecimen", "rnaseq_counts", "rnaseq_sample_metadata", "rnaseq_gene_metadata"), data_dir = "data/raw", con = NULL )
data_type |
One of "clinical", "biospecimen", "rnaseq_counts", "rnaseq_sample_metadata", "rnaseq_gene_metadata" |
data_dir |
Base directory containing parquet files |
con |
An existing DuckDB connection. If NULL, a new in-memory connection is created and returned as an attribute of the result. |
A lazy dbplyr tbl. If con was NULL, the DuckDB connection is stored as attr(result, "connection") - caller must disconnect it.
Other storage:
query_commpass_parquet()
## Not run: # Create connection and query con <- DBI::dbConnect(duckdb::duckdb()) clinical_tbl <- get_commpass_tbl("clinical", con = con) # Chain dplyr operations (lazy - not executed until collect) result <- clinical_tbl |> dplyr::filter(gender == "female") |> dplyr::select(submitter_id, age_at_diagnosis, vital_status) |> dplyr::collect() # Clean up DBI::dbDisconnect(con, shutdown = TRUE) ## End(Not run)## Not run: # Create connection and query con <- DBI::dbConnect(duckdb::duckdb()) clinical_tbl <- get_commpass_tbl("clinical", con = con) # Chain dplyr operations (lazy - not executed until collect) result <- clinical_tbl |> dplyr::filter(gender == "female") |> dplyr::select(submitter_id, age_at_diagnosis, vital_status) |> dplyr::collect() # Clean up DBI::dbDisconnect(con, shutdown = TRUE) ## End(Not run)
GDC STAR-Counts uses 'unstranded', but fallback to 'counts' if present
get_counts_assay(se)get_counts_assay(se)
se |
SummarizedExperiment object |
Counts matrix
Returns detailed documentation for a specific variable, including scientific context, calculation methods, and usage notes.
get_variable_docs(variable)get_variable_docs(variable)
variable |
Character string naming the variable to document |
A list with elements: variable, description, scientific_context, calculation, usage_notes, references
Other data-dictionary:
get_commpass_data_dictionary()
docs <- get_variable_docs("age_at_diagnosis") cat(docs$usage_notes)docs <- get_variable_docs("age_at_diagnosis") cat(docs$usage_notes)
Combines clinical and expression data with consistent sample IDs
integrate_clinical_expression(clinical_clean, expr_clean)integrate_clinical_expression(clinical_clean, expr_clean)
clinical_clean |
Cleaned clinical data |
expr_clean |
Cleaned expression data |
List with matched clinical and expression data
Other data-cleaning:
clean_clinical_data(),
clean_expression_data(),
clean_treatment_data(),
summarize_treatment()
Lists files available in the public AWS S3 bucket containing MMRF CoMMpass data. The bucket contains RNA-seq, genomic, and clinical data files. This function uses anonymous access to the public bucket.
list_s3_commpass(prefix = "")list_s3_commpass(prefix = "")
prefix |
Optional prefix to filter files (e.g., "RNA-seq/", "clinical/") |
Character vector of S3 object keys (file paths)
Other data-acquisition:
acquire_commpass_data(),
download_clinical_data(),
download_gdc_rnaseq(),
download_s3_subset(),
get_commpass_clinical(),
query_commpass_rna()
## Not run: # List all available files (limited to first 100) files <- list_s3_commpass() # List only RNA-seq files rna_files <- list_s3_commpass(prefix = "RNA-seq/") # Count available files length(files) ## End(Not run)## Not run: # List all available files (limited to first 100) files <- list_s3_commpass() # List only RNA-seq files rna_files <- list_s3_commpass(prefix = "RNA-seq/") # Count available files length(files) ## End(Not run)
Normalize RNA-seq data
normalize_rnaseq(se_data, method = "TMM")normalize_rnaseq(se_data, method = "TMM")
se_data |
A SummarizedExperiment object with a "counts" assay |
method |
Normalization method (default: "TMM") |
SummarizedExperiment with added "logCPM" assay
Displays pairwise co-occurrence patterns as a heatmap. Color indicates -log10(p-value) direction: red for co-occurrence, blue for mutual exclusivity.
plot_cooccurrence_heatmap(cooccurrence, title = "Cytogenetic Co-occurrence")plot_cooccurrence_heatmap(cooccurrence, title = "Cytogenetic Co-occurrence")
cooccurrence |
Data frame from calculate_cooccurrence() |
title |
Plot title |
A ggplot object
Other cytogenetics:
calculate_cooccurrence(),
compute_riss(),
extract_cytogenetic_data(),
plot_cytogenetic_oncoprint(),
plot_expression_by_subtype(),
summarize_cytogenetics()
## Not run: cooc <- calculate_cooccurrence(cyto) plot_cooccurrence_heatmap(cooc) ## End(Not run)## Not run: cooc <- calculate_cooccurrence(cyto) plot_cooccurrence_heatmap(cooc) ## End(Not run)
Creates an oncoprint-style heatmap showing cytogenetic alterations across patients. Rows are alterations, columns are patients. Uses ggplot2 for portability (no ComplexHeatmap dependency).
plot_cytogenetic_oncoprint( cyto_data, markers = NULL, sort_by = c("frequency", "risk"), title = "Cytogenetic Landscape" )plot_cytogenetic_oncoprint( cyto_data, markers = NULL, sort_by = c("frequency", "risk"), title = "Cytogenetic Landscape" )
cyto_data |
Data frame from extract_cytogenetic_data() with columns: patient_id, iss_stage, t_4_14, t_11_14, t_14_16, del_17p, gain_1q, etc. |
markers |
Character vector of marker columns to include. Default uses all available markers. |
sort_by |
How to sort patients. One of "frequency" (default) or "risk". |
title |
Plot title. |
A ggplot object
Other cytogenetics:
calculate_cooccurrence(),
compute_riss(),
extract_cytogenetic_data(),
plot_cooccurrence_heatmap(),
plot_expression_by_subtype(),
summarize_cytogenetics()
## Not run: cyto <- arrow::read_parquet("data/raw/clinical/cytogenetic_data.parquet") plot_cytogenetic_oncoprint(cyto) ## End(Not run)## Not run: cyto <- arrow::read_parquet("data/raw/clinical/cytogenetic_data.parquet") plot_cytogenetic_oncoprint(cyto) ## End(Not run)
Renders the DAG using ggdag with sensible defaults.
plot_dag(dag = NULL, title = "CoMMpass Causal DAG")plot_dag(dag = NULL, title = "CoMMpass Causal DAG")
dag |
A 'dagitty' DAG object. Defaults to [commpass_dag()]. |
title |
Plot title. |
A ggplot object.
Creates a horizontal bar plot of the top N enriched pathways, colored by direction (GSEA) or significance (ORA).
plot_enrichment_barplot(enrich_result, n = 15L, title = "Enrichment Bar Plot")plot_enrichment_barplot(enrich_result, n = 15L, title = "Enrichment Bar Plot")
enrich_result |
Data frame with pathway enrichment results |
n |
Number of top pathways to show (default 15) |
title |
Plot title |
A ggplot2 object
Other pathway:
annotate_de_results(),
annotate_genes(),
plot_enrichment_dotplot(),
plot_gsea_running_score(),
run_gsea(),
run_ora(),
run_pathway_analysis()
Creates a dot plot of the top N enriched pathways. Dot size represents gene set size, color represents significance (adjusted p-value or NES).
plot_enrichment_dotplot( enrich_result, n = 15L, color_by = "padj", title = "Enrichment Dot Plot" )plot_enrichment_dotplot( enrich_result, n = 15L, color_by = "padj", title = "Enrichment Dot Plot" )
enrich_result |
Data frame with pathway enrichment results. Must contain columns: pathway, padj, and one of: NES (GSEA), overlap (ORA). |
n |
Number of top pathways to show (default 15) |
color_by |
Which metric to color by: "padj" (default) or "NES" |
title |
Plot title |
A ggplot2 object
Other pathway:
annotate_de_results(),
annotate_genes(),
plot_enrichment_barplot(),
plot_gsea_running_score(),
run_gsea(),
run_ora(),
run_pathway_analysis()
Creates violin + box plots showing gene expression stratified by cytogenetic marker status (positive vs negative). Adds Wilcoxon or t-test p-value annotation.
plot_expression_by_subtype( expr_matrix, cyto_data, gene, markers = NULL, test = c("wilcox", "t.test") )plot_expression_by_subtype( expr_matrix, cyto_data, gene, markers = NULL, test = c("wilcox", "t.test") )
expr_matrix |
Numeric matrix (genes x samples) of transformed expression values (e.g., VST). Column names should match 'cyto_data$patient_id'. |
cyto_data |
Data frame with 'patient_id' and marker columns (values: "positive"/"negative"/NA) |
gene |
Gene name (rowname of 'expr_matrix') |
markers |
Character vector of marker columns. Default auto-detects. |
test |
Statistical test: "wilcox" (default) or "t.test" |
A ggplot object with faceted violin/box plots, one panel per marker
Other cytogenetics:
calculate_cooccurrence(),
compute_riss(),
extract_cytogenetic_data(),
plot_cooccurrence_heatmap(),
plot_cytogenetic_oncoprint(),
summarize_cytogenetics()
Creates a forest plot from Cox regression results, showing hazard ratios with 95
plot_forest(cox_result, title = "Cox Regression Forest Plot")plot_forest(cox_result, title = "Cox Regression Forest Plot")
cox_result |
List from [run_cox_regression()] with 'hazard_ratios', 'concordance', 'n', 'n_events' |
title |
Plot title |
A ggplot object
Other survival:
extract_risk_table(),
plot_km(),
prepare_survival_data(),
run_cox_regression(),
run_kaplan_meier(),
run_km_by_expression(),
run_km_by_markers()
Creates a scatter plot with regression line, correlation coefficient, and p-value annotation from a [correlate_genes()] result.
plot_gene_correlation(cor_result, title = NULL)plot_gene_correlation(cor_result, title = NULL)
cor_result |
List from [correlate_genes()] |
title |
Plot title (default auto-generated) |
A ggplot object
Other gene-correlation:
correlate_genes(),
correlate_genes_batch()
Shows the running enrichment score for a specific gene set from fgsea results, along with the ranked gene positions.
plot_gsea_running_score( gsea_result, gene_set_name, gene_sets = NULL, title = NULL )plot_gsea_running_score( gsea_result, gene_set_name, gene_sets = NULL, title = NULL )
gsea_result |
List from run_gsea() containing results and ranked_genes |
gene_set_name |
Name of the gene set to plot |
gene_sets |
Named list of gene sets (required for tick marks). If NULL, attempts to retrieve from msigdbr using the collection. |
title |
Plot title (default: gene set name) |
A ggplot2 object
Other pathway:
annotate_de_results(),
annotate_genes(),
plot_enrichment_barplot(),
plot_enrichment_dotplot(),
run_gsea(),
run_ora(),
run_pathway_analysis()
Creates a heatmap of the top differentially expressed genes using z-score scaled expression values.
plot_heatmap_de( expr_matrix, de_results, n_genes = 50L, annotation_df = NULL, title = "Top DE Genes" )plot_heatmap_de( expr_matrix, de_results, n_genes = 50L, annotation_df = NULL, title = "Top DE Genes" )
expr_matrix |
Numeric matrix of transformed expression (genes x samples) |
de_results |
Data frame with DE results (rownames = gene IDs) |
n_genes |
Number of top DE genes to show (default 50) |
annotation_df |
Optional data frame of sample annotations for column annotation |
title |
Plot title |
A ggplot2 object (tile-based heatmap)
Other differential-expression:
compute_pca(),
plot_ma(),
plot_pca(),
plot_volcano(),
run_deseq2(),
run_vst(),
summarize_de_methods()
Creates a ggplot2-based KM survival curve from a survfit object. Supports stratified and unstratified curves with optional log-rank p-value.
plot_km( km_result, title = "Kaplan-Meier Survival Curve", xlab = "Time (days)", time_breaks = NULL )plot_km( km_result, title = "Kaplan-Meier Survival Curve", xlab = "Time (days)", time_breaks = NULL )
km_result |
List from [run_kaplan_meier()] with 'fit', 'logrank_p', 'median_survival', 'n_per_group', 'strata' |
title |
Plot title |
xlab |
X-axis label |
time_breaks |
Sequence of time breaks for x-axis (in days). Default marks every 365 days. |
A ggplot object
Other survival:
extract_risk_table(),
plot_forest(),
prepare_survival_data(),
run_cox_regression(),
run_kaplan_meier(),
run_km_by_expression(),
run_km_by_markers()
Plots mean expression (x-axis) vs log2 fold change (y-axis), a standard diagnostic for RNA-seq DE analysis.
plot_ma( de_results, lfc_threshold = 1, padj_threshold = 0.05, title = "MA Plot" )plot_ma( de_results, lfc_threshold = 1, padj_threshold = 0.05, title = "MA Plot" )
de_results |
Data frame with DE results |
lfc_threshold |
Log2 fold-change threshold for coloring (default 1) |
padj_threshold |
Adjusted p-value threshold (default 0.05) |
title |
Plot title |
A ggplot2 object
Other differential-expression:
compute_pca(),
plot_heatmap_de(),
plot_pca(),
plot_volcano(),
run_deseq2(),
run_vst(),
summarize_de_methods()
Creates a PCA scatter plot from pre-computed PCA data.
plot_pca(pca_data, color_by = NULL, shape_by = NULL, title = "PCA")plot_pca(pca_data, color_by = NULL, shape_by = NULL, title = "PCA")
pca_data |
List from compute_pca() with coords and var_explained |
color_by |
Column name in coords to color by (default NULL) |
shape_by |
Column name in coords to use for shape (default NULL) |
title |
Plot title |
A ggplot2 object
Other differential-expression:
compute_pca(),
plot_heatmap_de(),
plot_ma(),
plot_volcano(),
run_deseq2(),
run_vst(),
summarize_de_methods()
Creates a ggplot2 volcano plot with significance thresholds and optional gene labels.
plot_volcano( de_results, lfc_threshold = 1, padj_threshold = 0.05, n_label = 10L, title = "Volcano Plot" )plot_volcano( de_results, lfc_threshold = 1, padj_threshold = 0.05, n_label = 10L, title = "Volcano Plot" )
de_results |
Data frame with DE results (must have padj and log2FC columns – auto-detected from DESeq2/edgeR/limma conventions) |
lfc_threshold |
Log2 fold-change threshold for significance (default 1) |
padj_threshold |
Adjusted p-value threshold (default 0.05) |
n_label |
Number of top genes to label (default 10) |
title |
Plot title |
A ggplot2 object
Other differential-expression:
compute_pca(),
plot_heatmap_de(),
plot_ma(),
plot_pca(),
run_deseq2(),
run_vst(),
summarize_de_methods()
Constructs a data frame suitable for survival analysis from GDC clinical data. Uses 'days_to_death' for deceased patients and 'days_to_last_follow_up' for censored (alive) patients. Merges with cytogenetic risk groups when available.
prepare_survival_data(clinical_data, cyto_file = NULL)prepare_survival_data(clinical_data, cyto_file = NULL)
clinical_data |
Cleaned clinical data frame (from clean_clinical_data) |
cyto_file |
Path to cytogenetic parquet file (from extract_cytogenetic_data), or NULL to skip cytogenetic integration |
Data frame with columns: patient_id, time_days, status (0=censored, 1=dead), age_years, gender, iss_stage, risk_group, plus individual cytogenetic markers
Other survival:
extract_risk_table(),
plot_forest(),
plot_km(),
run_cox_regression(),
run_kaplan_meier(),
run_km_by_expression(),
run_km_by_markers()
Opens an in-memory DuckDB connection, creates a VIEW on the specified parquet file, optionally applies filters, and returns a data frame.
query_commpass_parquet( data_type = c("clinical", "biospecimen", "rnaseq_counts", "rnaseq_sample_metadata", "rnaseq_gene_metadata"), data_dir = "data/raw", filters = NULL, collect = TRUE )query_commpass_parquet( data_type = c("clinical", "biospecimen", "rnaseq_counts", "rnaseq_sample_metadata", "rnaseq_gene_metadata"), data_dir = "data/raw", filters = NULL, collect = TRUE )
data_type |
One of "clinical", "biospecimen", "rnaseq_counts", "rnaseq_sample_metadata", "rnaseq_gene_metadata" |
data_dir |
Base directory containing parquet files |
filters |
Optional named list of filters. Names are column names, values are vectors of allowed values (used in WHERE ... IN (...)) |
collect |
If TRUE (default), collect results into a data frame. If FALSE, return a lazy tbl for further dplyr operations. |
A data frame (if collect=TRUE) or lazy dbplyr tbl
Other storage:
get_commpass_tbl()
## Not run: # Read all clinical data clinical <- query_commpass_parquet("clinical") # Filter to specific patients subset <- query_commpass_parquet( "clinical", filters = list(gender = "female", vital_status = "Alive") ) # Get lazy tbl for chaining tbl <- query_commpass_parquet("clinical", collect = FALSE) result <- tbl |> dplyr::filter(gender == "female") |> dplyr::collect() ## End(Not run)## Not run: # Read all clinical data clinical <- query_commpass_parquet("clinical") # Filter to specific patients subset <- query_commpass_parquet( "clinical", filters = list(gender = "female", vital_status = "Alive") ) # Get lazy tbl for chaining tbl <- query_commpass_parquet("clinical", collect = FALSE) result <- tbl |> dplyr::filter(gender == "female") |> dplyr::collect() ## End(Not run)
Queries the Genomic Data Commons (GDC) API for Multiple Myeloma Research Foundation (MMRF) CoMMpass study RNA-seq data. This function returns a query object that can be used with other TCGAbiolinks functions to download and prepare the data.
query_commpass_rna()query_commpass_rna()
A GDCquery object containing metadata for RNA-seq samples
Other data-acquisition:
acquire_commpass_data(),
download_clinical_data(),
download_gdc_rnaseq(),
download_s3_subset(),
get_commpass_clinical(),
list_s3_commpass()
## Not run: # Query for RNA-seq data query <- query_commpass_rna() # Use the query to download data # GDCdownload(query) # data <- GDCprepare(query) ## End(Not run)## Not run: # Query for RNA-seq data query <- query_commpass_rna() # Use the query to download data # GDCdownload(query) # data <- GDCprepare(query) ## End(Not run)
Render DE analysis report
render_de_report(de_results, output_dir = "results/reports")render_de_report(de_results, output_dir = "results/reports")
de_results |
DE results object |
output_dir |
Directory for output report |
Path to generated report
Fits a Cox PH model with specified covariates. Returns the model object, hazard ratios with confidence intervals, and concordance index.
run_cox_regression(surv_data, covariates = c("age_years", "gender"))run_cox_regression(surv_data, covariates = c("age_years", "gender"))
surv_data |
Data frame from [prepare_survival_data()] |
covariates |
Character vector of covariate column names |
List with components: - 'model': coxph object - 'hazard_ratios': data frame with HR, CI, p-values - 'concordance': C-index - 'n': number of patients in model - 'n_events': number of events - 'ph_test': cox.zph result for PH assumption
Other survival:
extract_risk_table(),
plot_forest(),
plot_km(),
prepare_survival_data(),
run_kaplan_meier(),
run_km_by_expression(),
run_km_by_markers()
Runs DESeq2 with optional apeglm log-fold-change shrinkage and optional paired longitudinal design for within-patient comparisons.
run_deseq2( se_data, clinical_data, design_formula = ~condition, shrink_lfc = TRUE, paired = FALSE )run_deseq2( se_data, clinical_data, design_formula = ~condition, shrink_lfc = TRUE, paired = FALSE )
se_data |
A SummarizedExperiment object |
clinical_data |
Clinical data frame |
design_formula |
Design formula for the model |
shrink_lfc |
If TRUE, apply apeglm LFC shrinkage (default TRUE) |
paired |
If TRUE, use paired longitudinal design with patient as blocking factor. Requires 'patient_id' and 'visit' columns in clinical_data. Only patients with >=2 timepoints are included. |
List with DE results including raw and shrunken (if requested)
Other differential-expression:
compute_pca(),
plot_heatmap_de(),
plot_ma(),
plot_pca(),
plot_volcano(),
run_vst(),
summarize_de_methods()
Run edgeR differential expression analysis
run_edger(se_data, clinical_data, design_formula = ~condition)run_edger(se_data, clinical_data, design_formula = ~condition)
se_data |
A SummarizedExperiment object |
clinical_data |
Clinical data frame |
design_formula |
Design formula for the model |
List with DE results
Performs pre-ranked GSEA using fgsea on differential expression results.
Genes are ranked by their test statistic (DESeq2 Wald stat or
-log10(pvalue) * sign(log2FC)). Gene sets come from MSigDB via
msigdbr.
run_gsea( de_results, gene_sets = "hallmark", gene_id_type = "ensembl_gene", min_size = 15L, max_size = 500L )run_gsea( de_results, gene_sets = "hallmark", gene_id_type = "ensembl_gene", min_size = 15L, max_size = 500L )
de_results |
Data frame of DE results. Must contain gene identifiers
(as rownames or in a |
gene_sets |
Character string specifying the MSigDB collection:
|
gene_id_type |
Type of gene identifiers: |
min_size |
Minimum gene set size (default: 15). |
max_size |
Maximum gene set size (default: 500). |
List with components:
Data frame with pathway, NES, pval, padj, size, leadingEdge columns.
Number of gene sets tested.
Number significant at padj < 0.05.
Top 20 enriched gene sets as a data frame.
Named numeric vector of gene ranks used.
Gene set collection used.
Other pathway:
annotate_de_results(),
annotate_genes(),
plot_enrichment_barplot(),
plot_enrichment_dotplot(),
plot_gsea_running_score(),
run_ora(),
run_pathway_analysis()
Fits a Kaplan-Meier survival curve, optionally stratified by a grouping variable. Returns the survfit object, log-rank test, and summary statistics.
run_kaplan_meier(surv_data, strata = NULL)run_kaplan_meier(surv_data, strata = NULL)
surv_data |
Data frame from [prepare_survival_data()] with columns 'time_days' and 'status' |
strata |
Column name to stratify by (e.g. "risk_group", "iss_stage", "gender"), or NULL for overall curve |
List with components: - 'fit': survfit object - 'logrank': survdiff result (NULL if no strata) - 'logrank_p': log-rank p-value (NA if no strata) - 'median_survival': named vector of median survival times - 'n_per_group': sample sizes per stratum - 'strata': the stratification variable used
Other survival:
extract_risk_table(),
plot_forest(),
plot_km(),
prepare_survival_data(),
run_cox_regression(),
run_km_by_expression(),
run_km_by_markers()
Splits patients into groups based on expression of a single gene (e.g., median split) and runs Kaplan-Meier analysis with log-rank test. Connects differential expression results to clinical outcomes.
run_km_by_expression( surv_data, expr_matrix, gene, split = c("median", "tertile", "quartile", "top_bottom_20"), min_per_group = 5L )run_km_by_expression( surv_data, expr_matrix, gene, split = c("median", "tertile", "quartile", "top_bottom_20"), min_per_group = 5L )
surv_data |
Data frame from [prepare_survival_data()] with columns 'patient_id', 'time_days', 'status' |
expr_matrix |
Numeric matrix (genes x samples) of transformed expression values (e.g., VST). Column names must match 'surv_data$patient_id'. |
gene |
Gene name (must be a rowname of 'expr_matrix') |
split |
Split method: "median" (default), "tertile", "quartile", or "top_bottom_20" |
min_per_group |
Minimum patients per group (default 5) |
List compatible with [plot_km()]: fit, logrank_p, median_survival, n_per_group, strata, gene, split_method. Returns a list with 'fit = NULL' and a 'note' if requirements not met.
Other survival:
extract_risk_table(),
plot_forest(),
plot_km(),
prepare_survival_data(),
run_cox_regression(),
run_kaplan_meier(),
run_km_by_markers()
Iterates over cytogenetic markers and runs stratified KM analysis for each. Markers with fewer than 'min_positive' positive cases are skipped.
run_km_by_markers(surv_data, markers = NULL, min_positive = 3L)run_km_by_markers(surv_data, markers = NULL, min_positive = 3L)
surv_data |
Data frame from [prepare_survival_data()] |
markers |
Character vector of marker column names. Default auto-detects. |
min_positive |
Minimum number of positive cases to run analysis (default 3) |
Named list of KM results, one per marker
Other survival:
extract_risk_table(),
plot_forest(),
plot_km(),
prepare_survival_data(),
run_cox_regression(),
run_kaplan_meier(),
run_km_by_expression()
Run limma differential expression analysis
run_limma(se_data, clinical_data, design_formula = ~condition)run_limma(se_data, clinical_data, design_formula = ~condition)
se_data |
A SummarizedExperiment object |
clinical_data |
Clinical data frame |
design_formula |
Design formula for the model |
List with DE results
Tests whether a set of significant genes is over-represented in gene set collections using Fisher's exact test (hypergeometric distribution). Gene sets come from MSigDB via msigdbr.
run_ora( sig_genes, universe, gene_sets = "hallmark", gene_id_type = "ensembl_gene", min_size = 10L, max_size = 500L )run_ora( sig_genes, universe, gene_sets = "hallmark", gene_id_type = "ensembl_gene", min_size = 10L, max_size = 500L )
sig_genes |
Character vector of significant gene identifiers. |
universe |
Character vector of all tested gene identifiers (background). |
gene_sets |
Character string specifying the MSigDB collection (same options as [run_gsea()]), or a named list of character vectors. |
gene_id_type |
Type of gene identifiers: |
min_size |
Minimum gene set size (default: 10). |
max_size |
Maximum gene set size (default: 500). |
List with components:
Data frame with pathway, overlap, gene_set_size, universe_size, p_value, padj, odds_ratio, overlapping_genes.
Number of pathways tested.
Number significant at padj < 0.05.
Number of input significant genes.
Top 20 enriched pathways as a data frame.
Other pathway:
annotate_de_results(),
annotate_genes(),
plot_enrichment_barplot(),
plot_enrichment_dotplot(),
plot_gsea_running_score(),
run_gsea(),
run_pathway_analysis()
Wrapper that runs ORA on consensus DE genes. Uses MSigDB Hallmark gene sets by default.
run_pathway_analysis(de_genes, method = "hallmark")run_pathway_analysis(de_genes, method = "hallmark")
de_genes |
Consensus DE result from [find_consensus_genes()] with
|
method |
Gene set collection: |
List with ORA results (see [run_ora()]).
Other pathway:
annotate_de_results(),
annotate_genes(),
plot_enrichment_barplot(),
plot_enrichment_dotplot(),
plot_gsea_running_score(),
run_gsea(),
run_ora()
Wrapper around DESeq2::vst() for use in visualization. VST is appropriate for PCA, heatmaps, and clustering – not for DE testing (which uses raw counts).
run_vst(se_data, blind = TRUE)run_vst(se_data, blind = TRUE)
se_data |
SummarizedExperiment with "counts" assay |
blind |
Logical, whether VST should be blind to the design (default TRUE for exploratory analysis) |
SummarizedExperiment with "vst" assay added, or NULL if DESeq2 is not available
Other differential-expression:
compute_pca(),
plot_heatmap_de(),
plot_ma(),
plot_pca(),
plot_volcano(),
run_deseq2(),
summarize_de_methods()
Save results with timestamp
save_timestamped(object, base_name, dir = "results")save_timestamped(object, base_name, dir = "results")
object |
R object to save |
base_name |
Base name for the output file |
dir |
Directory to save to (default: "results") |
Setup logging
setup_logging(log_file = NULL)setup_logging(log_file = NULL)
log_file |
Optional path to a log file. If NULL, logs to console. |
Plotly htmlwidgets capture parent environments in closures, causing 60-80x size inflation when serialized via saveRDS/targets. This function runs plotly_build() to resolve lazy data, then replaces closure environments with emptyenv() to eliminate the bloat.
strip_plotly(p)strip_plotly(p)
p |
A plotly htmlwidget object |
The same plotly object with closures stripped (much smaller on disk)
Other utilities:
create_summary_table(),
example_data(),
export_h5ad(),
format_file_size(),
format_with_commas(),
gene_report()
## Not run: p <- plotly::plot_ly(mtcars, x = ~mpg, type = "histogram") object.size(p) # large p2 <- strip_plotly(p) object.size(p2) # small ## End(Not run)## Not run: p <- plotly::plot_ly(mtcars, x = ~mpg, type = "histogram") object.size(p) # large p2 <- strip_plotly(p) object.size(p2) # small ## End(Not run)
Computes frequency and percentage for each marker and risk group.
summarize_cytogenetics(cyto_data)summarize_cytogenetics(cyto_data)
cyto_data |
Data frame from extract_cytogenetic_data() |
Data frame with marker, n_positive, n_tested, pct columns
Other cytogenetics:
calculate_cooccurrence(),
compute_riss(),
extract_cytogenetic_data(),
plot_cooccurrence_heatmap(),
plot_cytogenetic_oncoprint(),
plot_expression_by_subtype()
## Not run: cyto <- arrow::read_parquet("data/raw/clinical/cytogenetic_data.parquet") summarize_cytogenetics(cyto) ## End(Not run)## Not run: cyto <- arrow::read_parquet("data/raw/clinical/cytogenetic_data.parquet") summarize_cytogenetics(cyto) ## End(Not run)
Generate summary statistics
summarize_data(se_data)summarize_data(se_data)
se_data |
A SummarizedExperiment object with a "counts" assay |
Creates a summary table showing the number of significant genes per DE method.
summarize_de_methods(de_list, padj_threshold = 0.05, lfc_threshold = 1)summarize_de_methods(de_list, padj_threshold = 0.05, lfc_threshold = 1)
de_list |
Named list of DE result lists (each with results_table) |
padj_threshold |
Adjusted p-value threshold (default 0.05) |
lfc_threshold |
Log2 fold-change threshold (default 1) |
Data frame with method, n_tested, n_sig, n_up, n_down
Other differential-expression:
compute_pca(),
plot_heatmap_de(),
plot_ma(),
plot_pca(),
plot_volcano(),
run_deseq2(),
run_vst()
Creates a one-row-per-patient summary of treatment history.
summarize_treatment(treatment_clean)summarize_treatment(treatment_clean)
treatment_clean |
Cleaned treatment data from [clean_treatment_data()] |
Data frame with columns: patient_id, n_lines, first_regimen, first_regimen_class, had_transplant, best_overall_response
Other data-cleaning:
clean_clinical_data(),
clean_expression_data(),
clean_treatment_data(),
integrate_clinical_expression()