Investigating the Role of Fluid Shear Stress in Metastasis using RNA-Seq Data

ISIB Program sponsored by the National Heart Lung and Blood Institute (NHLBI), Grant # HL161716-01. Faculty Mentor: Patrick Breheny

Jacob Gerber, Lauren Walker, Owen Gould, and Sia Gbondo-Tugbawa

Biological Terminology

  • Metastasis: process by which cancer cells break from their original cluster, spread to other parts of the body, and form a new cluster there

    • break off → circulatory system → invasion

    • Most cancers are treatable while the cells stay localized; they become fatal once the cells spread throughout the body (e.g., prostate cancer spreading to the lungs or bone)

    • unique to cancer cells; healthy cells cannot detach and function in another organ/tissue

  • Cancer cells possibly primed for metastasis through fluid shear stress, specifically from blood circulation

    • Fluid shear stress on cancer cells may cause them to express certain genes differently, and this gene expression could help prepare them better for metastasis

    • Previous studies suggest that cancer cells metastasize less when placed in a new location without stress from the circulatory system

Experiment at the Holden Comprehensive Cancer Center

  • Researchers (under PI Michael Henry) exposed cancer cells to fluid shear stress and measured gene expression before and after the forces were applied

    • Cells exposed to stress denoted “Sheared” and those that weren’t denoted “Static”
  • Measured gene expression using RNA-sequencing at 3 hours, 12 hours, and 24 hours, with 3 trials per condition and time point

  • Sample size of \(n = 18\) for each gene

    • (2 conditions) x (3 time points) x (3 trials)
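
As a minimal sketch of this design, the 18 sample labels can be enumerated as the product of condition, time point, and trial. The labels A, B, C for the trials are taken from the column names of the count table shown later; the variable names are illustrative only.

```python
from itertools import product

conditions = ["Static", "Sheared"]
hours = [3, 12, 24]
trials = ["A", "B", "C"]

# One label per (condition, time point, trial) combination.
samples = [f"{c}{h}{t}" for c, h, t in product(conditions, hours, trials)]

print(len(samples))
print(samples[:3])
```

Every gene has one count per label, which is where \(n = 18\) comes from.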

RNA-Sequencing

  • Method of turning RNA material from tissues or cells into readable genomic data

    1. Isolate RNA from inside the sample cell
    2. Copy RNA to DNA with enzymes
    3. Sequence the DNA with a machine to obtain counts of the RNA molecules in the sample
  • Very widely used in genetics research and helps us quantify gene expression within a specific sample

    • more RNA molecules present = more gene expression

Research Question

How do the forces of blood flow affect the gene expression of cancer cells?

  • Analyzing 58735 genes, whose roles include helping to make proteins, regulating the cell, and serving as non-coding material

  • Previous work suggests that cancer cells exposed to fluid shear stress tend to be better suited to metastasize (Leeuw et al. 2016)

  • Test whether certain genes are expressed significantly more or less under the stresses of blood flow, and whether the difference in gene expression might prepare cells for metastasis

  • How do we model & test this question?

Overdispersion

  • Poisson model popular for count data
  • However, assumes \(\text{mean}=\text{variance}\)
  • With RNA-Sequencing, this is rarely true
  • How to analyze these data?
    • Negative binomial is an option
    • Reduce to something familiar: linear models
    • Limma is an R package for fitting linear models to RNA-Seq data
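
One rough way to see the mean-variance mismatch is to compare the sample variance with the sample mean within each group of three trials. The sketch below does this for ENSG00000268903, the gene shown as kept on the Handling Low Counts slide. It ignores differences in sequencing depth between samples and is not the method used in the analysis; with only three trials per group the ratios are noisy, and some groups fall below 1.

```python
from statistics import mean, variance

# Counts for ENSG00000268903, grouped by (condition, hours); three trials each.
counts = {
    ("Static", 3): [62, 72, 51],
    ("Static", 12): [103, 102, 120],
    ("Static", 24): [95, 138, 260],
    ("Sheared", 3): [53, 43, 61],
    ("Sheared", 12): [94, 108, 98],
    ("Sheared", 24): [81, 104, 106],
}

def dispersion_ratio(values):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    return variance(values) / mean(values)

ratios = {group: dispersion_ratio(v) for group, v in counts.items()}
for (condition, hours), ratio in ratios.items():
    print(f"{condition:8s}{hours:>3d}h  variance/mean = {ratio:.2f}")
```

A ratio far above 1, as in the Static 24-hour group, is the kind of overdispersion a Poisson model cannot accommodate.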

Data Issues

For fitting a linear model, the raw data have some problematic features.

  • Many genes have zero (or near zero) expression counts

  • There is a clear trend in the variance

  • Small sample sizes mean there is little data on which to base gene-wise variability estimates

Handling Low Counts

  • Keep genes that have an expression count of at least \(10\) in at least half of the observations

Example of a gene filtered out:

Gene Static3A Static3B Static3C Static12A Static12B Static12C Static24A Static24B Static24C Sheared3A Sheared3B Sheared3C Sheared12A Sheared12B Sheared12C Sheared24A Sheared24B Sheared24C
ENSG00000278267 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Example of a gene kept in:

Gene Static3A Static3B Static3C Static12A Static12B Static12C Static24A Static24B Static24C Sheared3A Sheared3B Sheared3C Sheared12A Sheared12B Sheared12C Sheared24A Sheared24B Sheared24C
ENSG00000268903 62 72 51 103 102 120 95 138 260 53 43 61 94 108 98 81 104 106

  • Filtering reduces the data set from \(58735\) genes to \(15401\) genes
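
The filtering rule can be sketched as a small function applied to the two example genes above. The function name `keep_gene` and its parameters are illustrative, not part of limma.

```python
def keep_gene(counts, min_count=10, min_fraction=0.5):
    """Return True if at least `min_fraction` of samples have count >= `min_count`."""
    n_pass = sum(1 for c in counts if c >= min_count)
    return n_pass >= min_fraction * len(counts)

# ENSG00000278267: no sample reaches a count of 10.
filtered_out = [0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# ENSG00000268903: every sample has a count of at least 10.
kept = [62, 72, 51, 103, 102, 120, 95, 138, 260,
        53, 43, 61, 94, 108, 98, 81, 104, 106]

print(keep_gene(filtered_out))
print(keep_gene(kept))
```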

Handling Variance with Voom

  • Use the voom function within limma

  • Transforms count data to \(\log_2\text{CPM}\) and estimates the mean-variance relationship

  • voom also computes weights for each observation based on precision

  • No variance trend after accounting for the weights from voom
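
To make the \(\log_2\text{CPM}\) scale concrete, here is a minimal sketch of the transform for a single count. It uses the small offsets from the voom transform (0.5 added to the count, 1 added to the library size), which keep zero counts finite. The library size is a hypothetical value, since per-sample totals are not given here, and the sketch does not reproduce voom's precision weights.

```python
import math

def log2_cpm(count, library_size):
    """log2 counts per million with the 0.5 / 1 offsets used in the voom transform."""
    return math.log2((count + 0.5) / (library_size + 1.0) * 1e6)

library_size = 20_000_000  # hypothetical total read count for one sample

print(round(log2_cpm(62, library_size), 3))
print(round(log2_cpm(0, library_size), 3))   # finite, thanks to the offset
```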

Pooling Info with Empirical Bayes

  • Few observations mean there is little data on which to base variability estimates

  • Use an Empirical Bayes method to pool information across genes

  • This gives a sense of how variability is distributed in the total gene population

  • The additional info allows for more accurate standard error estimates in model parameters
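
The pooling idea can be sketched as a weighted average of each gene's own variance and a prior variance shared across genes, weighted by degrees of freedom. This is the general form of a moderated variance. In limma the prior variance and prior degrees of freedom are estimated from all genes; here they are hypothetical numbers, as are the gene-wise variances.

```python
def moderated_variance(s2, df, prior_s2, prior_df):
    """Shrink a gene-wise variance toward a prior variance, weighted by degrees of freedom."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)

residual_df = 18 - 6   # 18 samples, 6 coefficients in the per-gene model
prior_s2 = 0.04        # hypothetical prior variance
prior_df = 4.0         # hypothetical prior degrees of freedom

for s2 in (0.005, 0.04, 0.30):   # hypothetical gene-wise variances
    print(s2, round(moderated_variance(s2, residual_df, prior_s2, prior_df), 4))
```

Unusually small variances are pulled up and unusually large ones are pulled down, which stabilizes the standard errors when each gene has few observations.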

Linear Modeling

  • Interested in gene expression differences between static and sheared groups on average

  • Different genes may be expressed more or less at different time points

  • We fit the following linear model to each gene:

\[(\log_2\text{CPM})_i = \beta_{1,1} + \beta_{1,2}\text{(12H)}_i + \beta_{1,3}\text{(24H)}_i + \bigg(\beta_{2,1} + \beta_{2,2}\text{(12H)}_i + \beta_{2,3}\text{(24H)}_i\bigg)(\text{Sheared})_i + \varepsilon_i\]

  • The \(\log_2\) fold change (FC) is the difference in mean \(\log_2\text{CPM}\) between the sheared and static groups at a given time point

\[\begin{align*} \text{3 Hours}&: \quad \log_2(\text{FC}) = \beta_{2,1}\\ \text{12 Hours}&: \quad \log_2(\text{FC}) = \beta_{2,1} + \beta_{2,2}\\ \text{24 Hours}&: \quad \log_2(\text{FC}) = \beta_{2,1} + \beta_{2,3} \end{align*}\]
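
The fold-change expressions follow from subtracting the fitted static value from the fitted sheared value at each time point. The sketch below checks this numerically with hypothetical coefficients for one gene; `design_row`, `fitted`, and `log2_fc` are illustrative helpers, not limma functions.

```python
def design_row(sheared, hours):
    """Covariates in the order (b11, b12, b13, b21, b22, b23) of the model."""
    h12 = 1.0 if hours == 12 else 0.0
    h24 = 1.0 if hours == 24 else 0.0
    s = 1.0 if sheared else 0.0
    return [1.0, h12, h24, s, s * h12, s * h24]

def fitted(beta, sheared, hours):
    """Fitted log2-CPM for one condition and time point."""
    return sum(b * x for b, x in zip(beta, design_row(sheared, hours)))

def log2_fc(beta, hours):
    """Sheared minus static fitted log2-CPM at one time point."""
    return fitted(beta, True, hours) - fitted(beta, False, hours)

beta = [5.0, 0.3, 0.1, 0.8, -0.5, -1.2]  # hypothetical coefficients for one gene

for hours in (3, 12, 24):
    lfc = log2_fc(beta, hours)
    print(f"{hours:>2d}h  log2(FC) = {lfc:+.2f}  FC = {2 ** lfc:.2f}")
```

Raising 2 to the \(\log_2(\text{FC})\) gives the fold change on the original scale, so a value of 1 corresponds to a doubling of expression under shear.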

Hypothesis Testing on Differences

  • The decideTests function of limma tests using a moderated t statistic of the form

\[\frac{\log_{2}(\text{FC})}{se(\log_2(\text{FC}))}\]

  • Interpreted like a standard t statistic: a difference divided by the standard error of the difference

  • The standard error here uses the pooled information from the Empirical Bayes

  • Significance is not determined from raw (unadjusted) p-values

  • Instead, p-values are adjusted to control the false discovery rate, which accounts for the multiple hypothesis tests

Multiple Hypothesis Testing

  • For one test: significance level \(\alpha\) = \(P(\text{Type I Error})\)
  • One test per gene \(\implies\) at least one Type I Error almost guaranteed
  • Corrections:
    • Family-wise Error Rate (FWER)
      • Probability of at least one false positive
      • e.g., Bonferroni Correction
    • False Discovery Rate (FDR)
      • Expected proportion of false positives among the tests declared significant
      • Controlled by Benjamini-Hochberg Procedure (BH)
  • BH used here because it gives higher power than FWER control
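
The BH adjustment can be sketched in a few lines: sort the p-values, scale each by (number of tests) / (rank), then enforce monotonicity from the largest p-value down. The function name `bh_adjust` and the example p-values are hypothetical; this is a plain re-implementation for illustration, not limma's code.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(bh_adjust(p_values))
```

A gene is then called significant when its adjusted p-value falls below the chosen FDR level.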

Gene Set Enrichment Analysis

  • A statistical method used to determine whether a group of genes shows significant coordinated changes between two biological states
  • It is hard to find patterns by looking through many genes one by one, so working with gene sets avoids interpreting single genes in isolation
  • Provides insights into biological pathways and processes
  • A gene set is a group of genes with common biological functions
  • Gene regulation is the process by which genes are turned on or off (whether or not they make proteins)

Results (3 Hour Group)

There were 84 significant genes.

Results (12 Hour Group)

There were 290 significant genes.

Results (24 Hour Group)

There were 254 significant genes.

Discussion

  • High-throughput data allows for reliable variability estimates even with small sample sizes

  • We were able to narrow down which gene pathways were significantly affected by fluid shear stress

    • Candidates for further analysis on their direct relation to metastasis
  • Time had a noticeable effect on gene expression

    • 3H mostly up-regulated, 24H mostly down-regulated
    • Specific significant gene pathways changed over time
  • Although the RNA-seq data provides a good foundation, biological expertise is necessary to conduct deeper analyses on each gene pathway

References

Breheny, Patrick. 2025a. “False Discovery Rates.” University of Iowa; University Lecture.
———. 2025b. “Family-Wise Error Rates.” University of Iowa; University Lecture.
Leeuw, Christiaan A de, Benjamin M Neale, Tom Heskes, and Danielle Posthuma. 2016. “The Statistical Properties of Gene-Set Analysis.” Nat. Rev. Genet. 17 (6): 353–64.
Nova, Ian C. n.d. “RNA-Seq (RNA Sequencing).” Genome.gov. https://www.genome.gov/genetics-glossary/RNA-seq.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Smyth, Gordon K. 2004. “Limma Moderated t-Statistics and b-Statistics.” https://support.bioconductor.org/p/6124.