Investigating the Role of Fluid Shear Stress in Metastasis using RNA-Seq Data

ISIB Program sponsored by the National Heart Lung and Blood Institute (NHLBI), Grant # HL161716-01. Faculty Mentor: Patrick Breheny

Jacob Gerber, Lauren Walker, Owen Gould, and Sia Gbondo-Tugbawa

Biological Terminology

Metastasis: process by which cancer cells break from their original cluster, spread to other parts of the body, and form a new cluster there
- break off –> circulatory system –> invasion
- Most cancers are treatable when cells don’t spread; become fatal once they spread throughout the body (e.g. prostate cancer spreading to the lungs or bone)
- unique to cancer cells; healthy cells cannot detach and function in another organ/tissue
Cancer cells possibly primed for metastasis through fluid shear stress, specifically from blood circulation
- Fluid shear stress on cancer cells may cause them to express certain genes differently, and this gene expression could help prepare them better for metastasis
- Previous studies suggest that cancer cells metastasize less when placed in a new location without exposure to stress from the circulatory system

Experiment at the Holden Comprehensive Cancer Center

Researchers (under PI Michael Henry) exposed cancer cells to fluid shear stress and measured how gene expression differed between cells that were and were not exposed to the forces
- Cells exposed to stress denoted “Sheared” and those that weren’t denoted “Static”
Measured gene expression using RNA-sequencing at 3 hours, 12 hours, and 24 hours, with 3 trials at each time point
Sample size of \(n = 18\) for each gene
- (2 conditions) x (3 time points) x (3 trials)
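
As a small illustration of the design above, the 18 samples can be enumerated by crossing condition, time point, and trial. The trial labels A, B, and C are taken from the column names in the count tables shown later; the variable names are only illustrative.

```python
from itertools import product

# Factors of the experimental design described above.
conditions = ["Static", "Sheared"]
hours = [3, 12, 24]
trials = ["A", "B", "C"]  # labels assumed from the count-table column names

# Cross the three factors to get one label per sample, e.g. "Static3A".
sample_names = [f"{c}{h}{t}" for c, h, t in product(conditions, hours, trials)]

print(len(sample_names), "samples")
print(sample_names)
```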

RNA-Sequencing

Method of turning RNA material from tissues or cells into readable genomic data
1. Isolate RNA from inside the sample cells
2. Copy the RNA to DNA with enzymes
3. Sequence the DNA with a machine to obtain counts of the RNA molecules in the sample
Very widely used in genetics research and helps us quantify gene expression within a specific sample
- more RNA molecules present = more gene expression

Research Question

How do the forces of blood flow affect the gene expression of cancer cells?

Analyzing 58735 genes, with roles that include helping to make proteins, regulating the cell, and serving as non-coding material
Previous work suggests that cancer cells exposed to fluid shear stress tend to be better suited to metastasize (Leeuw et al. 2016)
Test whether certain genes are expressed significantly more or less under the stresses of blood flow, and whether the difference in gene expression might prepare cells for metastasis
How do we model & test this question?

Overdispersion

Poisson model popular for count data
However, assumes \(\text{mean}=\text{variance}\)
With RNA-Sequencing, this is rarely true
How to analyze these data?
- Negative binomial is an option
- Reduce to something familiar: linear models
- Limma is an R package for fitting linear models to RNA-Seq data
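
A rough way to see overdispersion is to compare the sample mean and variance of replicate counts. The sketch below uses the three Static 24-hour counts of gene ENSG00000268903 from the "Handling Low Counts" table; with only three replicates the variance estimate is very noisy, so this is an illustration rather than a formal test.

```python
from statistics import mean, variance

# Static 24-hour replicate counts for ENSG00000268903 (from the kept-gene table).
static_24h = [95, 138, 260]

group_mean = mean(static_24h)
group_var = variance(static_24h)  # sample variance (n - 1 denominator)

# Under a Poisson model this ratio would be close to 1.
dispersion_ratio = group_var / group_mean

print(f"mean = {group_mean:.1f}, variance = {group_var:.1f}, ratio = {dispersion_ratio:.1f}")
```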

Data Issues

For fitting a linear model, the raw data are poorly behaved in several ways:

Many genes have zero (or near zero) expression counts
There is a clear trend in the variance
Small sample sizes mean there is little data on which to base gene-wise variability estimates

Handling Low Counts

Keep genes that have an expression count of at least \(10\) in at least half of the observations


Example of a gene filtered out:

Gene	Static3A	Static3B	Static3C	Static12A	Static12B	Static12C	Static24A	Static24B	Static24C	Sheared3A	Sheared3B	Sheared3C	Sheared12A	Sheared12B	Sheared12C	Sheared24A	Sheared24B	Sheared24C
ENSG00000278267	0	2	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0

Example of a gene kept in:

Gene	Static3A	Static3B	Static3C	Static12A	Static12B	Static12C	Static24A	Static24B	Static24C	Sheared3A	Sheared3B	Sheared3C	Sheared12A	Sheared12B	Sheared12C	Sheared24A	Sheared24B	Sheared24C
ENSG00000268903	62	72	51	103	102	120	95	138	260	53	43	61	94	108	98	81	104	106


Filtering reduces from \(58735\) genes to \(15401\) genes
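
The filtering rule can be sketched as a small function and checked against the two example genes above. The function name and parameters are hypothetical; only the threshold (count of at least 10 in at least half of the observations) comes from the rule stated in this section.

```python
def keep_gene(counts, min_count=10, min_fraction=0.5):
    """Return True if at least `min_fraction` of samples have count >= `min_count`."""
    n_pass = sum(c >= min_count for c in counts)
    return n_pass >= min_fraction * len(counts)

# Counts copied from the two example rows above (18 samples each).
filtered_out = [0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]          # ENSG00000278267
kept_in = [62, 72, 51, 103, 102, 120, 95, 138, 260,
           53, 43, 61, 94, 108, 98, 81, 104, 106]                               # ENSG00000268903

print(keep_gene(filtered_out), keep_gene(kept_in))
```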

Handling Variance with Voom

Use the voom function within limma
Transforms count data to \(\log_2\text{CPM}\) and estimates the mean-variance relationship
voom also computes weights for each observation based on precision
No variance trend after accounting for the weights from voom
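
For intuition, the \(\log_2\text{CPM}\) transformation can be sketched for a single count. The offsets of 0.5 and 1 follow the transformation described for voom; the library size below is a made-up value, and the precision weights that voom also computes are not reproduced here.

```python
import math

def log2_cpm(count, lib_size):
    """log2 counts-per-million with small offsets so that zero counts stay finite."""
    return math.log2((count + 0.5) / (lib_size + 1.0) * 1e6)

# Hypothetical library size (total reads in one sample); not from the poster.
example_lib_size = 20_000_000

for count in (0, 10, 100):
    print(count, round(log2_cpm(count, example_lib_size), 3))
```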

Pooling Info with Empirical Bayes

Few observations mean there is little data on which to base variability estimates
Use an Empirical Bayes method to pool information across genes
This gives a sense of how variability is distributed in the total gene population
The additional info allows for more accurate standard error estimates in model parameters
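
The pooling idea can be sketched as a weighted average of a gene's own variance and a prior variance estimated from all genes, which is the form of limma's moderated variance. In limma the prior values are estimated from the data; the numbers below are invented for illustration. The residual degrees of freedom of 12 assume 18 samples and the six-coefficient model given in the "Linear Modeling" section.

```python
def moderated_variance(s2, df_resid, s2_prior, df_prior):
    """Shrink a gene-wise variance toward a prior variance shared across genes."""
    return (df_prior * s2_prior + df_resid * s2) / (df_prior + df_resid)

# Hypothetical values: one noisy gene-wise variance and an invented prior.
gene_s2 = 0.20      # variance estimated from this gene alone
df_resid = 12       # 18 samples minus 6 model coefficients
prior_s2 = 0.05     # assumed typical variance across genes
df_prior = 4        # assumed strength of the prior

print(moderated_variance(gene_s2, df_resid, prior_s2, df_prior))
```

The result always lies between the gene's own variance and the prior variance, and a larger prior degrees of freedom pulls it closer to the prior.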

Linear Modeling

Interested in gene expression differences between static and sheared groups on average
Different genes may be expressed more or less at different time points
We fit the following linear model to each gene:

\[(\log_2\text{CPM})_i = \beta_{1,1} + \beta_{1,2}\text{(12H)}_i + \beta_{1,3}\text{(24H)}_i + \bigg(\beta_{2,1} + \beta_{2,2}\text{(12H)}_i + \beta_{2,3}\text{(24H)}_i\bigg)(\text{Sheared})_i + \varepsilon_i\]

Fold change quantifies the difference in means between sheared and static groups

\[\begin{align*} \text{3 Hours}&: \quad \log_2(\text{FC}) = \beta_{2,1}\\ \text{12 Hours}&: \quad \log_2(\text{FC}) = \beta_{2,1} + \beta_{2,2}\\ \text{24 Hours}&: \quad \log_2(\text{FC}) = \beta_{2,1} + \beta_{2,3} \end{align*}\]
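
As a worked example of the formulas above, the sketch below turns a set of interaction coefficients into \(\log_2(\text{FC})\) and fold change at each time point. The coefficient values are invented for illustration and are not results from this analysis.

```python
def log2_fold_changes(b21, b22, b23):
    """Sheared-vs-static log2 fold change at each time point, from the model coefficients."""
    return {"3H": b21, "12H": b21 + b22, "24H": b21 + b23}

# Hypothetical coefficients beta_{2,1}, beta_{2,2}, beta_{2,3} for one gene.
lfc = log2_fold_changes(0.5, -0.2, -0.9)

# Convert back from the log2 scale: FC > 1 means higher expression when sheared.
fold_change = {time: 2 ** value for time, value in lfc.items()}

for time in lfc:
    print(time, round(lfc[time], 2), round(fold_change[time], 3))
```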

Hypothesis Testing on Differences

The decideTests function of limma tests using a moderated t statistic of the form

\[\frac{\log_{2}(\text{FC})}{se(\log_2(\text{FC}))}\]

Can be interpreted the same way as a standard t statistic: a difference divided by the standard error of the difference
The standard error here uses the pooled information from the Empirical Bayes
Significance is not determined from raw (unadjusted) p-values
Adjusted p-values based on the false discovery rate account for multiple hypothesis tests

Multiple Hypothesis Testing

For one test: significance level \(\alpha\) = \(P(\text{Type I Error})\)
One test per gene \(\implies\) at least one Type I Error almost guaranteed
Corrections:
- Family-wise Error Rate (FWER)
  - Probability of at least one false positive
  - e.g., Bonferroni Correction
- False Discovery Rate (FDR)
  - Expected proportion of false positives among the tests declared significant
  - Controlled by Benjamini-Hochberg Procedure (BH)
BH used because controlling the FDR gives higher power than controlling the FWER
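
Both points can be sketched in a few lines. The first function shows why a Type I Error is almost guaranteed, under the simplifying assumption that the tests are independent (gene-level tests generally are not). The second is a minimal version of the Benjamini-Hochberg adjustment; the p-values are invented, and in practice limma performs this adjustment itself.

```python
def prob_any_false_positive(alpha, n_tests):
    """P(at least one Type I error) when all nulls are true and tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(prob_any_false_positive(0.05, 15401))   # number of genes after filtering
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))   # hypothetical p-values
```

A gene is then called significant when its adjusted p-value falls below the chosen FDR level.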

Gene Set Enrichment Analysis

A statistical method used to determine whether a group of genes shows significant coordinated changes between two biological states
It is hard to look through many genes one by one to find patterns, so using gene sets avoids interpreting single genes
Provides insights into biological pathways and processes
A gene set is a group of genes with common biological functions
Gene regulation is the process by which genes are turned on or off (whether or not they make proteins)
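
To give a feel for set-level testing, the sketch below uses a hypergeometric over-representation test. This is a simpler stand-in and not necessarily the enrichment method used for the results here: it asks whether a gene set contains more significant genes than expected by chance. All numbers are invented.

```python
from math import comb

def overrep_pvalue(n_genes, n_in_set, n_significant, n_overlap):
    """P(overlap >= n_overlap) when the significant genes are drawn at random."""
    total = comb(n_genes, n_significant)
    upper = min(n_in_set, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 10 genes tested, 4 in the set, 3 significant, 2 of those in the set.
print(overrep_pvalue(10, 4, 3, 2))
```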

Results (3 Hour Group)

There were 84 significant genes.

Results (12 Hour Group)

There were 290 significant genes.

Results (24 Hour Group)

There were 254 significant genes.

Discussion

High-throughput data allows for reliable variability estimates even with small sample sizes
We were able to narrow down which gene pathways were significantly affected by fluid shear stress
- Candidates for further analysis on their direct relation to metastasis
Time had a noticeable effect on gene expression
- 3H mostly up-regulated, 24H mostly down-regulated
- Specific significant gene pathways changed over time
Although the RNA-seq data provides a good foundation, biological expertise is necessary to conduct deeper analyses on each gene pathway

References

Breheny, Patrick. 2025a. “False Discovery Rates.” University of Iowa; University Lecture.

———. 2025b. “Family-Wise Error Rates.” University of Iowa; University Lecture.

Leeuw, Christiaan A de, Benjamin M Neale, Tom Heskes, and Danielle Posthuma. 2016. “The Statistical Properties of Gene-Set Analysis.” Nat. Rev. Genet. 17 (6): 353–64.

Nova, Ian C. n.d. “RNA-Seq (RNA Sequencing).” Genome.gov. https://www.genome.gov/genetics-glossary/RNA-seq.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Smyth, Gordon K. 2004. “Limma Moderated t-Statistics and b-Statistics.” https://support.bioconductor.org/p/6124.