RNA Seq Bioinformatics Pipeline Overview

Canopy Biosciences employs a custom analysis pipeline to analyze RNA Seq data. A comprehensive report is compiled from the pipeline output and all data including sequencing fastq files, volcano plots and heat maps, and pathway analysis is supplied to the customer via a cloud link.

RNA Sequencing and Analysis

Samples were prepared according to library kit manufacturer’s protocol, indexed, pooled, and sequenced on an Illumina HiSeq. Basecalls and demultiplexing were performed with Illumina’s bcl2fastq software and a custom python demultiplexing program with a maximum of one mismatch in the indexing read. RNA-seq reads were then aligned to the Ensembl release 76 top-level assembly with STAR version 2.0.4b. Gene counts were derived from the number of uniquely aligned unambiguous reads by Subread:featureCount version 1.4.5. Isoform expression of known Ensembl transcripts were estimated with Sailfish version 0.6.13. Sequencing performance was assessed for the total number of aligned reads, total number of uniquely aligned reads, and features detected. The ribosomal fraction, known junction saturation, and read distribution over known gene models were quantified with RSeQC version 2.3.

All gene counts were then imported into the R/Bioconductor package EdgeR and TMM normalization size factors were calculated to adjust for samples for differences in library size. Ribosomal genes and genes not expressed in the smallest group size minus one samples greater than one count-per-million were excluded from further analysis. The TMM size factors and the matrix of counts were then imported into the R/Bioconductor package Limma. Weighted likelihoods based on the observed mean-variance relationship of every gene and sample were then calculated for all samples with the voomWithQualityWeights. The performance of all genes was assessed with plots of the residual standard deviation of every gene to their average log-count with a robustly fitted trend line of the residuals. Differential expression analysis was then performed to analyze for differences between conditions and the results were filtered for only those genes with Benjamini-Hochberg false-discovery rate adjusted p-values less than or equal to 0.05.

For each contrast extracted with Limma, global perturbations in known Gene Ontology (GO) terms and KEGG pathways were detected using the R/Bioconductor package GAGE to test for changes in expression of the reported log 2 fold-changes reported by Limma in each term versus the background log 2 fold-changes of all genes found outside the respective term. The R/Bioconductor package heatmap and Pathview was used to display heatmaps or annotated KEGG graphs across groups of samples for each GO term or KEGG pathway (respectively) with a Benjamini-Hochberg false-discovery rate adjusted p-value less than or equal to 0.05.

To find the most critical genes, the raw counts were variance stabilized with the R/Bioconductor package DESeq2 and was then analyzed via weighted gene correlation network analysis with the R/Bioconductor package WGCNA. Briefly, all genes were correlated across each other by Pearson correlations and clustered by expression similarity into unsigned modules using a power threshold empirically determined from the data. An eigengene was then created for each de novo cluster and its expression profile was then correlated across all coefficients of the model matrix. Because these clusters of genes were created by expression profile rather than known functional similarity, the clustered modules were given the names of random colors where grey is the only module that has any pre-existing definition of containing genes that do not cluster well with others. These de-novo clustered genes were then tested for functional enrichment of known GO terms with hypergeometric tests available in the R/Bioconductor package clusterProfiler. Significant terms with Benjamini-Hochberg adjusted p-values less than 0.05 were then collapsed by similarity into clusterProfiler category network plots to display the most significant terms for each module of hub genes in order to interpolate the function of each significant module. The information for all clustered genes for each module were then combined with their respective statistical significance results from Limma to determine whether or not those features were also found to be significantly differentially expressed.