Basal Contamination of Sequencing: Lessons from the GTEx dataset
Tim O. Nieuwenhuis, Stephanie Yang, Rohan X. Verma, Vamsee Pillalamarri, Dan E. Arking, Avi Z. Rosenberg, Matthew N. McCall, Marc K. Halushka
Received Date: 19th September 19
One of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata and validating RNA-Seq from other studies. Of 48 analyzed tissues in GTEx, 24 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and CELA3A). Fifteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was highly associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (p=2.7e-75 and p=8.9e-154 respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 1,841 (15.8%) samples (defined as ≥500 PRSS1 read counts). It also led to eQTL assignments in inappropriate tissues among these 19 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.
Read in full at bioRxiv.
This is an abstract of a preprint hosted on an independent third party site. It has not been peer reviewed but is currently under consideration at Nature Communications.