Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

Huilong Du and Chengzhi Liang

Nov 27, 2018

Received Date: 5th November 2018

Because complex eukaryotic genomes contain a large number of repetitive sequences, their assemblies are often fragmented and lose value as reference genomes, typically because of incomplete sequences and short contigs that cannot be anchored to, or are mispositioned on, chromosomes. Here we report a novel method, Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats with high efficiency using single-molecule sequencing data and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the gapped regions, including previously unreachable tandemly repetitive sequences in rice, using single-molecule sequencing data only. The Pinku1 genome was assembled into 12 scaffolds with a contig N50 size of 27.85 Mb. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in the maize B73 assembly and from 8.3 Mb to 54.4 Mb in the human HX1 assembly with HERA. We filled 96.9% of the gaps (leaving only 76) and anchored additional genes on chromosomes. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could also be filled, and that GRCh38 contained some potential errors that could be fixed. HERA serves as a new genome assembly or phasing method to generate high-quality sequences for complex genomes and to improve the contiguity and completeness of existing genomes, including the correction of assembly errors.
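The contiguity figures above are reported as contig N50. For readers unfamiliar with the metric, the following is a minimal sketch of the standard N50 calculation: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. It is not taken from HERA; the function name `n50` and the toy contig lengths are assumptions chosen for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Illustrative helper (hypothetical name, not part of HERA):
    sort contigs from longest to shortest and return the length of
    the contig at which the running total first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up lengths (arbitrary units, not real data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Under this definition, merging many short contigs into a few long ones raises N50 even when the total assembled length is unchanged, which is why the metric is used to summarize contiguity.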

Read in full at bioRxiv.

This is an abstract of a preprint hosted on an independent third party site. It has not been peer reviewed but is currently under consideration at Nature Communications.


Nature Communications

Nature Research, Springer Nature