Read clouds uncover variation in complex regions of the human genome.

Bishara, Alex; Liu, Yuling; Weng, Ziming; Kashef-Haghighi, Dorna; Newburger, Daniel E; West, Robert; Sidow, Arend; Batzoglou, Serafim

Affiliation

Bishara A; Department of Computer Science, Stanford University, Stanford, California 94305, USA;
Liu Y; Department of Computer Science, Stanford University, Stanford, California 94305, USA; Department of Chemistry, Stanford University, Stanford, California 94305, USA;
Weng Z; Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA;
Kashef-Haghighi D; Department of Computer Science, Stanford University, Stanford, California 94305, USA;
Newburger DE; Biomedical Informatics Training Program, Stanford, California 94305, USA;
West R; Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA;
Sidow A; Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA; Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA;
Batzoglou S; Department of Computer Science, Stanford University, Stanford, California 94305, USA.

Genome Res; 25(10): 1570-80, 2015 Oct.

Article in English | MEDLINE | ID: mdl-26286554

ABSTRACT

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.

Subjects

Genetic Variation; Genome, Human; Sequence Analysis, DNA/methods; Algorithms; Carcinoma, Ductal/genetics; Carcinoma, Ductal, Breast/genetics; DNA Fragmentation; Humans; Sequence Alignment/methods

Full text: 1 | Database: MEDLINE | Main subject: Genetic Variation / Genome, Human / Sequence Analysis, DNA | Limits: Humans | Language: English | Year of publication: 2015 | Document type: Article
