Search | Nursing VHL Search Portal

1.

Digital genotyping of sorghum - a diverse plant species with a large repeat-rich genome.

Morishige, Daryl T; Klein, Patricia E; Hilley, Josie L; Sahraeian, Sayed Mohammad Ebrahim; Sharma, Arun; Mullet, John E.

BMC Genomics ; 14: 448, 2013 Jul 05.

Article in English | MEDLINE | ID: mdl-23829350

ABSTRACT

BACKGROUND: Rapid acquisition of accurate genotyping information is essential for all genetic marker-based studies. For species with relatively small genomes, complete genome resequencing is a feasible approach for genotyping; however, for species with large and highly repetitive genomes, the acquisition of whole genome sequences for the purpose of genotyping is still relatively inefficient and too expensive to be carried out on a high-throughput basis. Sorghum bicolor is a C4 grass with a sequenced genome size of ~730 Mb, of which ~80% is highly repetitive. We have developed a restriction enzyme-targeted genome resequencing method for genetic analysis, termed Digital Genotyping (DG), to be applied to sorghum and other grass species with large repeat-rich genomes. RESULTS: DG templates are generated using one of three methylation-sensitive restriction enzymes that recognize a nested set of 4, 6 or 8 bp GC-rich sequences, enabling varying depth of analysis and integration of results among assays. Variation in sequencing efficiency among DG markers was correlated with template GC-content and length. The expected DG allele sequence was obtained 97.3% of the time with a ratio of expected to alternative allele sequence acquisition of >20:1. A genetic map aligned to the sorghum genome sequence with an average resolution of 1.47 cM was constructed using 1,772 DG markers from 137 recombinant inbred lines. The DG map enhanced the detection of QTL for variation in plant height and precisely aligned QTL such as Dw3 to underlying genes/alleles. Higher-resolution NgoMIV-based DG haplotypes were used to trace the origin of DNA on SBI-06, spanning Ma1 and Dw2 from progenitors to BTx623 and IS3620C. DG marker analysis identified the correct location of two misassembled regions and located seven super contigs in the sorghum reference genome sequence. CONCLUSION: DG technology provides a cost-effective approach to rapidly generate accurate genotyping data in sorghum.
Currently, data derived from DG are used for many marker-based analyses, including marker-assisted breeding, pedigree and QTL analysis, genetic map construction, map-based gene cloning and association studies. DG in combination with whole genome resequencing is dramatically accelerating all aspects of genetic analysis of sorghum, an important genetic reference for C4 grass species.
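
The nested 4, 6 and 8 bp recognition sites mentioned in the abstract can be illustrated with a short sketch. It assumes the three enzymes are HpaII (CCGG), NgoMIV (GCCGGC) and FseI (GGCCGGCC); only NgoMIV is named in the abstract, so the other two enzymes, the function name and the toy sequence are assumptions made for illustration. Methylation sensitivity is ignored here.

```python
# Illustrative only: nested 4/6/8 bp GC-rich recognition sites.
# HpaII and FseI are assumed; the abstract names only NgoMIV.
SITES = {"HpaII": "CCGG", "NgoMIV": "GCCGGC", "FseI": "GGCCGGCC"}


def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping occurrences are also found.
        start = seq.find(site, start + 1)
    return positions


# Hypothetical toy sequence, not real sorghum DNA.
toy = "ATCCGGTTGCCGGCAAGGCCGGCCTT"
hits = {name: find_sites(toy, site) for name, site in SITES.items()}
```

Because the sites are nested, every 8 bp site contains a 6 bp site, and every 6 bp site contains a 4 bp site, so the shorter recognition sequence always yields at least as many cut sites as the longer one.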

Subject(s)

Genome, Plant , Genotyping Techniques/methods , Sorghum/genetics , DNA Restriction Enzymes , DNA, Plant/genetics , Genetic Markers , Genotype , Quantitative Trait Loci , Sequence Analysis, DNA/methods

2.

RESQUE: network reduction using semi-Markov random walk scores for efficient querying of biological networks.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

Bioinformatics ; 28(16): 2129-36, 2012 Aug 15.

Article in English | MEDLINE | ID: mdl-22730436

ABSTRACT

MOTIVATION: Recent technological advances in measuring molecular interactions have resulted in an increasing number of large-scale biological networks. Translation of these enormous network data into meaningful biological insights requires efficient computational techniques that can unearth the biological information that is encoded in the networks. One such example is network querying, which aims to identify subnetwork regions in a large target network that are similar to a given query network. Network querying tools can be used to identify novel biological pathways that are homologous to known pathways, thereby enabling knowledge transfer across different organisms. RESULTS: In this article, we introduce an efficient algorithm for querying large-scale biological networks, called RESQUE. The proposed algorithm adopts a semi-Markov random walk (SMRW) model to probabilistically estimate the correspondence scores between nodes that belong to different networks. The target network is iteratively reduced based on the estimated correspondence scores, which are also iteratively re-estimated to improve accuracy until the best matching subnetwork emerges. We demonstrate that the proposed network querying scheme is computationally efficient, can handle any network query with an arbitrary topology and yields accurate querying results. AVAILABILITY: The source code of RESQUE is freely available at http://www.ece.tamu.edu/~bjyoon/RESQUE/

Subject(s)

Algorithms , Computational Biology/methods , Protein Interaction Mapping/methods , Software , Animals , Drosophila melanogaster , Gene Regulatory Networks , Humans , Markov Chains , Metabolic Networks and Pathways , Saccharomyces cerevisiae

3.

PicXAA-Web: a web-based platform for non-progressive maximum expected accuracy alignment of multiple biological sequences.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

Nucleic Acids Res ; 39(Web Server issue): W8-12, 2011 Jul.

Article in English | MEDLINE | ID: mdl-21515632

ABSTRACT

In this article, we introduce PicXAA-Web, a web-based platform for accurate probabilistic alignment of multiple biological sequences. The core of PicXAA-Web consists of PicXAA, a multiple protein/DNA sequence alignment algorithm, and PicXAA-R, an extension of PicXAA for structural alignment of RNA sequences. Both PicXAA and PicXAA-R are probabilistic non-progressive alignment algorithms that aim to find the optimal alignment of multiple biological sequences by maximizing the expected accuracy. PicXAA and PicXAA-R greedily build up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures local similarities among sequences. PicXAA-Web integrates these two algorithms in a user-friendly web platform for accurate alignment and analysis of multiple protein, DNA and RNA sequences. PicXAA-Web can be freely accessed at http://gsp.tamu.edu/picxaa/.

Subject(s)

Sequence Alignment/methods , Software , Algorithms , Internet , Reproducibility of Results , Sequence Analysis, DNA , Sequence Analysis, Protein , Sequence Analysis, RNA

4.

PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

Nucleic Acids Res ; 38(15): 4917-28, 2010 Aug.

Article in English | MEDLINE | ID: mdl-20413579

ABSTRACT

Accurate tools for multiple sequence alignment (MSA) are essential for comparative studies of the function and structure of biological sequences. However, it is very challenging to develop a computationally efficient algorithm that can consistently predict accurate alignments for various types of sequence sets. In this article, we introduce PicXAA (Probabilistic Maximum Accuracy Alignment), a probabilistic non-progressive alignment algorithm that aims to find protein alignments with maximum expected accuracy. PicXAA greedily builds up the multiple alignment from sequence regions with high local similarities, thereby yielding an accurate global alignment that effectively captures the local similarities among sequences. Evaluations on several widely used benchmark sets show that PicXAA consistently yields accurate alignment results on a wide range of reference sets, with especially remarkable improvements over other leading algorithms on sequence sets with local similarities. PicXAA source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
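
The idea of maximizing expected accuracy can be sketched for the simpler pairwise case. The code below is not PicXAA's greedy multiple-alignment procedure; it is a standard dynamic program that, given a matrix of posterior match probabilities (assumed to be supplied by some probabilistic model), finds the pairwise alignment with the largest sum of posteriors over aligned pairs. The function name and the example matrix are hypothetical.

```python
def mea_pairwise(post):
    """Pairwise maximum expected accuracy alignment.

    post[i][j] is the posterior probability that residue i of sequence x
    is aligned to residue j of sequence y. Returns the expected number of
    correctly aligned pairs and the list of aligned (i, j) index pairs.
    """
    n = len(post)
    m = len(post[0]) if n else 0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + post[i - 1][j - 1],  # align i with j
                score[i - 1][j],                           # gap in y
                score[i][j - 1],                           # gap in x
            )
    # Trace back, preferring gaps so that only pairs that add score are kept.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1
        elif score[i][j] == score[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical posteriors for a 2-residue and a 3-residue sequence.
example_post = [[0.9, 0.1, 0.0],
                [0.1, 0.2, 0.7]]
expected_correct, aligned = mea_pairwise(example_post)
```

In this toy case the alignment pairs residue 0 with 0 and residue 1 with 2, which is the choice with the highest total posterior.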

Subject(s)

Algorithms , Sequence Alignment/methods , Sequence Analysis, Protein , Computational Biology , Probability

5.

Achieving robust somatic mutation detection with deep learning models derived from reference data sets of a cancer sample.

Sahraeian, Sayed Mohammad Ebrahim; Fang, Li Tai; Karagiannis, Konstantinos; Moos, Malcolm; Smith, Sean; Santana-Quintero, Luis; Xiao, Chunlin; Colgan, Michael; Hong, Huixiao; Mohiyuddin, Marghoob; Xiao, Wenming.

Genome Biol ; 23(1): 12, 2022 01 07.

Article in English | MEDLINE | ID: mdl-34996510

ABSTRACT

BACKGROUND: Accurate detection of somatic mutations is challenging but critical in understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. RESULTS: In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios; for example, a model trained on a combination of real and spike-in mutations had the highest average performance. CONCLUSIONS: The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions.

Subject(s)

Deep Learning , Neoplasms , Genomics , Humans , Mutation , Neoplasms/genetics , Neural Networks, Computer

6.

Curated variation benchmarks for challenging medically relevant autosomal genes.

Wagner, Justin; Olson, Nathan D; Harris, Lindsay; McDaniel, Jennifer; Cheng, Haoyu; Fungtammasan, Arkarachai; Hwang, Yih-Chii; Gupta, Richa; Wenger, Aaron M; Rowell, William J; Khan, Ziad M; Farek, Jesse; Zhu, Yiming; Pisupati, Aishwarya; Mahmoud, Medhat; Xiao, Chunlin; Yoo, Byunggil; Sahraeian, Sayed Mohammad Ebrahim; Miller, Danny E; Jáspez, David; Lorenzo-Salazar, José M; Muñoz-Barrera, Adrián; Rubio-Rodríguez, Luis A; Flores, Carlos; Narzisi, Giuseppe; Evani, Uday Shanker; Clarke, Wayne E; Lee, Joyce; Mason, Christopher E; Lincoln, Stephen E; Miga, Karen H; Ebbert, Mark T W; Shumate, Alaina; Li, Heng; Chin, Chen-Shan; Zook, Justin M; Sedlazeck, Fritz J.

Nat Biotechnol ; 40(5): 672-680, 2022 05.

Article in English | MEDLINE | ID: mdl-35132260

ABSTRACT

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.

Subject(s)

Genome, Human , Genome, Human/genetics , Haplotypes/genetics , Humans , Sequence Analysis, DNA

7.

Assessing reproducibility of inherited variants detected with short-read whole genome sequencing.

Pan, Bohu; Ren, Luyao; Onuchic, Vitor; Guan, Meijian; Kusko, Rebecca; Bruinsma, Steve; Trigg, Len; Scherer, Andreas; Ning, Baitang; Zhang, Chaoyang; Glidewell-Kenney, Christine; Xiao, Chunlin; Donaldson, Eric; Sedlazeck, Fritz J; Schroth, Gary; Yavas, Gokhan; Grunenwald, Haiying; Chen, Haodong; Meinholz, Heather; Meehan, Joe; Wang, Jing; Yang, Jingcheng; Foox, Jonathan; Shang, Jun; Miclaus, Kelci; Dong, Lianhua; Shi, Leming; Mohiyuddin, Marghoob; Pirooznia, Mehdi; Gong, Ping; Golshani, Rooz; Wolfinger, Russ; Lababidi, Samir; Sahraeian, Sayed Mohammad Ebrahim; Sherry, Steve; Han, Tao; Chen, Tao; Shi, Tieliu; Hou, Wanwan; Ge, Weigong; Zou, Wen; Guo, Wenjing; Bao, Wenjun; Xiao, Wenzhong; Fan, Xiaohui; Gondo, Yoichi; Yu, Ying; Zhao, Yongmei; Su, Zhenqiang; Liu, Zhichao.

Genome Biol ; 23(1): 2, 2022 01 03.

Article in English | MEDLINE | ID: mdl-34980216

ABSTRACT

BACKGROUND: Reproducible detection of inherited variants with whole genome sequencing (WGS) is vital for the implementation of precision medicine and is a complicated process in which each step affects variant call quality. Systematically assessing reproducibility of inherited variants with WGS and impact of each step in the process is needed for understanding and improving quality of inherited variants from WGS. RESULTS: To dissect the impact of factors involved in detection of inherited variants with WGS, we sequence triplicates of eight DNA samples representing two populations on three short-read sequencing platforms using three library kits in six labs and call variants with 56 combinations of aligners and callers. We find that bioinformatics pipelines (callers and aligners) have a larger impact on variant reproducibility than WGS platform or library preparation. Single-nucleotide variants (SNVs), particularly outside difficult-to-map regions, are more reproducible than small insertions and deletions (indels), which are least reproducible when > 5 bp. Increasing sequencing coverage improves indel reproducibility but has limited impact on SNVs above 30×. CONCLUSIONS: Our findings highlight sources of variability in variant detection and the need for improvement of bioinformatics pipelines in the era of precision medicine with WGS.
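
As a rough illustration of how reproducibility between two call sets might be quantified, the sketch below computes a Jaccard index over variant records. The abstract does not state which metric the study used, so the metric, the function name and the example variants are all assumptions.

```python
def jaccard(calls_a, calls_b):
    """Jaccard index of two variant call sets.

    Each call is a hashable record such as (chrom, pos, ref, alt).
    Returns 1.0 when both sets are empty (nothing disagrees).
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical replicate call sets: two shared variants, one private to each.
replicate_1 = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")]
replicate_2 = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")]
concordance = jaccard(replicate_1, replicate_2)
```

Computing the same index separately for SNVs and for indels of different lengths would be one way to expose the kind of difference the abstract reports, although the exact stratification used in the study is not given here.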

Subject(s)

Genome, Human , Polymorphism, Single Nucleotide , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Reproducibility of Results , Whole Genome Sequencing

8.

PicXAA-R: efficient structural alignment of multiple RNA sequences using a greedy approach.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

BMC Bioinformatics ; 12 Suppl 1: S38, 2011 Feb 15.

Article in English | MEDLINE | ID: mdl-21342569

ABSTRACT

BACKGROUND: Accurate and efficient structural alignment of non-coding RNAs (ncRNAs) has attracted increasing attention as recent studies have unveiled the significance of ncRNAs in living organisms. Because Sankoff-style structural alignment algorithms cannot efficiently handle multiple sequences, progressive schemes are mostly used to reduce the complexity. However, this idea tends to propagate early-stage errors throughout the entire process, thereby degrading the quality of the final alignment. For multiple protein sequence alignment, we have recently proposed PicXAA, which constructs an accurate alignment in a non-progressive fashion. RESULTS: Here, we propose PicXAA-R as an extension to PicXAA for greedy structural alignment of ncRNAs. PicXAA-R efficiently captures both folding information within each sequence and local similarities between sequences. It uses a set of probabilistic consistency transformations to improve the posterior base-pairing and base alignment probabilities using the information of all sequences in the alignment. Using a graph-based scheme, we greedily build up the structural alignment from sequence regions with high base-pairing and base alignment probabilities. CONCLUSIONS: Several experiments on datasets with different characteristics confirm that PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs and it consistently yields accurate alignment results, especially for datasets with locally similar sequences. PicXAA-R source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.

Subject(s)

Algorithms , Models, Statistical , RNA, Untranslated/chemistry , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Base Pairing , Computational Biology/methods

9.

Enhancing the accuracy of HMM-based conserved pathway prediction using global correspondence scores.

Qian, Xiaoning; Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

BMC Bioinformatics ; 12 Suppl 10: S6, 2011 Oct 18.

Article in English | MEDLINE | ID: mdl-22165903

ABSTRACT

BACKGROUND: Comparative network analysis aims to identify common subnetworks in biological networks. It can facilitate the prediction of conserved functional modules across different species and provide deep insights into their underlying regulatory mechanisms. Recently, it has been shown that hidden Markov models (HMMs) can provide a flexible and computationally efficient framework for modeling and comparing biological networks. RESULTS: In this work, we show that using global correspondence scores between molecules can improve the accuracy of the HMM-based network alignment results. The global correspondence scores are computed by performing a semi-Markov random walk on the networks to be compared. The resulting score naturally integrates the sequence similarity between molecules and the topological similarity between their molecular interactions, thereby providing a more effective measure for estimating the functional similarity between molecules. By incorporating the global correspondence scores, instead of relying on sequence similarity or functional annotation scores used by previous approaches, our HMM-based network alignment method can identify conserved subnetworks that are functionally more coherent. CONCLUSIONS: Performance analysis based on synthetic and microbial networks demonstrates that the proposed network alignment strategy significantly improves the robustness and specificity of the predicted alignment results, in terms of conserved functional similarity measured based on KEGG ortholog (KO) groups. These results clearly show that the HMM-based network alignment framework using global correspondence scores can effectively find conserved biological pathways and has the potential to be used for automatic functional annotation of biomolecules.

Subject(s)

Algorithms , Bacteria/metabolism , Markov Chains , Bacterial Proteins/metabolism , Models, Biological , Protein Interaction Maps , Sequence Alignment

10.

Hidden biases in germline structural variant detection.

Khayat, Michael M; Sahraeian, Sayed Mohammad Ebrahim; Zarate, Samantha; Carroll, Andrew; Hong, Huixiao; Pan, Bohu; Shi, Leming; Gibbs, Richard A; Mohiyuddin, Marghoob; Zheng, Yuanting; Sedlazeck, Fritz J.

Genome Biol ; 22(1): 347, 2021 12 20.

Article in English | MEDLINE | ID: mdl-34930391

ABSTRACT

BACKGROUND: Genomic structural variations (SV) are important determinants of genotypic and phenotypic changes in many organisms. However, the detection of SV from next-generation sequencing data remains challenging. RESULTS: In this study, DNA from a Chinese family quartet is sequenced at three different sequencing centers in triplicate. A total of 288 derivative data sets are generated utilizing different analysis pipelines and compared to identify sources of analytical variability. Mapping methods provide the major contribution to variability, followed by sequencing centers and replicates. Interestingly, SV supported by only one center or replicate often represent true positives with 47.02% and 45.44% overlapping the long-read SV call set, respectively. This is consistent with an overall higher false negative rate for SV calling in centers and replicates compared to mappers (15.72%). Finally, we observe that the SV calling variability also persists in a genotyping approach, indicating the impact of the underlying sequencing and preparation approaches. CONCLUSIONS: This study provides the first detailed insights into the sources of variability in SV identification from next-generation sequencing and highlights remaining challenges in SV calling for large cohorts. We further give recommendations on how to reduce SV calling variability and the choice of alignment methodology.

Subject(s)

Genomic Structural Variation , Genomics/methods , Germ Cells , High-Throughput Nucleotide Sequencing/methods , Base Sequence , Bias , Chromosome Mapping , Sequence Analysis, DNA

11.

A verified genomic reference sample for assessing performance of cancer panels detecting small variants of low allele frequency.

Jones, Wendell; Gong, Binsheng; Novoradovskaya, Natalia; Li, Dan; Kusko, Rebecca; Richmond, Todd A; Johann, Donald J; Bisgin, Halil; Sahraeian, Sayed Mohammad Ebrahim; Bushel, Pierre R; Pirooznia, Mehdi; Wilkins, Katherine; Chierici, Marco; Bao, Wenjun; Basehore, Lee Scott; Lucas, Anne Bergstrom; Burgess, Daniel; Butler, Daniel J; Cawley, Simon; Chang, Chia-Jung; Chen, Guangchun; Chen, Tao; Chen, Yun-Ching; Craig, Daniel J; Del Pozo, Angela; Foox, Jonathan; Francescatto, Margherita; Fu, Yutao; Furlanello, Cesare; Giorda, Kristina; Grist, Kira P; Guan, Meijian; Hao, Yingyi; Happe, Scott; Hariani, Gunjan; Haseley, Nathan; Jasper, Jeff; Jurman, Giuseppe; Kreil, David Philip; Labaj, Pawel; Lai, Kevin; Li, Jianying; Li, Quan-Zhen; Li, Yulong; Li, Zhiguang; Liu, Zhichao; López, Mario Solís; Miclaus, Kelci; Miller, Raymond; Mittal, Vinay K.

Genome Biol ; 22(1): 111, 2021 04 16.

Article in English | MEDLINE | ID: mdl-33863366

ABSTRACT

BACKGROUND: Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. RESULTS: In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency with more than 25,000 variants having less than 20% allele frequency with 1653 variants in COSMIC-related genes. This is 5-100× more than existing commercially available samples. We also identify an unprecedented number of negative positions in coding regions, allowing statistical rigor in assessing limit-of-detection, sensitivity, and precision. Over 300 loci are randomly selected and independently verified via droplet digital PCR with 100% concordance. Agilent normal reference Sample B can be admixed with Sample A to create new samples with a similar number of known variants at much lower allele frequency than what exists in Sample A natively, including known variants having allele frequency of 0.02%, a range suitable for assessing liquid biopsy panels. CONCLUSION: These new reference samples and their admixtures provide superior capability for performing oncopanel quality control, analytical accuracy, and validation for small to large oncopanels and liquid biopsy assays.
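
The effect of admixing Sample A with Sample B on allele frequency can be sketched with simple arithmetic. The sketch assumes that Sample B carries no variant allele at the site and that both samples contribute the same number of genome copies per unit of DNA; the function name and the mixing fraction below are hypothetical and are not the ratios used by the consortium.

```python
def diluted_allele_frequency(af_in_a, fraction_a):
    """Expected variant allele frequency after mixing Sample A into Sample B.

    af_in_a:    allele frequency of the variant in Sample A (0..1)
    fraction_a: fraction of the mixture's DNA that comes from Sample A (0..1)
    Assumes Sample B is homozygous reference at this position.
    """
    if not 0.0 <= af_in_a <= 1.0 or not 0.0 <= fraction_a <= 1.0:
        raise ValueError("both arguments must lie in [0, 1]")
    return af_in_a * fraction_a


# A variant at 1% in Sample A, with Sample A making up 2% of the mixture,
# is expected at 0.01 * 0.02 = 0.0002, that is 0.02%.
mixed_af = diluted_allele_frequency(0.01, 0.02)
```

Under these assumptions, a variant present at 1% in Sample A would need Sample A to make up about 2% of the mixture to reach the 0.02% level mentioned in the abstract.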

Subject(s)

Alleles , Biomarkers, Tumor , Gene Frequency , Genetic Testing/methods , Genetic Variation , Genomics/methods , Neoplasms/genetics , Cell Line, Tumor , DNA Copy Number Variations , Genetic Heterogeneity , Genetic Testing/standards , Genomics/standards , Humans , Neoplasms/diagnosis , Workflow

12.

Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing.

Fang, Li Tai; Zhu, Bin; Zhao, Yongmei; Chen, Wanqiu; Yang, Zhaowei; Kerrigan, Liz; Langenbach, Kurt; de Mars, Maryellen; Lu, Charles; Idler, Kenneth; Jacob, Howard; Zheng, Yuanting; Ren, Luyao; Yu, Ying; Jaeger, Erich; Schroth, Gary P; Abaan, Ogan D; Talsania, Keyur; Lack, Justin; Shen, Tsai-Wei; Chen, Zhong; Stanbouly, Seta; Tran, Bao; Shetty, Jyoti; Kriga, Yuliya; Meerzaman, Daoud; Nguyen, Cu; Petitjean, Virginie; Sultan, Marc; Cam, Margaret; Mehta, Monika; Hung, Tiffany; Peters, Eric; Kalamegham, Rasika; Sahraeian, Sayed Mohammad Ebrahim; Mohiyuddin, Marghoob; Guo, Yunfei; Yao, Lijing; Song, Lei; Lam, Hugo Y K; Drabek, Jiri; Vojta, Petr; Maestro, Roberta; Gasparotto, Daniela; Kõks, Sulev; Reimann, Ene; Scherer, Andreas; Nordlund, Jessica; Liljedahl, Ulrika; Jensen, Roderick V.

Nat Biotechnol ; 39(9): 1151-1160, 2021 09.

Article in English | MEDLINE | ID: mdl-34504347

ABSTRACT

The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor-normal genomic DNA (gDNA) samples derived from a breast cancer cell line (which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations) and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking 'tumor-only' or 'matched tumor-normal' analyses.

Subject(s)

Benchmarking , Breast Neoplasms/genetics , DNA Mutational Analysis/standards , High-Throughput Nucleotide Sequencing/standards , Whole Genome Sequencing/standards , Cell Line, Tumor , Datasets as Topic , Germ Cells , Humans , Mutation , Reference Standards , Reproducibility of Results

13.

Deep convolutional neural networks for accurate somatic mutation detection.

Sahraeian, Sayed Mohammad Ebrahim; Liu, Ruolin; Lau, Bayo; Podesta, Karl; Mohiyuddin, Marghoob; Lam, Hugo Y K.

Nat Commun ; 10(1): 1041, 2019 03 04.

Article in English | MEDLINE | ID: mdl-30833567

ABSTRACT

Accurate detection of somatic mutations is still a challenge in cancer analysis. Here we present NeuSomatic, the first convolutional neural network approach for somatic mutation detection, which significantly outperforms previous methods on different sequencing platforms, sequencing strategies, and tumor purities. NeuSomatic summarizes sequence alignments into small matrices and incorporates more than a hundred features to capture mutation signals effectively. It can be used universally as a stand-alone somatic mutation detection method or with an ensemble of existing methods to achieve the highest accuracy.
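
The phrase "summarizes sequence alignments into small matrices" can be illustrated with a much simplified sketch that counts bases per alignment column. This is not NeuSomatic's actual encoding, which the abstract does not describe in detail; the row order, the function name and the toy reads are assumptions.

```python
# Row order of the summary matrix; "-" stands for a gap in the alignment.
BASES = "ACGT-"


def count_matrix(aligned_reads):
    """Summarize equal-length aligned reads as a 5 x width count matrix.

    matrix[r][c] is the number of reads showing BASES[r] at column c.
    Raises ValueError if a read contains a symbol outside BASES.
    """
    if not aligned_reads:
        return [[] for _ in BASES]
    width = len(aligned_reads[0])
    matrix = [[0] * width for _ in BASES]
    for read in aligned_reads:
        if len(read) != width:
            raise ValueError("all reads must have the same aligned length")
        for col, base in enumerate(read):
            matrix[BASES.index(base)][col] += 1
    return matrix


# Hypothetical pileup of three aligned reads over four columns.
toy_reads = ["ACG-", "ACGT", "ATGT"]
summary = count_matrix(toy_reads)
```

A fixed-size matrix of this kind, built separately for tumor and normal reads, is the sort of input a convolutional network can consume, though the real method adds many further feature channels.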

Subject(s)

Computational Biology/methods , DNA Mutational Analysis/methods , Machine Learning , Mutation , Neural Networks, Computer , Computational Biology/instrumentation , DNA Mutational Analysis/instrumentation , Databases, Genetic , Diploidy , Exome , Genes, Neoplasm , Humans , Neoplasms/genetics , Sequence Alignment , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods

14.

Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis.

Sahraeian, Sayed Mohammad Ebrahim; Mohiyuddin, Marghoob; Sebra, Robert; Tilgner, Hagen; Afshar, Pegah T; Au, Kin Fai; Bani Asadi, Narges; Gerstein, Mark B; Wong, Wing Hung; Snyder, Michael P; Schadt, Eric; Lam, Hugo Y K.

Nat Commun ; 8(1): 59, 2017 07 05.

Article in English | MEDLINE | ID: mdl-28680106

ABSTRACT

RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome. RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.

Subject(s)

Embryonic Stem Cells , Transcriptome , Base Sequence , Cell Line , Humans

15.

PicXAA: a probabilistic scheme for finding the maximum expected accuracy alignment of multiple biological sequences.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

Methods Mol Biol ; 1079: 203-10, 2014.

Article in English | MEDLINE | ID: mdl-24170404

ABSTRACT

PicXAA is a probabilistic nonprogressive alignment algorithm that finds protein (or DNA) multiple sequence alignments with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities across sequences. PicXAA consistently yields accurate alignment results on a wide range of reference sets that have different characteristics, with especially remarkable improvements over other leading algorithms on sequence sets with high local similarities. In this chapter, we describe the overall alignment strategy used in PicXAA and discuss several important considerations for effective deployment of the algorithm.

Subject(s)

Algorithms , Computational Biology/methods , Sequence Alignment/methods , DNA/genetics , Probability , Proteins/chemistry , Quality Control

16.

SMETANA: accurate and scalable algorithm for probabilistic alignment of large-scale biological networks.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

PLoS One ; 8(7): e67995, 2013.

Article in English | MEDLINE | ID: mdl-23874484

ABSTRACT

In this paper we introduce an efficient algorithm for alignment of multiple large-scale biological networks. In this scheme, we first compute a probabilistic similarity measure between nodes that belong to different networks using a semi-Markov random walk model. The estimated probabilities are further enhanced by incorporating the local and the cross-species network similarity information through the use of two different types of probabilistic consistency transformations. The transformed alignment probabilities are used to predict the alignment of multiple networks based on a greedy approach. We demonstrate that the proposed algorithm, called SMETANA, outperforms many state-of-the-art network alignment techniques, in terms of computational efficiency, alignment accuracy, and scalability. Our experiments show that SMETANA can align tens of genome-scale networks with thousands of nodes on a personal computer without difficulty. The source code of SMETANA is available upon request and can also be downloaded from http://www.ece.tamu.edu/~bjyoon/SMETANA/.

Subject(s)

Algorithms , Computational Biology/methods , Probability , Protein Interaction Maps , Conserved Sequence , Databases, Protein , Sequence Homology, Amino Acid , Time Factors

17.

Genetic structure and linkage disequilibrium in a diverse, representative collection of the C4 model plant, Sorghum bicolor.

Wang, Yi-Hong; Upadhyaya, Hari D; Burrell, A Millie; Sahraeian, Sayed Mohammad Ebrahim; Klein, Robert R; Klein, Patricia E.

G3 (Bethesda) ; 3(5): 783-93, 2013 May 20.

Article in English | MEDLINE | ID: mdl-23704283

ABSTRACT

To facilitate the mapping of genes in sorghum [Sorghum bicolor (L.) Moench] underlying economically important traits, we analyzed the genetic structure and linkage disequilibrium in a sorghum mini-core collection of 242 landraces with 13,390 single-nucleotide polymorphisms. The single-nucleotide polymorphisms were produced using a highly multiplexed genotyping-by-sequencing methodology. Genetic structure was established using principal component, Neighbor-Joining phylogenetic, and Bayesian cluster analyses. These analyses indicated that the mini-core collection was structured along both geographic origin and sorghum race classification. Examples of the former were accessions from Southern Africa, East Asia, and Yemen. Examples of the latter were caudatums with widespread geographical distribution, durras from India, and guineas from West Africa. Race bicolor, the most primitive and the least clearly defined sorghum race, clustered among other races and formed only one clear bicolor-centric cluster. Genome-wide linkage disequilibrium analyses showed linkage disequilibrium decayed, on average, within 10-30 kb, whereas the short arm of SBI-06 contained a linkage disequilibrium block of 20.33 Mb, confirming a previous report of low recombination on this chromosome arm. Four smaller but equally significant linkage disequilibrium blocks of 3.5-35.5 kb were detected on chromosomes 1, 2, 9, and 10. We examined the genes encoded within each block to provide a first look at candidates such as homologs of GS3 and FT that may indicate a selective sweep during sorghum domestication.
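
The abstract does not state which linkage disequilibrium statistic was used. As an illustration only, the sketch below computes the widely used pairwise r² between two biallelic loci, assuming each accession is coded as a single haplotype-like allele (0 or 1) per locus, a simplification for largely inbred lines. The function name `r_squared` is hypothetical.

```python
def r_squared(locus_a, locus_b):
    """Pairwise LD (r^2) between two biallelic loci.

    locus_a, locus_b: equal-length lists of 0/1 alleles, one per accession.
    r^2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)), with D = pAB - pA * pB.
    """
    n = len(locus_a)
    if n == 0 or n != len(locus_b):
        raise ValueError("loci must be non-empty and of equal length")
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    # Frequency of accessions carrying allele 1 at both loci.
    p_ab = sum(1 for x, y in zip(locus_a, locus_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # at least one locus is monomorphic
    return d * d / denom


print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated loci
print(r_squared([0, 1, 0, 1], [0, 0, 1, 1]))  # independent loci
```

Plotting such r² values against the physical distance between marker pairs is one common way to visualize how quickly linkage disequilibrium decays.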

Subject(s)

Genetic Variation , Linkage Disequilibrium/genetics , Models, Biological , Sorghum/genetics , Carbon/metabolism , Chromosomes, Plant/genetics , Ecotype , Euchromatin/metabolism , Genes, Plant/genetics , Genotyping Techniques , Heterochromatin/metabolism , Phylogeny , Population Dynamics , Principal Component Analysis

18.

A network synthesis model for generating protein interaction network families.

Sahraeian, Sayed Mohammad Ebrahim; Yoon, Byung-Jun.

PLoS One ; 7(8): e41474, 2012.

Article in English | MEDLINE | ID: mdl-22912671

ABSTRACT

In this work, we introduce a novel network synthesis model that can generate families of evolutionarily related synthetic protein-protein interaction (PPI) networks. Given an ancestral network, the proposed model generates the network family according to a hypothetical phylogenetic tree, where the descendant networks are obtained through duplication and divergence of their ancestors, followed by network growth using network evolution models. We demonstrate that this network synthesis model can effectively create synthetic networks whose internal and cross-network properties closely resemble those of real PPI networks. The proposed model can serve as an effective framework for generating comprehensive benchmark datasets that can be used for reliable performance assessment of comparative network analysis algorithms. Using this model, we constructed a large-scale network alignment benchmark, called NAPAbench, and evaluated the performance of several representative network alignment algorithms. Our analysis clearly shows the relative performance of the leading network algorithms, with their respective advantages and disadvantages. The algorithm and source code of the network synthesis model and the network alignment benchmark NAPAbench are publicly available at http://www.ece.tamu.edu/bjyoon/NAPAbench/.
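
The duplication-and-divergence idea mentioned in this abstract can be sketched with a generic textbook-style step: copy a randomly chosen node, then keep each of its interactions with some probability. This is a minimal sketch under stated assumptions, not the authors' model; the function name `duplicate_and_diverge` and the `retain_prob` parameter are hypothetical.

```python
import random


def duplicate_and_diverge(adj, new_node, retain_prob, rng):
    """Add new_node as a diverged copy of a randomly chosen existing node.

    adj: dict mapping node name -> set of neighbor names (undirected graph).
    retain_prob: probability that each parental interaction is inherited.
    Returns the name of the node that was duplicated.
    """
    parent = rng.choice(sorted(adj))  # sorted for reproducibility
    adj[new_node] = set()
    for neighbor in sorted(adj[parent]):
        if rng.random() < retain_prob:
            # Inherit this interaction, keeping the graph symmetric.
            adj[new_node].add(neighbor)
            adj[neighbor].add(new_node)
    return parent


rng = random.Random(0)  # fixed seed so the run is repeatable
network = {"p1": {"p2", "p3"}, "p2": {"p1"}, "p3": {"p1"}}
parent = duplicate_and_diverge(network, "p4", retain_prob=0.5, rng=rng)
print(parent, sorted(network["p4"]))
```

Repeating such a step along the branches of a tree would give related networks that share ancestry, which is the general intuition behind a benchmark family of this kind.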

Subject(s)

Computational Biology/methods , Models, Statistical , Protein Interaction Maps , Algorithms , Animals , Benchmarking , Evolution, Molecular , Humans , Mice , Phylogeny , Sequence Homology, Amino Acid
